DIRECTIONS e-journal

Vassilis Hartzoulakis
ITEM ANALYSIS IN THE KPG EXAMS

The KPG exam system conforms to a number of norms that secure the highest possible test reliability and validity. Some of them have already been discussed in previous articles (e.g. oral examiner training, script rater training, evaluation criteria for the writing test, assessment criteria for the listening test etc.). This article will focus on the criteria for the selection of items in the reading and listening comprehension tests and the subsequent statistical analyses performed by the KPG exam team.

Designing and setting up a test for the KPG exam battery is a painstaking, complex process involving a number of stages. After the initial design of a task and its related test items, a small-scale pre-test is carried out; this provides feedback as to the appropriateness of the task in terms of content and language level. Items in a task, or even full tasks, that do not fit the test content specifications or the specifications for the test tasks, as defined by the KPG examination board, are either revised or dropped entirely.

This first round of checks is followed by subsequent rounds with associated revisions of tasks and items until the criteria for content and language level are fully met. Next, the selected items are piloted with a sample of test takers bearing more or less the same characteristics as the candidates expected in the actual exam. The results yielded go through a series of checks for the statistical characteristics of the items. These checks fall under the category of “classical item analysis” (called 'classical' because it is based on classical test theory) and include:

1. a check for the desired level of difficulty of the item (referred to as “p” index),

2. a check for its desired discrimination power (referred to as “d” index),

3. a series of checks for the performance of the distracters, and

4. a series of checks for the overall reliability of the test.

Depending on the findings of the item analysis, further adjustments are made to ensure the appropriate difficulty level for each item (and consequently for the relevant task) and the appropriate item discrimination power (in other words, the item’s power to distinguish between high achievers and low achievers). For criterion-referenced tests such as the KPG tests, we usually follow a “rule of thumb” of selecting items whose difficulty index (p-value) falls between .20 and .80 on a scale of 0.0-1.0 (Bachman, 2004). Items below .20 are considered “too difficult” (in practice, fewer than 20% of the test takers answered the item correctly), and items over .80 (more than 80% of the test takers answered correctly) are considered “too easy”.
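The difficulty check described above can be illustrated with a minimal sketch. The pilot data and the function names are hypothetical and serve only as an example; the .20 and .80 cut-offs are the rule of thumb quoted in the text.

```python
def difficulty_index(responses):
    """Return p: the proportion of test takers who answered correctly."""
    return sum(responses) / len(responses)


def classify_difficulty(p, low=0.20, high=0.80):
    """Apply the .20-.80 rule of thumb to a difficulty index."""
    if p < low:
        return "too difficult"
    if p > high:
        return "too easy"
    return "acceptable"


# Hypothetical pilot results: 1 = correct, 0 = incorrect, ten test takers per item.
pilot = {
    "item_1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # 9 of 10 correct
    "item_2": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],  # 6 of 10 correct
    "item_3": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # 1 of 10 correct
}

report = {
    name: classify_difficulty(difficulty_index(scores))
    for name, scores in pilot.items()
}
print(report)
```

In this invented data set, item_1 would be flagged as too easy and item_3 as too difficult, so both would be candidates for revision.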

When an item is found to be “too difficult” or “too easy”, it is revised. For example, in multiple choice items one of the distracters might be changed so that it becomes a more obviously wrong choice (making the item easier) or a more plausible one (making the item harder). The same applies to True/False questions, where a statement might be rephrased so that it becomes more, or less, obviously true or false. However, there are cases when an item, or sometimes a full task, is dropped completely and replaced by another one, which in turn goes through the same series of checks.

The internal consistency reliability of the test is maximised by including items that have large discrimination (“d”) indices. A common rule of thumb is to include items that have discrimination indices equal to or greater than .30 on a scale of 0.0-1.0.
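The article does not state which formula is used for the “d” index. One common definition in classical item analysis is the upper-lower group method, in which d is the difference between the proportion of correct answers in the highest-scoring and the lowest-scoring groups of test takers. The sketch below assumes that definition and the conventional 27% group size; the data and function name are hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index: p(upper group) - p(lower group).

    item_scores  -- 1/0 result on this item, one entry per test taker
    total_scores -- total test score of the same test takers, same order
    fraction     -- share of test takers in each group (27% is an assumption)
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank test takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data for ten test takers.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
strong_item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # mostly answered by high scorers
weak_item = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]    # answered equally by both groups

d_strong = discrimination_index(strong_item, totals)
d_weak = discrimination_index(weak_item, totals)
print(d_strong >= 0.30, d_weak >= 0.30)
```

Under this definition an item can also receive a negative value, which would mean that low achievers answered it correctly more often than high achievers; such an item would fall well below the .30 rule of thumb.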

Distracters are checked to see whether they performed as expected. Each distracter should be selected by at least a minimum number of test takers (Τσοπάνογλου, 2000). If a distracter is not chosen by any of the subjects (or by very few), then it did not ‘distract’ anyone and is therefore replaced with another one. The same applies to distracters that were chosen by too many test takers. In that case the correct answer was not obvious enough (depending on the language level tested), and either the malfunctioning distracter is changed into something more obviously wrong or the correct choice is changed into something more obviously correct.
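A distracter check of this kind might be sketched as follows. The thresholds are assumptions made for illustration, since the text does not quote them: a distracter is flagged as rarely chosen when fewer than 5% of test takers select it, and as chosen too often when it attracts at least as many test takers as the correct answer. The option labels and data are hypothetical.

```python
from collections import Counter


def distracter_report(choices, key, options=("A", "B", "C", "D"), min_share=0.05):
    """Flag distracters that attract too few or too many test takers.

    choices   -- the option selected by each test taker
    key       -- the correct option
    min_share -- assumed minimum proportion of test takers per distracter
    """
    counts = Counter(choices)
    n = len(choices)
    report = {}
    for option in options:
        if option == key:
            continue  # only the wrong options are distracters
        if counts[option] / n < min_share:
            report[option] = "rarely chosen"
        elif counts[option] >= counts[key]:
            report[option] = "chosen too often"
        else:
            report[option] = "ok"
    return report


# Hypothetical item answered by 20 test takers; the key is "B".
choices = ["A"] * 9 + ["B"] * 7 + ["C"] * 4
print(distracter_report(choices, key="B"))
```

Here option A draws more test takers than the key and option D draws none, so under these assumed thresholds both would be reviewed.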

When this process is over, the items are ready to be included in the final test. Before administering any test, the expected level of difficulty of the test as a whole is checked, and if it does not match our target, other items are selected, making sure that these fit in terms of their content and level of difficulty.

Classical item analysis has its own limitations. One is that score statistics obtained from pre-tests depend on the sample of test takers who take the test. This means that if we administer the items to different groups of test takers, the items may have different statistical characteristics. This is also true of the administration of the test to the actual candidates, who form a much larger population with characteristics that vary from period to period. This is why, for every KPG test administration, the KPG team performs post-test item analyses to get feedback on the level of difficulty and the internal consistency of the test. These analyses help improve the reliability of the test and diagnose why items failed to function appropriately. Due to the acknowledged limitations of classical item analysis, there are invariably fluctuations in the overall difficulty level of the administered test and in its reliability index.

If we look at Chart 1, which shows the timeline of the difficulty index averages for module 1 (reading comprehension) for all levels, we can see that all administered tests fell, on average, within a range of .50 to .75. Even though within the tests themselves one might find items falling outside this band, this overall average conforms to the target difficulty index for all levels and all languages in the KPG examination system, which has been set to fall within the range of p = 0.55 to p = 0.80.

Chart 1: Difficulty Index (p) for all levels

Chart 2 shows the average discrimination index for the items in module 1 for all periods in which each level was administered. The trend lines show a clear upward tendency and converge on an average discrimination index of around .50, which is considered quite high and adds to the overall reliability of the test system.

Chart 2: Discrimination Index (d) for all levels

The reliability of the test itself is measured through Cronbach’s alpha (α) and refers to the extent to which the test is likely to produce consistent scores. High reliability means that students who answered a given question correctly were more likely to answer other questions correctly. Bachman (2004) argues that high reliability should be demanded in situations in which a single test score is used to make major decisions. One should aim at reliability indices over .80, especially in large-scale exams. Chart 3 gives an indication of the indices achieved in the reading comprehension module of the examination for the B2 level, which is the one administered for the longest time. All figures lie above .80, averaging .89, an index that reflects a highly reliable test.
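For readers who wish to see how such a reliability index is obtained, the standard formula for Cronbach’s alpha is α = k/(k-1) × (1 - Σ item variances / variance of total scores), where k is the number of items. The sketch below applies it to a small, invented score matrix; the data and function name are hypothetical, and population variances are used throughout as a simplifying assumption.

```python
from statistics import pvariance


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a matrix with one row per test taker
    and one column per item (1 = correct, 0 = incorrect)."""
    k = len(score_matrix[0])  # number of items
    item_variances = [
        pvariance([row[j] for row in score_matrix]) for j in range(k)
    ]
    total_variance = pvariance([sum(row) for row in score_matrix])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical results of five test takers on four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

For this invented matrix the item variances sum to 0.80 and the variance of the total scores is 2, so alpha comes out at approximately .80, which is the threshold mentioned above.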

Chart 3: B2 level, Module 1, Reliability Index

All in all, checks and statistical analyses are performed both before and after any test is administered. Pre-testing ensures that the test conforms to the prerequisites in terms of content and language level, and post-testing provides us with feedback on how well (or badly) items functioned. This feedback is invaluable, as it helps the KPG team shape the test in terms of level of difficulty and power of discrimination, starting from the item level and moving through the task level, until the whole test displays the characteristics that the KPG examination board has set.

References

Bachman, L. F. (2004). Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.

Τσοπάνογλου, Α. (2000). Μεθοδολογία της Επιστημονικής Έρευνας και Εφαρμογές της στην Αξιολόγηση της Γλωσσικής Κατάρτισης. Θεσσαλονίκη: Εκδόσεις Ζήτη.
