Vassilis Hartzoulakis


This paper examines the extent to which KPG English script raters agree on their marking decisions and reports on a study investigating the effectiveness of the instruments used in the rating process, as designed by the KPG English team of experts. The data for the study were the marks given by different raters to A1+A2, B1, B2 and C1 level scripts of candidates who took the KPG exam in English between 2005 and 2008. The data were obtained at the Athens Script Rating Centre and analyzed in terms of reliability in scoring decisions. The analysis of the data of each examination period shows that there are high correlations in marks given by different raters; that is, the inter-rater reliability index is consistently above r = .50. The investigation also indicates that the degree of correlation depends on the level of the scripts being marked: scores for B1 level scripts are in greater agreement than those for B2 level scripts, and so on. On the whole, however, the investigation shows that raters apply the evaluation criteria and use the rating grid effectively, demonstrating statistically significant high correlations across different exam periods.

Keywords: Testing; Script rating; Validity; Inter-rater reliability; Rater agreement; Correlation

1. Introduction

In classical test theory, the rating or observed score of a writing sample (henceforth script) is an estimate of the true proficiency (or true score) that is exhibited in the sample (Penny, Johnson, & Gordon, 2000). Any differences between the observed score and the true score are attributed to what is generally referred to as measurement error of which there are many types. In all measurement procedures there is the potential of error, so the aim is to minimize it. An observed test score is made up of the true score plus measurement error. The goal of all assessment agencies is to produce an accurate estimate of the true score. Several measures are usually taken in that direction. One is the employment of multiple raters to produce multiple ratings of the same script and then combine those ratings into a single observed score. The averaging process is expected to produce an observed score that is closer to the true score by the cancellation of random measurement error (Penny et al., 2000). Moreover, it is expected that the training of the raters together with their experience and expertise, will reduce the measurement error and result in better agreement on the level of proficiency that is exhibited in a given script. This agreement between raters is known as inter-rater reliability or rater agreement, and it is one of the estimates for the overall reliability of a test. In order to be valid, a test must be reliable; however, reliability does not guarantee validity. Reliability is the extent to which a test yields consistent scores when it is repeated.
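The decomposition described above can be written as X = T + E, where X is the observed score, T the true score and E the measurement error. The following is a minimal simulation sketch of why averaging two independent ratings is expected to bring the observed score closer to the true score. All values in it are assumptions made for illustration (normally distributed rater error with a standard deviation of 1.5 marks, true scores spread uniformly over a 0-15 scale); none of them comes from the KPG data.

```python
import random
import statistics


def simulate(n_scripts=10000, error_sd=1.5, seed=0):
    """Compare the error of a single rating with the error of a two-rater mean.

    error_sd and the uniform spread of true scores are hypothetical values.
    """
    rng = random.Random(seed)
    single_errors = []
    mean_errors = []
    for _ in range(n_scripts):
        true_score = rng.uniform(0, 15)                 # T
        rating_1 = true_score + rng.gauss(0, error_sd)  # X1 = T + E1
        rating_2 = true_score + rng.gauss(0, error_sd)  # X2 = T + E2
        single_errors.append(rating_1 - true_score)
        mean_errors.append((rating_1 + rating_2) / 2 - true_score)
    # Spread of the error for one rating and for the averaged rating
    return statistics.pstdev(single_errors), statistics.pstdev(mean_errors)


single_sd, mean_sd = simulate()
print(f"error SD, single rating:  {single_sd:.2f}")
print(f"error SD, mean of two:    {mean_sd:.2f}")
```

Under these assumptions the error spread of the averaged score is roughly 1/√2 of that of a single rating, which is the cancellation of random error referred to above.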

There are a number of different methods for checking the reliability of a test (Professional Testing, 2006; Trochim, 2000; Alderson et al., 1995). The traditional way, according to Alderson et al. (1995), is to administer the test to the same group of people at least twice and correlate the first set of scores with the second set of scores. This is known as test-retest reliability. Another way of assessing test reliability is alternate (or parallel) forms. This method involves administering a test to a group and then administering an alternate form of the same test to the same group. Correlation between the two scores is the estimate of the test reliability. A third method is to calculate the split-half reliability index. In this method, a test is split into two halves which are then treated as being parallel versions, and reliability is measured as the correlation between them. A more complex method is internal (or inter-item) consistency. This method is based on item-level data, and it computes inter-item correlations as Cronbach's alpha between 0 (low) and 1 (high); the greater the number of similar items, the greater the internal consistency. Finally, a method that checks both the reliability of the test and the reliability of the script raters is inter-rater reliability. This method refers to the correlation of scores given by different raters on the same script. 
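As an illustration of the internal consistency method mentioned above, the sketch below computes Cronbach's alpha from item-level data using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The item scores are invented for the example and the function name is hypothetical; the module examined in this paper has no items, so this is shown only to clarify the method.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items.

    item_scores: list of k lists, each holding one score per examinee
    (all lists in the same examinee order).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    # Total score of each examinee across all items
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(statistics.variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / statistics.variance(totals))


# Hypothetical scores of five examinees on three items
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha: {alpha:.2f}")
```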

This paper investigates rater agreement in Module 2 (free written production and written mediation) of the Greek State Certificate of Language Proficiency (KPG) exams in English. We focus on the factors affecting the rating process in the specific situation and examine the extent to which these factors contribute to higher inter-rater reliability. This index was selected as the others presented quite a few problems in our particular situation. Administrative restrictions do not allow the administration of the same KPG test twice, let alone the impracticality of asking the examinees to sit for a second identical test immediately after finishing the first one. Similarly, the administration of two different versions of the test for the same group of candidates in the same period is also prohibited. The written production part of the KPG exam in English requires the examinees to produce two different scripts based on different stimuli; thus, the split-half reliability index cannot be used either, as the test cannot be split into two equal halves. Therefore, test-retest, alternate forms and split-half reliability measures cannot be checked and inter-item consistency is not applicable since there are no items in this module. This leaves us with inter-rater reliability as the only method of checking test reliability and rater agreement for the specific exams.

An analysis of rater agreement is presented for the B2 and C1 level examinations in the years 2005, 2006, 2007 and 2008, for the B1 level examinations in the May 2007,  November 2007 and May 2008 periods, and for the intergraded A level of the May 2008 examination period. For the purposes of the present study, a number of randomly selected papers amounting to at least 40% of the total number for each level and each period were selected as a representative sample. This secures at least 500 papers for each analysis, which is adequate for a reliable measurement.

The analyses for the 2007 and 2008 exam periods were conducted on the whole population, that is, on all KPG exam candidates. The inter-rater reliability index was then computed separately for each of the two tasks that comprise the written part of the exam for all levels.

2. Aims of the study

The on-going study on inter-rater reliability in the KPG script rating process is carried out as a means of investigating the effectiveness of the instruments subservient to the process employed by the KPG test developers. These instruments are: a) the rating grid along with the assessment criteria, b) the script raters training material and training seminars and c) the on-the-spot consultancy to the raters by KPG experts and test developers. The training material and training seminars are each time especially designed for every single period based on the given tasks, resulting in specific instructions as to how each different writing task should be rated. The same applies for the consultancy provided to the script raters during the actual process of rating, which adds to the homogeneity of the rating grid interpretation.  The abovementioned instruments have been designed with the aim of achieving the highest possible rater agreement, which is part and parcel of the overall reliability for any test.

3. Approaches to rating scripts

Expressing thoughts in written form is ‘probably the most complex constructive act that most human beings are ever expected to perform’ (Bereiter and Scardamalia, 1983: 20 cited in Gamaroff, 2000). The complexity of the act makes the objective assessment of performance very difficult. The way a reader/script rater understands a written text especially in essays or even short compositions where inferential judgements have to be made varies depending on factors that have to do with the individual’s global comprehension of a passage, his or her inferential ability, and his or her interpretation of meaning of words in each context. Hamp-Lyons (1990) argues that the reliability of rating scripts heavily depends on the attitudes and conceptions of the rater. Therefore, problems arise in evaluating objectively when inferential judgements have to be converted to a score. It is true that one can have a largely objective scoring system when scores are primarily based on correct structural forms, as is the case with numerous language exams. However, this is not applicable in the KPG system as it does not focus on measuring correct syntactico-grammatical forms only, but on measuring ‘candidates’ abilities and skills to make socially purposeful use of the target language at home and abroad’ (RCeL, 2007). This does not simply entail correct syntax and grammar in the produced written texts, but also entails making appropriate language choices by taking into consideration the communicative and social context within which the produced text appears.

The question that comes up then is how can writing tasks be converted to numbers that will yield meaningful variance between learners? It has been suggested (Oller, 1979) that inferential judgements should be based on intended meaning and not merely on correct structural forms. Gamaroff (2000) suggests that when rating written texts, it is preferable for the rater to ‘rewrite’ the intended meaning in his or her mind and then decide on a mark. Still, even then, one cannot secure reliability and objectivity in rating as there are different conceptions of whether or not meaning has been conveyed appropriately.

Another approach to rating written texts is correcting syntactic-grammatical and lexical mistakes and accordingly subtracting points for every one depending on its seriousness. This approach has been heavily criticised as each rater has his or her own standards regarding what is grammatically correct or not, let alone the concept of the seriousness of mistakes, which also varies in the mind of every rater.

The above factors led researchers to construct rating grids to aid the rater in converting the text’s qualitative characteristics into quantitative ones. Rating grids have been found to contribute to the decrease in subjectivity when it comes to rating written texts, although they cannot secure absolute objectivity (Tsopanoglou, 2000). Evidently rating grids should be explicit and concise. They should be explicit so that raters will interpret them homogeneously; and they should be concise so that they are practical to use. When rating grids are properly employed by trained raters, the rating procedure will most likely display consistency among raters.

4. KPG writing, script rater training and the rating procedure

Module 2 of the KPG exam requires candidates to produce two texts of various lengths that range from 100 to 300 words each, depending on the exam level. The first text is produced based on stimulus given in English, whereas the second one is produced based on stimulus given in Greek. In the latter case, candidates have to act as mediators selecting information from the Greek source and transferring it to English producing scripts of similar or even completely different formats. Each script is rated twice, by two different script raters. The script raters rate each of the two tasks on a 0-15 scale, employing the rating grid and assessment criteria (Appendix) set by the test developers and without signalling anything on the papers themselves. They then mask their marks and names and return the papers to the examination secretariat. The rated papers are then randomly redistributed to the same pool of raters, taking care that no paper is given to the same rater that had initially marked it.

4.1 Training KPG script raters

Before the rating procedure for every examination period begins, the KPG English team prepares a training seminar for all script raters where the candidates’ performance expectations for the specific test are presented and discussed. The performance expectations are individualised for every single test and are determined in the piloting phase of the test prior to its administration and in pilot-evaluating sessions held after its administration. During the training seminar, raters have the chance to go through the rating grid in conjunction with the performance expectations and rate sample papers by applying the criteria that have been set for every different task. This ensures the adoption of a common approach towards rating and helps in establishing consistency in the given marks. This kind of training is an on-going process, as during the rating procedure itself, each script rater is assigned to a supervisor (an experienced and specially trained member of the KPG staff). The supervisor constantly offers support by discussing fine points and offering his or her opinion in cases where the rater is uncertain about the proper application of the criteria for assessing the paper.

4.2 Rating procedure

The script raters rate the two texts produced by each candidate on a scale of 0 to 15 for each text and on the basis of the rating grid previously discussed. Candidates’ papers are grouped in packs of 50 and are rated by two raters randomly selected from a pool of about 150. After the first rater has rated the 50 papers in a pack and the given marks are masked, the pack is passed to a second rater, who also gives his or her own marks. The final mark given to each candidate is the mean score between the two raters. It is interesting to note that the raters do not signal mistakes (either stylistic or structural) on the papers; therefore the second rater does not see any notes made on the paper by the first rater and is left completely uninfluenced, a fact that ensures the maximum possible objectivity. Bachman (2004) argues that it is essential that the second ratings be independent of the first and if written essays are scored, no identifying information and no marks should be made on the paper during the first rating. On the other hand, this procedure runs a bigger risk of inconsistencies in the way the two raters rate the responses resulting in measurement errors. Still, this issue is resolved by training the raters as meticulously as possible on how to apply the rating grid in combination with the candidates’ expected outcomes for every single exam, then fine-tuning and re-evaluating the procedure by constantly estimating the consistency across raters, or in other words, the inter-rater reliability of scores.
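The procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rater identifiers, the function names and the use of simple random sampling are hypothetical, and the sketch does not reproduce the actual software or logistics of the rating centre. It shows two points from the text: each pack goes to two different raters drawn at random from the pool, and the final mark for a text is the mean of the two ratings on the 0-15 scale.

```python
import random


def assign_raters(pack_ids, rater_pool, seed=None):
    """Draw two different raters at random for every pack of papers."""
    rng = random.Random(seed)
    # sample() draws without replacement, so the two raters always differ
    return {pack: tuple(rng.sample(rater_pool, 2)) for pack in pack_ids}


def final_mark(first_rating, second_rating):
    """Final mark for one text: the mean of the two independent ratings."""
    for rating in (first_rating, second_rating):
        if not 0 <= rating <= 15:
            raise ValueError("ratings must lie on the 0-15 scale")
    return (first_rating + second_rating) / 2


# Hypothetical pool of 150 raters and five packs of papers
raters = [f"rater_{i:03d}" for i in range(150)]
assignment = assign_raters(range(1, 6), raters, seed=1)
for pack, (first, second) in assignment.items():
    print(pack, first, second)

print(final_mark(10, 13))  # 11.5
```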

5. Approaches to inter-rater reliability

Inter-rater reliability is the widely used term for the extent to which independent raters evaluate data and reach the same conclusion (Lombard et al., 2005). It is part of the overall analysis for rater agreement, which is concerned with reconciling the raters’ subjectivity and the objective precision of the mark. Inter-rater reliability investigation is particularly important in ‘subjective’ tests such as essay tests, where fluctuations in judgements exist between different raters (Gamaroff, 2000). However, agreement among raters is extremely important not only in academic domains but in every domain where more than one judge rates performance. Such domains include areas as far apart from each other as gymnastics or figure skating in the Olympic Games, medical diagnoses, jurors’ judgements in criminal trials and test-eaters’ judgements on the chef’s performance when rating restaurants (Von Eye & Mun, 2005).

Inter-rater reliability studies in education mostly focus on the consistency of given marks in order to establish the extent of consensus on the use of the instrument (rating grid) by those who administer it. In such cases, it is vital that all raters apply the criteria on the rating grid in exactly the same way, resulting in a homogeneous rating approach. This, in turn, is one of the criteria that comprise a reliable testing system. Tinsley and Weiss (2000) prefer the term ‘inter‑rater (or inter-coder) agreement’ as they note that although inter-rater reliability assesses how far ‘ratings of different judges are the same when expressed as deviations from their means,’ inter‑rater agreement is needed because it measures ‘the extent to which the different judges tend to assign exactly the same rating to each object’ (p. 98). However, here, the term inter-rater reliability will be used in its widely accepted sense, as a correlation between the two sets of ratings (Bachman, 2004).

Statisticians have not reached a consensus on one universally accepted index of inter-rater reliability and depending on the type of data and the purpose of the study, different indices have been suggested (Penny et al., 2000; Alderson et al., 1995). Among those are: 1) percent exact agreement, 2) percent adjacent agreement, 3) Pearson product-moment correlation coefficient, 4) the phi index of dependability from the Generalizability Theory, and 5) the Intra-class Correlation Coefficient (ICC). Percent exact and adjacent agreement indices are computed just as one would expect from the names. That is, percent exact agreement is the percentage of times that two raters agree exactly on the score given to a performance, and percent adjacent agreement is the percentage of times that two raters agree to within one unit on the score given to a performance. For example, using a four-point integer scale, if one rater assigns a score of 2 and the second rater assigns a score of 3, then the ratings are not in exact agreement, but they are in adjacent agreement.
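The two percentage indices defined above are simple enough to sketch directly. The function name and the sample ratings below are hypothetical; adjacent agreement is computed here as agreement to within one unit, which therefore includes exact matches.

```python
def percent_agreement(ratings_a, ratings_b, tolerance=0):
    """Percentage of scripts on which two raters differ by at most `tolerance`.

    tolerance=0 gives percent exact agreement,
    tolerance=1 gives percent adjacent agreement (within one unit).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of scripts")
    hits = sum(1 for a, b in zip(ratings_a, ratings_b) if abs(a - b) <= tolerance)
    return 100.0 * hits / len(ratings_a)


# Hypothetical ratings of five scripts on a four-point integer scale
rater_1 = [2, 3, 1, 4, 2]
rater_2 = [3, 3, 1, 2, 2]

exact = percent_agreement(rater_1, rater_2)                 # 3 of 5 scripts
adjacent = percent_agreement(rater_1, rater_2, tolerance=1)  # 4 of 5 scripts
print(exact, adjacent)  # 60.0 80.0
```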

Perhaps the most popular statistic for calculating the degree of consistency between judges is the Pearson correlation coefficient or ‘Pearson’s r’ (Stemler, 2004). It is a convenient index as it can be computed by hand or by using most statistical software packages. If the rating scale is continuous, Pearson's r can be used to measure the correlation among pairs of raters. If the rating scale is ordinal, Spearman’s rho (ρ) is used instead. However, in both cases, the magnitude of the differences between raters is not taken into account. Shrout and Fleiss (1979) demonstrate this drawback with an example: if Judge A assigned the scores 9, 10, 5, 15 to four scripts and Judge B assigned 7, 8, 3, 13 to the same scripts (the difference is consistently kept at -2 points for all four scripts), then using Pearson's method, the correlation coefficient would be 1.00, indicating perfect correlation, although the two judges do not agree on the score of any script. Instead of Pearson’s r, Shrout and Fleiss (1979) suggest calculating the ICC as another way of performing reliability testing. The ICC is an improvement over Pearson's r as it takes into account the differences in ratings for individual segments, along with the correlation between raters. In the example above, the ICC is .94, a measurement which is more representative of the case. All in all, the ICC should be used to measure the inter-rater reliability for two or more raters, especially if we are interested in using a team of raters and want to establish that they yield consistent results.
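The example of the two judges can be reproduced with a short sketch. The function names are hypothetical, and the ICC shown is the one-way random effects form computed from ANOVA mean squares, which is an assumption about the variant behind the figure quoted above; under that assumption the average-measures value comes out at roughly .94, while Pearson's r is 1.00.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_one_way(rows):
    """One-way random effects ICC.

    rows: one sequence of k ratings per script.
    Returns (single-measure ICC, average-measures ICC).
    """
    n, k = len(rows), len(rows[0])
    grand_mean = sum(sum(row) for row in rows) / (n * k)
    row_means = [sum(row) / k for row in rows]
    # Between-scripts and within-scripts mean squares
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    wms = sum((v - m) ** 2 for row, m in zip(rows, row_means) for v in row) / (n * (k - 1))
    single = (bms - wms) / (bms + (k - 1) * wms)
    average = (bms - wms) / bms
    return single, average


judge_a = [9, 10, 5, 15]
judge_b = [7, 8, 3, 13]

r = pearson_r(judge_a, judge_b)
icc_single, icc_average = icc_one_way(list(zip(judge_a, judge_b)))
print(f"Pearson r = {r:.2f}")
print(f"ICC, single measure = {icc_single:.2f}; ICC, average measures = {icc_average:.2f}")
```

The constant two-point gap between the judges leaves Pearson's r untouched but enters the within-script mean square, which is why the ICC falls below 1.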

The KPG system employs several judges who randomly form pairs. This, along with the fact that there are no signals or notes on the candidate’s paper after the first rating, leads to handling the raters as absolutely equal variables. That is, within any pair of ratings, there is no reason to identify one as 'first' and the other 'second'; if some or all of them are labelled the other way round the calculated correlation presumably would not change.  According to Shrout and Fleiss (1979), there are numerous versions of the intra-class correlation coefficient (ICC) that can give quite different results when applied to the same data. Therefore, careful consideration of the data layout and the ICC version is required if one is to come up with a valid index. When computing the ICC, the data should be laid out as n cases or rows, (each row corresponds to each script) and k variables or columns, for the different measurements (first and second rating) of the cases (Wuensch, 2007); in our model, there are two different measurements/ratings for every script. The cases are assumed to be a random sample from a larger population, and the ICC estimates are based on mean squares obtained by applying analysis of variance (ANOVA) models to these data. ICC varies depending on whether the judges in the study are all the judges of interest or are a random sample of possible judges, whether all targets are rated or only a random sample, and whether reliability is to be measured based on individual ratings or mean ratings of all judges (Shrout & Fleiss, 1979). When the judges/raters are conceived as being a random selection of possible raters/judges, then a one-way ANOVA is employed. That is, in this model judges are treated as a random sample, and the focus of interest is a one-way ANOVA testing if there is a subject/target effect (Garson, 1998).

Intra-class correlations in general are considered to be measures of reliability or measures of the magnitude of an effect, but they are equally important when it comes to calculating the correlations between pairs of observations that do not have an obvious order (Maclennan, 1993). The intra-class correlation coefficient can easily be computed in SPSS and other statistics software packages. SPSS offers five possible sets of output for the ICC estimates; of those, the one most appropriate for computing the ICC in the KPG examination system is the one-way random effects model with the reliability estimate for average measures, that is, for the mean of the ratings. In this model, judges/raters are conceived as being a random selection of possible raters/judges, who rate all targets of interest. Even though in this study not all targets of interest are measured, we still need to select the one-way random effects model because this model applies even when a given rating (e.g. the first rating) for one subject might be by a different judge than the first rating for another subject, etc. This in turn means there is no way to separate out a judge/rater effect (Garson, 1998).
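For reference, the one-way random effects estimates can be written in terms of the ANOVA mean squares, following the formulation of Shrout and Fleiss (1979). The notation is an assumption introduced here for illustration: x_ij is the j-th rating of script i, n is the number of scripts, k the number of ratings per script (k = 2 in the KPG procedure), BMS the between-scripts mean square and WMS the within-scripts mean square. ICC(1,1) is the single-measure estimate and ICC(1,k) the average-measures estimate.

```latex
\begin{aligned}
\mathrm{BMS} &= \frac{k \sum_{i=1}^{n} (\bar{x}_{i} - \bar{x})^{2}}{n - 1}, &
\mathrm{WMS} &= \frac{\sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - \bar{x}_{i})^{2}}{n(k - 1)}, \\[4pt]
\mathrm{ICC}(1,1) &= \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (k - 1)\,\mathrm{WMS}}, &
\mathrm{ICC}(1,k) &= \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS}}.
\end{aligned}
```

Because neither mean square depends on which rating in a pair is labelled first or second, the estimate is unaffected by the order of the ratings within a pair.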

The ICC can take any value between 0,00 (which signifies no correlation) and 1,00 (when there is no variance within targets). Statisticians give different interpretations of ICC values, but two of the most widely accepted interpretations are those of Fleiss (1981) and Landis and Koch (1977) presented in Tables 1 and 2, respectively.


ICC value              Interpretation
r < 0,40               poor agreement
0,40 ≤ r ≤ 0,75        good agreement
r > 0,75               excellent agreement

Table 1. ICC interpretation according to Fleiss (1981)

ICC value              Interpretation
r < 0,00               poor agreement
0,00 ≤ r ≤ 0,20        slight
0,21 ≤ r ≤ 0,40        fair
0,41 ≤ r ≤ 0,59        moderate
0,60 ≤ r ≤ 0,79        substantial
0,80 ≤ r ≤ 1,00        almost perfect

Table 2. ICC interpretation according to Landis & Koch (1977)

Based on the information in the tables above, we can assume that a value of 0,60 and above can be considered to represent a satisfactory intra-class correlation, implying a satisfactory level of rater agreement.
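The interpretation bands of Table 1 and the cut-off adopted here can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and values are written with decimal points as Python requires.

```python
def fleiss_label(icc):
    """Interpretation of an ICC value following the Fleiss (1981) bands of Table 1."""
    if icc < 0.40:
        return "poor agreement"
    if icc <= 0.75:
        return "good agreement"
    return "excellent agreement"


def is_satisfactory(icc, cutoff=0.60):
    """True when the ICC reaches the cut-off of 0,60 adopted in this study."""
    return icc >= cutoff


for value in (0.35, 0.60, 0.75, 0.82):
    print(value, fleiss_label(value), is_satisfactory(value))
```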

6. Findings

Data from the examination periods in the years 2005-2008 were gathered and analysed using SPSS. All correlations are statistically significant (p < .05), and Tukey's test of non-additivity showed that there is no multiplicative interaction between the cases and the items.

[Table 3: cell values not preserved. Panels: Free Writing Production; Mediation. Columns: May 2005, November 2005, May 2006, November 2006, May 2007, November 2007, May 2008.]

Table 3. ICC measurements for all levels and periods

Table 3 above shows that with the exception of C1 level in the November 2006 period, the ICC for both tasks (free writing production and mediation), for all levels and periods, is either well above or slightly below the cut-off score of 0,60 that we set as representative of a satisfactory agreement. The ICC for B1 level is slightly higher than that for the B2 level, which in turn is higher than that for the C1 level. The ICC for the intergraded A level is the highest of all. This is also reflected in the timeline Charts 1 and 2 below, which show the ICC fluctuation for the free writing production and mediation respectively.


Chart 1: ICC Free Writing Production Timeline for all levels

Chart 2: ICC Mediation Timeline for all levels

Charts 1 and 2 also clearly demonstrate the established pattern of lower ICC as the test level becomes higher. A closer look at Chart 1 shows a tendency for a more or less stabilized ICC index for the B2 level at around 0,75, whereas the ICC index for the C1 level shows an upward-sloping trendline converging with the B2 measurements. The same pattern is followed in the data presented in Chart 2, with the two lines converging around the 0,70 measurement. Even though data for the B1 level are not yet sufficient for any definite conclusion, one sees that ICC estimates for the free writing production (Chart 1) are slightly higher than those of B2 level and seem to be converging towards a little above 0,75. The B1 indices for mediation (chart 2) are significantly higher, moving well above 0,80. The measurements for A level are represented as dots, since there are records for only one period (May 2008); however the extremely high ICC index in the A level measurements for both tasks is obvious.

7. Discussion of findings

The analysis of the data obtained shows that raters demonstrated quite high correlations in the corresponding examination periods. In most cases the ICC stays above r = .60, which qualifies these as strong correlations (Cohen, 1988). One can also notice that the ICC estimates are consistently higher for the ratings in level B2 than in level C1. This can be attributed to the fact that raters are more experienced in rating B2 level papers, as this specific exam was the first to have been administered by the Ministry of Education, almost a year and a half before the C1 level exam was introduced. Therefore, there were two rating periods in which raters rated B2 level scripts exclusively, before they started rating C1 level scripts. Additionally, we can assume that C1 level scripts demonstrate more complex language choices and deeper and broader cognitive processes, which leave raters with a broader range of decisions.

The fact that the ICC for B1 level is higher than that of B2 level (although it is still early to establish a fixed pattern) can be attributed to various factors. One can be the lower language level for that test, which makes script rating simpler in terms of linguistic and syntactical choices and judgements. A second factor is that the candidates (and consequently the script raters) are given a sample script which they have to follow in the actual test. This script acts as a guide for the candidates when producing a script, resulting in a more or less homogenous approach to the requested task. The findings from the latest period confirm the hypotheses above. The ICC for the A level is almost 1,00 (perfect correlation). The judgement choices for the raters are rather limited as the tasks are simpler than those at B1 level, resulting in almost identical ratings.

There is a slight drop in ICC estimates in the November 2006 examination period. As one clearly sees in Charts 1 and 2, this drop is reflected in both B2 and C1 levels and in both tasks. Since in that examination period quite a large number of new raters were introduced into the system, one might assume that the ‘experience in rating KPG scripts factor’ was affected, resulting in this drop in ICC. The effect is rectified in the following periods, which leads us to come up with the assumption that experience in rating is of the utmost importance when it comes to rater agreement. It is worth noting that there were very similar correlations between the free written production and mediation tasks for each period and for each level when examined individually. This implies that raters follow a uniform approach towards applying the criteria set for every period and every separate exam. If one looks at correlations throughout the two levels in the last three years, one cannot fail to see that there are no extreme fluctuations in the strengths. This is another indication of uniformity in the overall approach towards rating written texts in the KPG system.

8. Conclusion

The analysis of the data yielded from the KPG script rating shows that raters apply the relevant criteria in a generally uniform way, showing strong significant correlations in the different rating periods. The instruments employed by the test developers have a positive effect on rater agreement indices as ICC estimates follow an upward-sloping trend line. This implies that experience in rating leads to better correlations, thus constant training of the raters on the part of the test developers is required.


Alderson, C. J., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (2004). Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Erlbaum.

Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New York: John Wiley & Sons, Inc.

Gamaroff, R. (2000). Rater reliability in language assessment: The bug of all bears. System, 28, 31-53.

Garson, D. G. (1998-2008). Reliability analysis. Retrieved April 12, 2008, from North Carolina State University Web site: http://www2.chass.ncsu.edu/garson/pa765/reliab.htm

Hamp-Lyons, L. (1990). Second Language Writing: Assessment Issues. In B. Kroll (Ed.), Second Language Writing (pp. 69-87). Cambridge: Cambridge University Press.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.

Lombard, M., Snyder-Duch, J., & Campanella-Bracken, C. (2005, June 13). Practical resources for assessing and reporting intercoder reliability in content analysis research projects. Retrieved February 20, 2007, from Temple University Web Site: http://www.temple.edu/ispr/mmc/reliability/#What%20is%20intercoder%20reliability

Maclennan, R. N. (1993). Inter-rater reliability with SPSS for Windows 5.0. The American Statistician, 47(4), 292-296.

Oller, J. W. (1979). Language Tests at School. London: Longman.

Penny, J., Johnson, R. L., & Gordon, B. (2000). The effect of rating augmentation on inter-rater reliability: An empirical study of a holistic rubric. Assessing Writing, 7(2), 143-164. doi:10.1016/S1075-2935(00)00012-X

Professional Testing. (2006). Test reliability. Retrieved from Professional Testing Web site: http://www.proftesting.com/test_topics/pdfs/test_quality_reliability.pdf

RCeL. (2007). Research Centre for Language Teaching, Testing and Assessment. Retrieved February 20, 2007, from Research Centre for English Language Teaching, Testing and Assessment Web Site: http://www.rcel.enl.uoa.gr/

Shrout, P., & Fleiss, J. L. (1979). Intra class correlation: uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.

Stemler, S. E. (2004). A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Inter-rater Reliability [Electronic version]. Practical Assessment, Research & Evaluation, 9(4). Retrieved February 24, 2007 from http://PAREonline.net/getvn.asp?v=9&n=4 .

Tinsley, H. E., & Weiss, D. J. (2000). Inter-rater Reliability and Agreement. In H. E. A. Tinsley & S. D. Brown  (Eds.), Handbook of Applied Multivariate Statistics and Mathematical Modeling (pp. 95-124). San Diego, CA.: Academic Press.

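Tsopanoglou, A. (2000). Μεθοδολογία της Επιστημονικής Έρευνας και Εφαρμογές της στην Αξιολόγηση της Γλωσσικής Κατάρτισης [Methodology of Scientific Research and Its Applications in the Assessment of Language Training]. Thessaloniki: Ziti Publications.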

Trochim, W. M. (2000). The Research Methods Knowledge Base (2nd ed.). Available at: http://www.socialresearchmethods.net/kb/

Von Eye, A., & Mun, E. Y. (2005). Analyzing Rater Agreement: Manifest Variable Methods. Mahwah, NJ: Lawrence Erlbaum.

Wuensch, K. L. (2006). Inter-rater Agreement. Retrieved February 25, 2007, from East Carolina University - Dept. of Psychology Web Site: http://core.ecu.edu/psyc/wuenschk/docs30/Inter-rater.doc



B2 level marking grid

1. Text content, form, style and organization


2. Appropriacy of lexicogrammatic selections


3. Appropriacy of linguistic expression, cohesion and coherence of discourse






Production of a written text of specific content*, in accordance with the communicative situation as defined in the instructions, which determines the choice of form, style and organization of the form of speech prescribed by the ‘norm’ (e.g. advertisement, application, report).

Selection of the proper linguistic elements (words and phrases), given their textual and contextual framework.



Appropriate grammatical and syntactical use of language with coherence and cohesion.












Very satisfactory



A text that responds successfully to the 3 criteria.

Development of general meaning and partial meanings with acceptable uses of language and cohesion of speech.


It has minimal errors, which do not prevent the transfer of the meaning of the text.


A text that refers to the subject of the test and has the required form and organization. It includes some mistakes which do not essentially obstruct communication. In general, it is a text with a natural flow of discourse.

It includes generally accepted uses of language, although the linguistic selections may not always be the most appropriate.


Certain linguistic selections are not appropriate but in accordance with the basic grammar rules.






Moderately satisfactory

A text that does not deviate from the subject of the test and is in the demanded form. It includes some errors that hinder the transfer of meaning. In general, the flow of speech and the cohesion of phrases are not totally natural.

Certain linguistic selections are not appropriate and in some cases deviate from rules of language usage, but the diction is satisfactory.


Several linguistic selections are inappropriate, diction relatively satisfactory, but some phrases are awkward.


A text that deals with aspects of the subject and approaches the form requested by the task. It includes errors that hinder understanding at some points, but there is relevant cohesion of discourse.

Certain linguistic selections are not appropriate and deviate from the acceptable use of language.


Several linguistic selections are inappropriate and/or incorrect according to grammar rules.





Partly satisfactory

A text that does not deal with the subject in an absolutely satisfying manner and does not exactly have the form requested by the task. To an extent, errors inhibit its general understanding.

Limited vocabulary, inappropriate expressions and errors, but the meaning is transferred.


The general meaning is transferred, but the particular information is difficult to understand.


A text which does not have the form requested by the task and includes errors of various types.

Many errors significantly hindering the understanding of main points.


Many and serious errors of vocabulary, grammar, spelling, etc.



Not satisfactory





Text not understood


Words scattered


No answer


* For Activity 2: based on a prompt in Greek

B1 level marking grid

Criterion 1: Task completion. Candidate responded in terms of: (a) content/topic, (b) communicative purpose [mediation], and (c) genre and register

Criterion 2: Text grammar. The text is assessed for (a) organization/ordering of ideas, (b) text coherence, (c) the cohesion devices used

Criterion 3: Grammar & vocabulary. The text is assessed for (a) grammar/syntax, (b) choice of vocabulary, (c) spelling and punctuation



Fully satisfactory


Good organization of ideas, fully coherent text, simple but correct/appropriate cohesion devices.

Correct structures; appropriate vocabulary; standard spelling.


Insignificant grammar errors; appropriate but limited vocabulary; few spelling errors that do not distort meaning.



Good organization of ideas, fully coherent text, simple and mostly correct/appropriate cohesion devices.

Simple structures with a few insignificant errors; vocabulary is limited; there are occasional vocabulary and spelling errors that do not distort meaning.


Simple structures with insignificant errors, vocabulary is limited but only few words are incorrect; spelling is acceptable.



Moderately satisfactory


Fairly good organization of ideas, fully coherent text, using simple but mostly correct cohesion devices.

Simple structures with a few errors that do not distort meaning; limited and sometimes incorrect vocabulary and spelling.


Simple structures with a few serious errors that do not interfere with intelligibility; limited but appropriate vocabulary; some incorrect words and spelling.



Fairly good organization of ideas, coherent text, simple cohesion devices which are sometimes incorrect and frequently inappropriate.

A few problematic structures; generally limited and somewhat inappropriate vocabulary; some incorrect words and spelling.


A few problematic structures; limited and inappropriate vocabulary; some incorrect words and spelling.




Unsatisfactory – questionable pass


Ideas are somewhat disorganized, but the text is generally coherent and the cohesion of the text is sometimes problematic.

Quite a few grammar, vocabulary and spelling errors, but the errors generally do not impede intelligibility.


Frequent grammar errors, use of some inappropriate vocabulary and often wrong spelling, but these errors do not impede intelligibility.



Poor organization of ideas; the text is sometimes incoherent and the cohesion is seriously problematic.

Grammar and vocabulary errors are frequent and parts of the text are sometimes difficult to understand.


Grammar and vocabulary errors are very frequent and a few parts of the text are unintelligible.



Irrelevant text



Unintelligible text



No response or scattered words






©2009-2011 Τμήμα Αγγλικής Γλώσσας και Φιλολογίας - ΕΚΠΑ / Faculty of English Language and Literature - UOA