Correlation, Causation and How to think about EMQ (Educational Measurement Quality)
Posted: February 20, 2015
How best to assess our educational assessment tools has been an ongoing question for me for some time. Measurement is an inherently statistical activity, so, unsurprisingly, figuring out how well the tools are working, and which tools work better, is largely a statistical question. Which tool measured which skill, ability or knowledge best, with least error or greatest reliability, for whom, at what ability level, with what evidence, and so on: these are difficult questions, and they are without end. But since defining and demonstrating the relevance of “evidence” to “construct” can be nuanced and difficult, it is nice that some topics are simple and direct, such as how different measurements of what is, ostensibly, the “same thing” compare. This is a bit more straightforward.
That said, if it develops that a test question scored by “expert” graders shows wide disparity between the scores assigned by different graders, many explanations are possible: is this a reflection of a lack of precision in the instrument, a lack of congruence between the thing measured and the ability used, or differences in opinion and viewpoint between the two scorers on how different aspects of the sub-domain impact the overall evaluation of the subject? If, fundamentally, two essay graders cannot agree on the score for, say, “quality of writing” even most of the time, on a relatively small score scale, I find it hard to move past this point to try to improve “scoring” when the basic measure itself seems to be in question.
So thinking about this “scoring” challenge itself requires a view of how two sets of scores by different scorers might “correlate”, and of what might constitute useful correlation and what might not. I have previously commented on how the distributions of scores from two scoring sources might be compared, and how “comparable” sets of scores could still, by some measures, reflect or hide significant bias in measurement.
This issue came to mind during a recent reading of an otherwise excellent and well documented research report 1 on the use of “e-Rater”, ETS’ tool for analyzing “essays”, on TOEFL (ETS’ English language proficiency test) essays. The report sets out to understand how technology developed to score a certain class of educational assessment “essays” or constructed response items (primarily or entirely in terms of “quality of writing”, it is important to note) might fare when used to evaluate the strength of second-language acquisition / ESL writing skills.
Whatever one makes of the content, there are 8 tables of data comparing how two different raters score the same essays and how those measurements compare to one another; how the human and e-rater ratings compare (either individually, or as an average of the human scores, etc.); and how these measures compare with other measures: self-evaluation, instructor evaluations (both ESL instructors and instructors in the student’s major area), and so on. Of the various comparisons made in the eight tables, it was interesting to see that the Pearson’s r correlations of the human essay scores with e-rater scores, or with anything else, or between anything else, were generally in the low end of the 0.23 to 0.45 range, clustered below .40 (see tables 2, 3, 4 and 5). Correlations between professors’ writing judgements and e-rater scores (table 8) were 0.15 and 0.18 (!); they were higher for human ratings of their iBT essays, but still only .15 to .33. As the scoring engine was being used for a non-designed purpose, low levels of correlation were not a surprise. What was a surprise, however, was the authors’ summative comment:
“As for considerations of criterion-related validity, correlations between essay scores and other indicators of writing ability were generally moderate, whether they were scored by human raters or e-rater. These moderate correlations are not unlike those found in other criterion-related validity studies (see, for example, Kuncel et al., 2001 for a meta-analysis of such studies of the GRE). They are also similar to or higher than those presented in Powers et al. (2000), comparing e-rater scores of GRE essays with a variety of other indicators. The correlations in that study ranged from .08 to .30 for a single human rater, from .07 to .31 for two human raters, and from .09 to .24 for e-rater. …”
What was surprising to me was primarily that such low levels of correlation would be described as “generally moderate” in a peer-reviewed, academic journal. It makes me hope that higher standards are employed when AES scoring is actually used for “high stakes” testing, and that testing companies are transparent about both their scoring methodology and the statistical underpinnings of any scoring decisions made by algorithm.
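To put numbers on why these correlations look low to me, here is a minimal sketch that converts a Pearson r into the share of variance it accounts for (r squared). The list of values is simply the set of correlations quoted above; the function name is my own invention for illustration, not anything taken from the report.

```python
# Share of variance in one set of scores accounted for by another,
# given their Pearson correlation r, is r squared.

def variance_explained(r):
    """Return r squared, the proportion of shared variance."""
    return r * r

# Correlations quoted in the discussion above (not re-derived from the tables).
quoted_correlations = [0.15, 0.18, 0.23, 0.33, 0.40, 0.45]

for r in quoted_correlations:
    share = variance_explained(r)
    print(f"r = {r:.2f} -> about {share * 100:.1f}% of variance shared")
```

Even at the top of the quoted range, r = 0.45, only about a fifth of the variance in one measure is shared with the other.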
I have read that there is fairly broad agreement among US citizens about what level of overall income tax seems “fair and reasonable”, and that many surveys point to something around 25% as a consensus figure if one rate had to apply to everyone. I am wondering what parents or teachers would think is a reasonable level of inter-rater agreement for scoring a constructed response item, say on a “science” test or a “reading” test? If an essay is scored on a 1 to 6 point scale, or a longish task is scored with zero to three points, and two humans grade each essay, would there be an expectation that two (qualified, trained) scorers would “agree” (meaning exactly, in case you are a psychometrician or statistician) most of the time? 2/3 of the time? All of the time? Or that the correlation between any two scores for the same essay by qualified graders would be at least X? And if two scorers agreed on the score for an essay less than half the time, would this be viewed as problematic, or “close enough”? Of course, “it depends”, but… if you said “moderate correlation”, would that suffice?
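Because “agreement” and “correlation” answer different questions, a small sketch may help keep them apart. The scores below are hypothetical, made up purely for illustration on a 1 to 6 scale, with the second rater always one point more generous than the first; the function names are likewise my own.

```python
from math import sqrt

def exact_agreement(a, b):
    """Fraction of essays on which both raters gave the identical score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b):
    """Fraction of essays on which the two scores differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)

def pearson_r(a, b):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical scores on a 1-6 scale (assumption: invented for this example).
rater_a = [1, 2, 3, 4, 5, 2, 3, 4]
rater_b = [2, 3, 4, 5, 6, 3, 4, 5]  # always one point higher than rater_a

print("exact agreement:   ", exact_agreement(rater_a, rater_b))
print("adjacent agreement:", adjacent_agreement(rater_a, rater_b))
print("Pearson r:         ", pearson_r(rater_a, rater_b))
```

In this contrived case the two raters never agree exactly, yet their scores correlate perfectly, which is one way a respectable-looking correlation can hide a systematic bias between scorers.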
My own view is that a substantial portion of the challenge in scoring “constructed response items” comes from the degree to which the rubrics used to score answers to these questions are too vague, leaving too many rules for applying the rubric unspoken or for graders to decide themselves. [For example, scoring rubrics that call for distinctions between “adequate mastery”, “reasonable mastery” and “clear and consistent mastery” of a skill or some knowledge without defining these distinctions or gradations in an objective, concrete way.] Particularly where there is a single, “holistic” score, but also in more narrow scoring scenarios, there will always be component elements of the score that different graders consider. Unless there is common guidance on how the rubric is to be applied in a variety of scenarios, and scorers are trained in the same way with the same results, the score “signal” may in many instances struggle to stand out above the “noise”: the din and clamor of variation introduced by individual preferences, interpretations and ideas. The result is measurements with unfortunate reliability, which tarnishes the value attached to the assessment because, from first-hand experience, people will begin to see variations in performance and ability that change from test to test and do not reflect apparent differences in the demonstrated skills, knowledge and ability of the examinees themselves 3.
One last note: when thinking about “how much inter-rater agreement would be a minimum indication of useful measurement”, I found this bit 4 useful; it suggests a lower bound (quoting):
“Specifically, the quadratic-weighted kappa between automated and human scoring must be at least .70 (rounded normally) on data sets that show generally normal distributions, providing a threshold at which approximately half of the variance in human scoring is accounted for by [the automated scoring engine]…
This value was selected on the conceptual basis that it represents the “tipping point” at which signal outweighs noise in agreement. The identical criterion of .70 has been adopted for product-moment correlation with the same underlying rationale regarding proportion of variance accounted for by [the technology].” [Emphasis / color added by ME!]
Of course, a) talking about “normal distributions” when you have a four point scale has less meaning than it might in other situations… and b) people of good will can disagree on such things (in both the details and as a matter of their own ideas about how learning, and measurement, work…)
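For readers who have not met the statistic named in the quote, here is a minimal sketch of quadratic-weighted kappa as it is commonly defined: one minus the ratio of weighted observed disagreement to the weighted disagreement expected by chance from each rater's score distribution, with weights growing as the square of the score difference. The function name, its parameters and the scores are my own assumptions for illustration; this is not the quoted authors' code.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic-weighted kappa for two lists of integer scores.

    Assumes both lists have the same length and that not every score
    from both raters is identical to a single value (otherwise the
    chance-expected disagreement is zero and kappa is undefined).
    """
    k = max_score - min_score + 1
    n = len(a)

    # Observed counts: rows index rater a's score, columns rater b's.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    return 1.0 - weighted_observed / weighted_expected

# Hypothetical scores on a 1-6 scale (assumption: invented for this example).
rater_a = [1, 2, 3, 4, 5, 2, 3, 4]
rater_b = [2, 3, 4, 5, 6, 3, 4, 5]  # always one point higher than rater_a

print("QWK, identical scores:", quadratic_weighted_kappa(rater_a, rater_a, 1, 6))
print("QWK, shifted scores:  ", quadratic_weighted_kappa(rater_a, rater_b, 1, 6))

# The "approximately half of the variance" remark for a correlation of .70:
print("0.70 squared:", round(0.70 ** 2, 2))
```

Unlike a plain correlation, this statistic is pulled down when one rater is systematically more generous than the other, and a correlation of .70 squares to .49, which is where the “approximately half of the variance” in the quote comes from.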
1) see Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335-353. or here: http://ltj.sagepub.com/content/27/3.toc
2) From wikipedia, for example (and note I am not declaring wikipedia infallible, or even authoritative, but citing it as a reflection of what some? many? at least one? people might consider reasonable):