How best to assess our educational assessment tools has been an ongoing question for me for some time. Measurement is an inherently statistical activity, and unsurprisingly, figuring out how well the tools are working, and which tools work better, is largely a statistical question: what tool measured what skill, ability, or knowledge best, with the least error or the greatest reliability, for whom, at what ability level, with what evidence? These questions are without end. But since defining and demonstrating the relevance of “evidence” to “construct” can be nuanced and difficult, it is nice that some topics are simple and direct, such as how different measurements of what is, ostensibly, the “same thing” compare. This is a bit more straightforward.
That said, if a test question scored by “expert” graders shows wide disparity between the scores assigned by different graders, many explanations are possible: is this a reflection of a lack of precision in the instrument, a lack of congruence between the thing measured and the ability tested, or differences in opinion and viewpoint between the two scorers about how different aspects of the sub-domain affect the overall evaluation of the subject? If, fundamentally, two essay graders cannot agree on the score for, say, “quality of writing” even most of the time, on a relatively small score scale, I find it hard to move past this point to try to improve “scoring” when the basic measure itself seems to be in question.
So thinking about this “scoring” challenge requires a view of how two sets of scores from different scorers might “correlate”, and of what might constitute useful correlation and what might not. I have previously commented on how the distributions of scores from two scoring sources might be compared, and how “comparable” sets of scores could still, by some measures, reflect or hide significant bias in measurement.
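To make that last point concrete, here is a minimal sketch (the scores are invented for illustration; nothing here is from any study) of how two raters can produce identical score *distributions* while agreeing on almost no individual essays, which is exactly how “comparable” score sets can hide real disagreement:

```python
from collections import Counter

# Hypothetical scores on a 1-6 scale; rater_b is just a reshuffling of
# rater_a, so the two marginal distributions are identical by construction.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
rater_b = [3, 1, 4, 2, 3, 5, 3, 6, 4, 2]

print(Counter(rater_a) == Counter(rater_b))  # True: same distribution

# ...yet the raters agree on only 1 of the 10 essays.
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"exact agreement: {exact:.0%}")  # 10%
```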
This issue came to mind during a recent reading of an otherwise excellent and well-documented research report 1 on the use of “e-rater” (ETS’ tool for analyzing essays) on TOEFL (ETS’ English language proficiency test) essays, to understand how a technology developed to score a certain class of educational assessment essays or constructed-response items (primarily or entirely in terms of “quality of writing”, it is important to note) might fare when used to evaluate the strength of second-language acquisition / ESL writing skills.
Whatever one makes of the content, there are 8 tables of data comparing how two different raters score the same essays and how those measurements compare to one another: how the human and e-rater ratings compare (either individually, or as an average of the human scores, etc.), as well as how these measures compare with other measures, such as self-evaluations, instructor evaluations (both by ESL instructors and by instructors in the student’s major area), and so on. Of the various comparisons made in the eight tables, it was interesting to see that the Pearson’s r correlations of the human essay scores with the e-rater scores, or of either with anything else, were generally in the 0.23 to 0.45 range, clustered below 0.40 (see tables 2, 3, 4 and 5). Correlations between professors’ judgments of students’ writing and e-rater scores (table 8) were 0.15 and 0.18 (!); the correlations with human ratings of the same iBT essays were higher, but still only 0.15 to 0.33. As the scoring engine was being used for a purpose it was not designed for, low levels of correlation were not a surprise. What was a surprise, however, were the summative comments by the authors, who acknowledge that:
“As for considerations of criterion-related validity, correlations between essay scores and other indicators of writing ability were generally moderate, whether they were scored by human raters or e-rater. These moderate correlations are not unlike those found in other criterion-related validity studies (see, for example, Kuncel et al., 2001 for a meta-analysis of such studies of the GRE). They are also similar to or higher than those presented in Powers et al. (2000), comparing e-rater scores of GRE essays with a variety of other indicators. The correlations in that study ranged from .08 to .30 for a single human rater, from .07 to .31 for two human raters, and from .09 to .24 for e-rater. …”
What was surprising to me was primarily that such low levels of correlation would be described as “generally moderate” in a peer-reviewed, academic journal. (Recall that a correlation of 0.30 means the two measures share less than 10% of their variance.) It makes me hope that higher standards are employed when AES scoring is actually used for “high stakes” testing, and that testing companies are transparent about both their scoring methodology and the statistical underpinnings of any scoring decisions made by algorithm.
I have read that there is fairly broad agreement among US citizens about what level of overall income tax seems “fair and reasonable”, and that many surveys point to something around 25% as a consensus figure if one rate had to apply to everyone. I wonder what parents or teachers would think is a reasonable level of inter-rater agreement for scoring a constructed-response item on, say, a “science” test or a “reading” test. If an essay is scored on a 1 to 6 point scale, or a longish task is scored from zero to three points, and two humans grade each essay, would there be an expectation that two (qualified, trained) scorers would “agree” (meaning exactly, in case you are a psychometrician or statistician) most of the time? Two-thirds of the time? All of the time? Or that the correlation between any two scores for the same essay by qualified graders would be at least X? And if two scorers agreed on the score for an essay less than half the time, would this be viewed as problematic? Or “close enough”? Of course, “it depends”, but… if you said “moderate correlation”, would that suffice?
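For anyone who wants to play with these questions, here is a rough sketch (again with invented scores) of the three statistics in play: exact agreement, adjacent agreement (within one point), and the Pearson correlation between two raters. Note how a correlation many would call strong (about 0.75 here) can coexist with exact agreement only a third of the time:

```python
# Hypothetical scores from two trained raters on a 1-6 scale.
rater_1 = [4, 3, 5, 2, 4, 6, 3, 4, 2, 5, 3, 4]
rater_2 = [3, 3, 4, 2, 5, 5, 4, 4, 3, 6, 2, 4]

n = len(rater_1)
# Exact agreement: same score; adjacent agreement: within one point.
exact = sum(a == b for a, b in zip(rater_1, rater_2)) / n
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_1, rater_2)) / n

# Pearson's r, computed from scratch: covariance over the product of
# standard deviations.
mean1, mean2 = sum(rater_1) / n, sum(rater_2) / n
cov = sum((a - mean1) * (b - mean2) for a, b in zip(rater_1, rater_2))
var1 = sum((a - mean1) ** 2 for a in rater_1)
var2 = sum((b - mean2) ** 2 for b in rater_2)
r = cov / (var1 * var2) ** 0.5

print(f"exact: {exact:.0%}, adjacent: {adjacent:.0%}, r = {r:.2f}")
# exact: 33%, adjacent: 100%, r = 0.75
```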
My own view is that a substantial portion of the challenge in scoring “constructed response items” comes from the degree to which the rubrics used to score these questions are too vague, leaving too many rules for applying the rubric unspoken or up to the grader to decide. [For example, scoring rubrics that call for distinctions between “adequate mastery”, “reasonable mastery” and “clear and consistent mastery” of a skill or some knowledge, without defining these gradations in an objective, concrete way.] Particularly where there is a single, “holistic” score, but also in more narrow scoring scenarios, there will always be component elements of the score that different graders weigh differently. Unless there is common guidance on how the rubric is to be applied in a variety of scenarios, and scorers are trained in the same way with the same results, the “signal” in the score may in many instances struggle to stand out above the din and clamor of variation introduced by individual preferences, interpretations, and ideas. The result is measurement with poor reliability, which tarnishes the value attached to the assessment: from first-hand experience, people will begin to see variations in performance and ability that change from test to test and do not reflect apparent differences in the demonstrated skills, knowledge, and ability of the examinees themselves3.
One last note: when thinking about “how much inter-rater agreement would be a minimum indication of useful measurement”, I found this bit 4 useful; it suggests a lower bound (quoting):
“Specifically, the quadratic-weighted kappa between automated and human scoring must be at least .70 (rounded normally) on data sets that show generally normal distributions, providing a threshold at which approximately half of the variance in human scoring is accounted for by [the automated scoring engine]…
This value was selected on the conceptual basis that it represents the “tipping point” at which signal outweighs noise in agreement. The identical criterion of .70 has been adopted for product-moment correlation with the same underlying rationale regarding proportion of variance accounted for by [the technology].” [Emphasis / color added by ME!]
Of course, a) talking about “normal distributions” when you have a four-point scale has less meaning than it might in other situations… and b) people of good will can disagree on such things (both in the details and as a matter of their own ideas about how learning, and measurement, work…)
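For the curious, here is a minimal sketch of the quadratic-weighted kappa statistic behind that .70 threshold (the human and engine scores are invented, and a 1 to 4 scale is assumed). The “half the variance” rationale for the parallel product-moment criterion is just that 0.70 squared is roughly 0.49:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights."""
    scores = np.arange(min_score, max_score + 1)
    k = len(scores)
    # Observed joint distribution of the two sets of scores.
    observed = np.zeros((k, k))
    for x, y in zip(a, b):
        observed[x - min_score, y - min_score] += 1
    observed /= observed.sum()
    # Expected joint distribution under independent marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic weights penalize large disagreements more heavily.
    weights = (scores[:, None] - scores[None, :]) ** 2 / (k - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Invented scores for ten essays on a 1-4 scale.
human = [1, 2, 2, 3, 3, 3, 4, 4, 1, 2]
engine = [1, 2, 3, 3, 2, 3, 4, 3, 2, 2]
print(round(quadratic_weighted_kappa(human, engine, 1, 4), 2))  # ~0.76
```

With these made-up scores the engine clears the quoted .70 bar, despite disagreeing with the human on four of the ten essays; the quadratic weighting forgives one-point disagreements much more readily than larger ones.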
1) See Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335–353. Or here: http://ltj.sagepub.com/content/27/3.toc
2) From wikipedia, for example (and note I am not declaring wikipedia infallible, or even authoritative, but citing it as a reflection of what some? many? at least one? people might consider reasonable):
3) As noted in the prior post, and in the piece by Steve Kolowich in the Chronicle of Higher Education, April 28, 2014, “Writing Instructor, Skeptical of Automated Grading, Pits Machine vs. Machine”, we read this:
In presentations, he likes to show how the Gettysburg Address would have scored poorly on the SAT writing test. (That test is graded by human readers, but Dr. Perelman says the rubric is so rigid, and time so short, that they may as well be robots.)
Which was delightful, since in this piece, covering the well-honed and oft-reported voice of a leading critic of “automated essay scoring” (clearly in sympathy with the likes of another professor and his associates behind the petition at humanreaders.org), Dr. Perelman himself acknowledges the primary flaw in most large-scale uses of AES technology in educational measurement today: the “production scoring techniques” actually employed for human scoring of essays are generally so far removed from the “thoughtful, knowledgeable and nuanced” engagement with student writing required for useful feedback and genuine analysis that to argue that using “AI” (that is, advances in data science and machine learning) will deprive students of this necessary human care and consideration is, well, to pretend that care and consideration are hallmarks of current practice and student experience.
Or, said another way, this position sets up exactly the proper consideration of where and how to use “constructed response” assessment items, and how to score them (with humans, AES technology, or some combination) in a way that is best for teaching and learning, and for the steady practice of critical thinking and of the craft of writing. [On second thought, where and when to use “authentic” assessment items should be, and is, basically “always”… The trouble is, as noted elsewhere, most “essay questions” are so poor that, with generic rubrics and half a dozen or more subjective criteria collapsed into a small set of dimensions and score points, human inter-rater reliability for anything remotely “realistic” shows that the measures are often more noise than signal… but this is another bit for another day.]
Now, this comment by Dr. Perelman is not a direct quote, nor could I find anything specifically related to the “Gettysburg Address” comment in the Chronicle piece, the Perelman paper, or even the 2005 NYT article (by Michael Winerip) referenced by Kolowich, but I think the essence of the complaint is entirely valid and worth further consideration.
It is one I have previously cast as: just how meaningful can an “essay score” be when the scorer, whatever the rubric, is grading 32 tests an hour, that is, spending under two minutes per essay? The “they might as well be robots” formulation used by Kolowich to summarize Dr. Perelman’s main point seems to be saying the same thing, and it would be interesting to understand it more deeply. Is the thought that the score assigned by high-speed “human robots” must be tied to simple surface features, mechanics, and vocabulary, things that can be evaluated almost at a glance? Or that the rubrics, and the number of score points and dimensions, don’t allow for any real nuance, or consistency, even if more time and thought were brought to bear on scoring? Or both? Or something else?
It would have been very interesting to hear more on this point, rather than simply to hear yet more (well-deserved) rage at typical automated essay scoring in general. Arguments don’t need to be purely theoretical (e.g., that AES engines today do not do the work the way humans do) and lacking in nuance and texture, because, once you get past “computers don’t think”, the better, more cogent point, about the lack of construct relevance, is poorly understood and usually not even stated. For now.