Scoring Long-form Constructed Response: Statistical Challenges in Model Validation
Posted: September 18, 2014
The increasing desire for more authentic assessment, and for the assessment of higher-order cognitive abilities, is leading to an increased focus on performance assessment and the measurement of problem-solving skills, among other changes, in large-scale educational assessment.
Present practice in production scoring for constructed response assessment items, where student responses of one to several paragraphs are evaluated against well-defined rubrics by distributed teams of human scorers, currently yields (in many cases) results that are barely acceptable even for coarse-grained, single-dimension metrics. That is, even when essays are scored on a single four- to six-point scale (as was done, for example, in the ASAP competition for automated essay scoring on Kaggle), human inter-rater reliability is marginal, or at least lower than might be expected: inter-rater agreement rates ranged from 28 to 78%, with associated quadratic-weighted Kappas ranging from .62 to .85.
Said another way, about half the time two raters, or human (averaged) scores and AI scoring engines, will yield the same result for a simple measure, while the rest of the time the variation can be all over the map. Kappa doesn’t really tell us very much about this variation, which is a concern because “better” (higher) Kappas might mask abnormal or biased relationships, while models with slightly lower Kappas might, on examination, provide an intuitively more appealing result. And for scoring solutions that seek to use more detailed scoring rubrics, and to provide sub-scores and more nuanced feedback while still solving for reliability and validity in overall scoring, the challenge of finding the “best” model for a given dataset will be even greater.
I have written a short paper that focuses solely on the problem of evaluating models that attempt to mimic human scores provided under the best of conditions (e.g., expert scorers not impacted by timing constraints), and that addresses the question of how to define the “best” performing models. The aforementioned Kaggle competition chose Quadratic Weighted Kappa (QWK) as a means of measuring the conformance between scores reported by a model and scores assigned by human scorers. Other Kaggle competitions routinely use other metrics, and some critics of the ASAP competition in particular, and of the use of AES technology in general, have argued that other model performance metrics might be more appropriate.
[Update: at this point the paper simply illustrates why QWK is not by itself sufficient to say definitively that one model is “better” than another, by providing a counterexample and some explanation of the problem.]
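For reference, here is the usual textbook definition of quadratic weighted kappa, written out so the discussion that follows has something concrete to point at. The symbols are my own notational assumptions and do not come from the competition materials: scores are mapped to categories 1 through N, O is the observed count of responses given score i by one rater and score j by the other, r and c are the row and column totals of O, and n is the total number of responses.

```latex
\kappa_{QW} \;=\; 1 \;-\; \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, O_{ij}}
                               {\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, E_{ij}},
\qquad
w_{ij} \;=\; \frac{(i-j)^2}{(N-1)^2},
\qquad
E_{ij} \;=\; \frac{r_i\, c_j}{n}
```

Because everything is collapsed into one ratio of weighted sums, quite different patterns of off-diagonal disagreement can produce similar values, which is the limitation at issue here.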
As a single descriptive statistic, QWK has inherent limits in describing the differences between two populations of results. Accordingly, this short note will present an example to illustrate the extent of these limitations. In short, I think that, at least for a two-way comparison between a set of results from human scorers and a set of results from a trained machine learning model trying to emulate them, the basic “confusion matrix”, a two-dimensional grid with exact-match results on the diagonal and non-exact matches as off-diagonal outliers, provides an unbeatable visualization of just how random, or not, the results of using a model might look against a set of “expert” measures.
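To make the comparison concrete, here is a minimal sketch in plain Python that builds a confusion matrix from two lists of scores and then computes the exact agreement rate and QWK from it. The function names and the ten sample scores are invented for illustration; they are not taken from the paper or from the ASAP data. The sketch also assumes integer scores on a known scale with at least two score points, and at least some variation in the scores.

```python
def confusion_matrix(human, model, min_score, max_score):
    """Count score pairs: rows are human scores, columns are model scores."""
    n = max_score - min_score + 1
    matrix = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        matrix[h - min_score][m - min_score] += 1
    return matrix


def exact_agreement(matrix):
    """Fraction of responses on the diagonal (identical scores)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def quadratic_weighted_kappa(matrix):
    """QWK from a square confusion matrix.

    Assumes at least two score categories and a non-zero expected
    disagreement; otherwise the divisions below fail.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    observed = 0.0
    expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            observed += weight * matrix[i][j]
            # Expected count if the two sets of scores were independent.
            expected += weight * row_totals[i] * col_totals[j] / total
    return 1.0 - observed / expected


# Hypothetical scores on a 1-4 scale (illustrative data only).
human_scores = [1, 2, 3, 4, 2, 3, 3, 4, 1, 2]
model_scores = [1, 2, 3, 4, 3, 3, 2, 4, 2, 2]

cm = confusion_matrix(human_scores, model_scores, 1, 4)
for row in cm:
    print(row)
print("exact agreement:", exact_agreement(cm))
print("QWK:", round(quadratic_weighted_kappa(cm), 4))
```

Printing the matrix alongside the two summary numbers shows where each disagreement falls, which neither the agreement rate nor QWK can show on its own.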
Future efforts will consider suggested alternatives to QWK, or additional descriptive statistics that can be used in conjunction with QWK, hopefully leading to more usable and “better” criteria for particular use cases and to suggestions for further research.
Feedback welcome! Full document is linked here: SLFCR-scmv-140919a-all
 See particularly the section entitled “Flawed Experimental Design I” in Les C. Perelman’s paper at http://journalofwritingassessment.org/article.php?article=69