Scoring Long-form Constructed Response: Statistical Challenges in Model Validation

SLFCR-SCMV-Cover20140901The increasing desire for more authentic assessment, and to the assessment of higher order cognitive abilities, is leading to an increased focus on performance assessment and the measurement of problem-solving skills, among other changes, in large scale educational assessment.

Present practice in production scoring for constructed response assessment items where student responses of one to several paragraphs are evaluated on well-defined rubrics by distributed teams of human scorers currently yields – in many cases — results that are barely acceptable even for course-grained, single-dimension metrics. That is, even when scoring essays on a single, four to six point scale (as for example was done in the ASAP competition for automated essay scoring on Kaggle[1]), human scorer inter-rater reliability is marginal (or at least, less reliable than might be expected), in the sense that inter-rater agreement rates ranged from 28 to 78%, with associated quadratic-weighted Kappas ranging from .62 to .85.

Said another way, about half the time, two raters, or human (averaged) scores and AI scoring engines, will yield the same result for a simple measure, while the rest of the time, the variation can be all over map. Kappa doesn’t really tell us very much about this variation, which is a concern, because “better” (higher) Kappas might also mask abnormal or biased relationships, where models with slightly lower Kappas might, on examination, provide an intuitively more appealing result. And for scoring solutions that seek to use more detailed scoring rubrics, and to provide sub-scores and more nuanced feedback, while still solving for reliability and validity in overall scoring, the challenge of finding the “best” model for a given dataset will be even greater.

I have written a short paper that focuses solely on the problem of evaluating models that attempt to mimic human scores that are provided under the best of conditions (e.g. expert scorers not impacted by timing constraints), and addresses the question of how to define the “best” performing models.  The aforementioned Kaggle competition chose Quadratic Weighted Kappa (QWK) as a means of measuring the conformance of scores reported by a model and scores assigned by human scorers. Other Kaggle competitions routinely use other metrics as well[2], while some critics of the ASAP competition in particular, and of the use of AES technology in general, have argued that other model performance metrics might be more appropriate[3].

[Update: at this point the paper simply illustrates why QWK is not by itself sufficient to definatively say one model is “better” than another by providing a counter example and some explanation of the problem.]

As a single descriptive statistic, QWK has inherent limits in describing the differences between two populations of results. Accordingly, this short note will present an example to illustrate the extent of these limitations. In short I think that – at least for a two-way comparison between a set of results from human scorers and a set of results from a trained machine learning model trying to emulate human scorers, the basic “confusion matrix” that shows a two dimensional grid with exact match results on a diagonal, and non-exact matches as outliers provides an unbeatable visualization of just how random, or not, the results of using a model might look against a set of “expert” measures.

Future efforts will considers suggested alternatives to QWK, or additional descriptive statistics that can be used in conjunction with QWK, hopefully leading to some more usable and “better” criterion for particular use cases and suggestions for further research.

Feedback welcome!  Full document is linked here: SLFCR-scmv-140919a-all



[1] See

[2] See

[3] See particularly section entitled “Flawed Experimental Design I”. from Les C. Perlman’s paper at


2 Comments on “Scoring Long-form Constructed Response: Statistical Challenges in Model Validation”

  1. […] Scoring Long-form Constructed Response: Statistical Challenges in Model Validation → […]

  2. […] “correlate” and what might constitute useful correlation and what might not.  I have previously commented on how the distribution of score from two scoring sources might be compared, and how […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s