Human Readers: They May as Well Be Robots

As noted in the prior post, the piece by Steve Kolowich in the Chronicle of Higher Education (April 28, 2014), “Writing Instructor, Skeptical of Automated Grading, Pits Machine vs. Machine”, includes this:

In presentations, he likes to show how the Gettysburg Address would have scored poorly on the SAT writing test. (That test is graded by human readers, but Dr. Perelman says the rubric is so rigid, and time so short, that they may as well be robots.)

Which was delightful. In this piece covering the well-honed and oft-reported voice of a leading critic of “automated essay scoring”, clearly in sympathy with the likes of another professor and his associates behind the petition at, Dr. Perelman himself acknowledges the primary flaw in most large-scale uses of AES technology in educational measurement today. The “production scoring techniques” actually employed for human scoring of essays are generally far removed from the “thoughtful, knowledgable and nuanced” engagement with student writing that useful feedback and genuine analysis require. To argue that using “AI” (aka advances in data science and machine learning) will deprive students of this necessary human care and consideration is, well, to pretend that such care and consideration are hallmarks of current practice and student experience.

Said another way, this position sets up exactly the right question: where and how to use “constructed response” assessment items, and how to score them (with humans, AES technology or some combination) in a way that is best for teaching and learning, and for the steady practice of critical thinking and the craft of writing. [On second thought, the answer to where and when to use “authentic” assessment items should be, and is, basically “always”… The trouble is, as noted elsewhere, that most “essay questions” are so poor, with generic rubrics and half a dozen or more subjective criteria collapsed into a small set of dimensions and score points, that human inter-rater reliability for anything remotely “realistic” shows the measures are often more noise than signal… but this is another bit for another day.]

Now this comment by Dr. Perelman is not a direct quote, nor could I find anything in this Chronicle piece, the Perelman paper or even in the 2005 NYT article referenced by Kolowich (by Michael Winerip) related to the “Gettysburg Address” comment specifically, but I think the essence of the complaint is extremely valid and worth further consideration.

It is one I have previously cast as: Just how meaningful can an “essay score” be when the scorer, whatever the rubric, is grading 32 tests an hour? The “they might as well be robots” formulation used by Kolowich to summarize Perelman’s main point seems to be saying the same thing, and it would be interesting to understand this more deeply. Is the thought that the score assigned by high-speed “human robots” would have to be tied to simple surface features, mechanics and vocabulary, things that can be evaluated almost at a glance? Or that the rubrics, and the number of score points and dimensions, don’t allow for any real nuance (or consistency) even if more time and thought were brought to bear on scoring? Or both? Or something else?

It would have been very interesting to hear more on this point, rather than simply hear yet more (well deserved) rage at typical automated essay scoring in general. Arguments don’t need to be completely theoretical, however (e.g. AES engines today do not do the work the way humans do), or lacking in nuance and texture, because once you get past “computers don’t think”, the better, more cogent point about “the lack of construct relevance” is poorly understood and usually not even stated. For now.

