Bias and variance, precision and recall: these are concepts that, after a few months (or maybe even just a couple of weeks) of crawling around in actual data, predictive models, and the study of where prediction and reality meet, begin to have an intuitive feel. But it was nice to read recently a short piece that brings these concepts clearly into focus and frames them in terms of model behavior. This is something I will keep handy to share where my own jabbering on the subject is likely to be less clear and certainly less concise. The source of the article was (via re-post) the KDnuggets blog, which is an excellent resource.
There are, perhaps unsurprisingly, many good “nuggets” on the KDnuggets blog / web site. And this latest item does a good job of explaining what at some point becomes intuitive to people who work with machine learning models regularly. Perhaps this is particularly relevant to modeling and mining “text” (the work I have been doing in machine learning), because it certainly is spot on. And this is more a way of describing how the math models the real world, and how the data is reflected in the math, so I expect this view is likely helpful to anyone modeling data.
The somewhat “click-bait” sounding title, “4 Reasons Your Machine Learning Model is Wrong”, is only modestly apologized for with the “(and How to Fix It)” suffix, and it makes me worry that fake-aggressive, pretend-demeaning discourse could be among the worst forms of carry-over of 2016 into 2017.
I will instead remember that genuinely aggressive, demeaning discourse is worse… and continue to appreciate the sharing that sites like this do for the larger community.
Happy New Year!
Google posted information about TensorFlow, the open source release of a key bunch of machine learning tools, on their Google research blog here.
Given the great piles of multi-dimensional tables (or arrays) of data that machine learning typically involves, and (at least for us primitive users) the tremendous shovel work involved in massaging and pushing around these giant piles of data files (sorting out the arcane naming schemes devised to try to help with this is almost a worse problem itself), the appellation of “Tensor Flow” for a tool to help with this is at first blush very promising. That is, rather than just a library of mathematical algorithm implementations, I am expecting something that can help make the machine learning work itself more manageable.
I suspect that just figuring out what this is will cost me a few days… but I have much to learn.
As can be seen in the paper posted previously, I continue to find the “Confusion Matrix” an excellent tool for judging the performance of machine learning (or other) models designed to predict outcomes for cases where the true outcome can also be determined.
Even simple attempts at explanation sometimes fail (witness the Wikipedia entry), and since I find this so helpful in looking at machine learning model performance, as noted in the prior post, I thought I’d provide a brief example here (from the paper). I will also provide pointers to great web resources for understanding “Quadratic-Weighted Kappa”, a single descriptive statistic that is often used to quantify “inter-rater reliability” in a way that is more useful, or comprehensive, than mere “accuracy”, if less descriptive by nature than these lovely visual aids.
So here are two Confusion Matrices representing output from two different theoretical machine learning models:
The point of these two model performance diagrams was to show that while the two models have identical “accuracy” (or exact match rates between predicted and achieved output), the first model has a more balanced error distribution than the second. The second model has a “better” quadratic weighted kappa, but also demonstrates a consistent “over-scoring” bias. I think most people would agree that the first model is “better”, despite its (slightly) lower QWK.
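To make the comparison concrete, here is a minimal Python sketch using two hypothetical 4-point confusion matrices. These are invented numbers chosen for illustration, not the matrices from the paper, and the function names are my own. Rows are human scores and columns are model scores.

```python
# Hypothetical confusion matrices (NOT the ones from the paper).
# Rows = human score 1..4, columns = model score 1..4, 100 essays each.
MODEL_A = [  # errors spread evenly above and below the diagonal
    [16, 7, 2, 0],
    [7, 13, 5, 0],
    [2, 5, 13, 5],
    [0, 0, 5, 20],
]
MODEL_B = [  # every error is the model scoring one point too high
    [13, 12, 0, 0],
    [0, 12, 13, 0],
    [0, 0, 12, 13],
    [0, 0, 0, 25],
]


def accuracy(matrix):
    """Exact match rate: share of cases on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def quadratic_weighted_kappa(matrix):
    """1 - (weighted observed disagreement / weighted chance disagreement)."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2  # the usual 1/(k-1)^2 scaling cancels out
            observed += weight * matrix[i][j]
            expected += weight * row_totals[i] * col_totals[j] / total
    return 1.0 - observed / expected


for name, m in (("A", MODEL_A), ("B", MODEL_B)):
    print(name, round(accuracy(m), 3), round(quadratic_weighted_kappa(m), 3))
```

With these made-up numbers both models match the human score 62% of the time, yet model B earns a QWK of about 0.85 against 0.80 for model A, even though every one of B's errors is an over-score.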
And then lastly, the promised reference information for folks simply interested in getting a better grip on Quadratic-Weighted Kappa (excerpts from my paper posted earlier):
Two helpful web pages that can do quick Kappa calculations on data already prepared in the form of a Confusion Matrix can be found at http://www.marcovanetti.com/pages/cfmatrix and even more helpfully (this one includes quadratic-weighted kappa, and a host of other related calculations) at http://vassarstats.net/kappa.html.
Excellent and thorough definitions of Kappa, and its relevance for use in comparing two sets of outcomes for inter-rater reliability, can be found in many places. These range from simple, mechanical and statistical definitions (and some implicit assertions or assumptions that might be worth examination) to detailed examinations of various forms of Kappa (meaning also the linear, quadratic and other weightings that acknowledge the relationship between classification labels). They also cover the “chance-corrected” aspect of the calculation, independence assumptions and other factors that give more, or less, weight to the idea that QWK (or the intraclass correlation coefficient) is or is not a good measure for what we are trying to get at: the degree of fidelity between model outputs for a trained “scoring engine” and actual outputs from human judgment. See for example:
- http://kappa.chez-alice.fr/ and
Also worthwhile are the notes and discussion on the two Kappa calculation web pages / sites noted above.
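For reference, the weighted Kappa these pages compute is usually written as follows. The notation is my own choice rather than something taken from the sources above: O_ij is the count of cases scored i by one rater and j by the other, r_i and c_j are the row and column totals, N is the number of cases, and k is the number of score points.

```latex
\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, O_{ij}}
                    {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, E_{ij}},
\qquad
E_{ij} = \frac{r_i\, c_j}{N},
\qquad
w_{ij} =
\begin{cases}
  \dfrac{|i-j|}{k-1} & \text{(linear)} \\[2ex]
  \dfrac{(i-j)^2}{(k-1)^2} & \text{(quadratic)}
\end{cases}
```

The denominator is the “chance-corrected” part: it is the weighted disagreement you would expect if the two sets of scores were independent and shared only their marginal totals. Unweighted Kappa is the special case where w_ij is 0 on the diagonal and 1 everywhere else.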
The increasing desire for more authentic assessment, and for the assessment of higher order cognitive abilities, is leading to an increased focus on performance assessment and the measurement of problem-solving skills, among other changes, in large scale educational assessment.
Present practice in production scoring for constructed response assessment items, where student responses of one to several paragraphs are evaluated on well-defined rubrics by distributed teams of human scorers, currently yields in many cases results that are barely acceptable even for coarse-grained, single-dimension metrics. That is, even when scoring essays on a single, four to six point scale (as for example was done in the ASAP competition for automated essay scoring on Kaggle), human scorer inter-rater reliability is marginal (or at least, less reliable than might be expected), in the sense that inter-rater agreement rates ranged from 28 to 78%, with associated quadratic-weighted Kappas ranging from .62 to .85.
Said another way, about half the time, two raters, or human (averaged) scores and AI scoring engines, will yield the same result for a simple measure, while the rest of the time, the variation can be all over the map. Kappa doesn’t really tell us very much about this variation, which is a concern, because “better” (higher) Kappas might also mask abnormal or biased relationships, where models with slightly lower Kappas might, on examination, provide an intuitively more appealing result. And for scoring solutions that seek to use more detailed scoring rubrics, and to provide sub-scores and more nuanced feedback, while still solving for reliability and validity in overall scoring, the challenge of finding the “best” model for a given dataset will be even greater.
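One simple way to see the variation that Kappa hides is to tabulate the signed difference between model and human scores. The sketch below is only an illustration: the 3-point matrix is made up, and the function names are hypothetical.

```python
def signed_error_counts(matrix):
    """Count cases by (model score - human score).

    Rows are human scores, columns are model scores.
    """
    k = len(matrix)
    counts = {diff: 0 for diff in range(-(k - 1), k)}
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            counts[j - i] += n
    return counts


def mean_signed_error(matrix):
    """Average of (model - human); positive means over-scoring."""
    counts = signed_error_counts(matrix)
    total = sum(counts.values())
    return sum(diff * n for diff, n in counts.items()) / total


# Made-up example: 50 essays on a 3-point scale.
EXAMPLE = [
    [10, 5, 0],
    [1, 10, 6],
    [0, 2, 16],
]

print(signed_error_counts(EXAMPLE))
print(mean_signed_error(EXAMPLE))
```

In this invented example 11 essays are scored one point high and only 3 one point low, so the mean signed error is positive: a lean toward over-scoring that a single agreement statistic would not show.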
I have written a short paper that focuses solely on the problem of evaluating models that attempt to mimic human scores that are provided under the best of conditions (e.g. expert scorers not impacted by timing constraints), and addresses the question of how to define the “best” performing models. The aforementioned Kaggle competition chose Quadratic Weighted Kappa (QWK) as a means of measuring the conformance of scores reported by a model and scores assigned by human scorers. Other Kaggle competitions routinely use other metrics as well, while some critics of the ASAP competition in particular, and of the use of AES technology in general, have argued that other model performance metrics might be more appropriate.
[Update: at this point the paper simply illustrates why QWK is not by itself sufficient to definitively say one model is “better” than another by providing a counter example and some explanation of the problem.]
As a single descriptive statistic, QWK has inherent limits in describing the differences between two populations of results. Accordingly, this short note will present an example to illustrate the extent of these limitations. In short, I think that, at least for a two-way comparison between a set of results from human scorers and a set of results from a trained machine learning model trying to emulate human scorers, the basic “confusion matrix”, which shows a two dimensional grid with exact match results on the diagonal and non-exact matches as outliers, provides an unbeatable visualization of just how random, or not, the results of using a model might look against a set of “expert” measures.
Future efforts will consider suggested alternatives to QWK, or additional descriptive statistics that can be used in conjunction with QWK, hopefully leading to some more usable and “better” criteria for particular use cases and suggestions for further research.
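As a rough sketch of how such a grid can be built from paired scores, the following Python tallies a confusion matrix and prints it as plain text. The scores are invented and the helper names are hypothetical, not taken from the paper.

```python
def confusion_matrix(human, model, labels):
    """Rows are human scores, columns are model scores, in `labels` order."""
    if len(human) != len(model):
        raise ValueError("score lists must be the same length")
    index = {label: pos for pos, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for h, m in zip(human, model):
        matrix[index[h]][index[m]] += 1
    return matrix


def format_matrix(matrix, labels):
    """Plain-text grid; exact matches are wrapped in brackets."""
    lines = ["human\\model " + " ".join(f"{label:>5}" for label in labels)]
    for i, row in enumerate(matrix):
        cells = [f"[{n}]" if i == j else str(n) for j, n in enumerate(row)]
        lines.append(f"{labels[i]:>11} " + " ".join(f"{c:>5}" for c in cells))
    return "\n".join(lines)


# Invented scores for eight essays on a 4-point scale.
HUMAN = [1, 2, 2, 3, 3, 3, 4, 4]
MODEL = [1, 2, 3, 3, 3, 2, 4, 3]
LABELS = [1, 2, 3, 4]

GRID = confusion_matrix(HUMAN, MODEL, LABELS)
print(format_matrix(GRID, LABELS))
```

The bracketed diagonal cells are the exact matches; anything to the right of the diagonal is the model scoring higher than the human, anything to the left is the model scoring lower.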
Feedback welcome! Full document is linked here: SLFCR-scmv-140919a-all
See particularly the section entitled “Flawed Experimental Design I” in Les C. Perelman’s paper at http://journalofwritingassessment.org/article.php?article=69
As noted in the prior post, and in the piece by Steve Kolowich in the Chronicle of Higher Education, April 28, 2014, “Writing Instructor, Skeptical of Automated Grading, Pits Machine vs. Machine”, we read this:
In presentations, he likes to show how the Gettysburg Address would have scored poorly on the SAT writing test. (That test is graded by human readers, but Dr. Perelman says the rubric is so rigid, and time so short, that they may as well be robots.)
Which was delightful, since in this piece covering the well-honed and oft-reported voice of a leading critic of “automated essay scoring”, clearly in sympathy with the likes of another professor and his associates behind the petition at humanreaders.org, Dr. Perelman himself acknowledges the primary flaw in most large scale uses of AES technology in educational measurement today. The “production scoring techniques” actually employed for human scoring of essays are generally so far removed from all the arguments for “thoughtful, knowledgeable and nuanced” engagement in student writing required for useful feedback and genuine analysis that to argue that using “AI” (aka advances in data science and machine learning) will deprive students of this necessary human care and consideration is, well, pretending that care and consideration is a hallmark of current practice and student experience.
Or said another way, this position exactly sets up the proper consideration of where and how to use “constructed response” assessment items, and how to score them (with humans, AES technology or some combination) in a way that is best for teaching & learning, and for the steady practice of critical thinking and of the craft of writing. [On second thought, where and when to use “authentic” assessment items should be, and is, basically “always”… Trouble is, as is noted elsewhere, most “essay questions” are so poor that, with generic rubrics and half a dozen or more subjective criteria collapsed into a small set of dimensions and score points, human inter-rater reliability for anything remotely “realistic” shows that the measures are often more noise than signal… but this is another bit for another day.]
Now this comment by Dr. Perelman is not a direct quote, nor could I find anything in this Chronicle piece, the Perelman paper or even in the 2005 NYT article referenced by Kolowich (by Michael Winerip) related to the “Gettysburg Address” comment specifically, but I think the essence of the complaint is extremely valid and worth further consideration.
It is one I have previously cast as: Just how meaningful can an “essay score” be when the scorer, whatever the rubric, is grading 32 tests an hour? The “they might as well be robots” formulation stated by Kolowich to summarize Perelman’s main point seems to be saying the same thing, and it would be interesting to understand this more deeply. Is the thought that the score assigned by high speed “human robots” would have to be tied to simple surface features, mechanics and vocabulary, things that can be evaluated almost at a glance? Or that the rubrics, and number of score points and dimensions, don’t allow for any real nuance (or consistency) even if more time and thought were brought to bear on scoring? Or both? Or something else?
It would have been very interesting to hear more on this point, rather than simply hear yet more (well deserved) rage at typical automated essay scoring in general. Arguments don’t need to be completely theoretical, however (e.g. AES engines today do not do the work the way humans do), and lacking in nuance and texture, because once you get past “computers don’t think”, the better, more cogent point about “the lack of construct relevance” is poorly understood and usually not even stated. For now.