The Economist ran one of their occasional EdTech updates; two titles that caught my eye:
- Summary bit: Together, technology and teachers can revamp schools
- copy of this article here
- And Technology is transforming what happens when a child goes to school
- copy of this article here
Two popular recent articles on applications of machine learning:
- Million-Dollar Prize Hints at How Machine Learning May Someday Spot Cancer (Will Knight, May 9, 2017)
- a copy of this article is here
- Machine-learning promises to shake up large swathes of finance (The Economist, May 28, 2017)
- a copy of the article is here
Google posted information about TensorFlow, their open source release of a key bunch of machine learning tools, on the Google research blog here.
Machine learning typically involves great piles of multi-dimensional tables (or arrays) of data, and (at least for us primitive users) tremendous shovel work in massaging and pushing around these giant piles of data files. Sorting out the arcane naming schemes devised to try to help with this is almost a worse problem in itself.
Given all that, the appellation of “Tensor Flow” for a tool to help with this is at first blush very promising. That is, rather than just a library of mathematical algorithm implementations, I am expecting something that can help make the machine learning work itself more manageable.
I suspect that just figuring out what this is will cost me a few days… but I have much to learn.
As can be seen in the paper posted previously, I continue to find the “Confusion Matrix” an excellent tool for judging the performance of machine learning (or other) models designed to predict outcomes for cases where the true outcome can also be determined.
Even simple attempts at explanation sometimes fail (witness the Wikipedia entry). Since I find the matrix so helpful in looking at Machine Learning model performance, as noted in the prior post, I thought I’d provide a brief example here (from the paper), along with pointers to great web resources for understanding “Quadratic-Weighted Kappa”: a single descriptive statistic that is often used to quantify “inter-rater reliability” in a way that is more useful, or comprehensive, than mere “accuracy”, if less descriptive by nature than these lovely visual aids.
So here are two Confusion Matrices representing output from two different theoretical machine learning models:
The point of these two model performance diagrams was to show that while the two models have identical “accuracy” (or exact match rates between predicted and achieved output), the first model has a more balanced error distribution than the second. The second model has a “better” quadratic weighted kappa, but also demonstrates a consistent “over-scoring” bias. I think most people would agree that the first model is “better”, despite its (slightly) lower QWK.
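To make the comparison concrete, here is a minimal Python sketch. The two 4x4 matrices are made up for illustration (they are not the ones from my paper), and I am assuming the common convention that rows are human scores and columns are model scores. The function names are my own, and the QWK calculation follows the usual textbook definition: one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement.

```python
# Hypothetical confusion matrices: rows = human score, columns = model score.
# Both have 60 exact matches out of 100 cases and the same human score totals.
MODEL_A = [  # errors are symmetric around the diagonal
    [16, 8, 0, 0],
    [8, 12, 6, 0],
    [0, 6, 15, 6],
    [0, 0, 6, 17],
]
MODEL_B = [  # every error is an over-score by one point
    [14, 10, 0, 0],
    [0, 12, 14, 0],
    [0, 0, 11, 16],
    [0, 0, 0, 23],
]


def accuracy(cm):
    """Share of cases on the diagonal (exact matches)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def quadratic_weighted_kappa(cm):
    """QWK = 1 - (weighted observed disagreement / weighted expected disagreement)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            observed += weight * cm[i][j]
            # Count expected by chance if the two raters were independent.
            expected += weight * row_sums[i] * col_sums[j] / total
    return 1.0 - observed / expected


def mean_signed_error(cm):
    """Average of (model score - human score); positive means over-scoring."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    return sum((j - i) * cm[i][j] for i in range(k) for j in range(k)) / total


for name, cm in (("Model A", MODEL_A), ("Model B", MODEL_B)):
    print(
        f"{name}: accuracy={accuracy(cm):.2f} "
        f"QWK={quadratic_weighted_kappa(cm):.3f} "
        f"bias={mean_signed_error(cm):+.2f}"
    )
```

With these made-up numbers both models match the human score 60% of the time, and Model B edges out Model A on QWK (roughly 0.84 versus 0.83), yet Model B over-scores by 0.4 points on average while Model A shows no net bias. That is the same pattern as in the diagrams: the single statistic prefers the model most of us would trust less.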
And then lastly, the promised reference information for folks simply interested in getting a better grip on Quadratic-Weighted Kappa (excerpts from my paper posted earlier):
Two helpful web pages that can do quick Kappa calculations on data already prepared in the form of a Confusion Matrix can be found at http://www.marcovanetti.com/pages/cfmatrix and even more helpfully (this one includes quadratic-weighted kappa, and a host of other related calculations) at http://vassarstats.net/kappa.html.
Excellent and thorough definitions of Kappa, and its relevance for use in comparing two sets of outcomes for inter-rater reliability, can be found in many places. These range from simple, mechanical and statistical definitions (and some implicit assertions or assumptions that might be worth examination) to detailed examinations of various forms of Kappa (meaning also the linear and quadratic and other acknowledgements of the relationship between classification labels or classifications) and specifically the “chance-corrected” aspect of the calculation, independence assumptions and other factors that give more, or less, weight to the idea that QWK (or intraclass correlation coefficient) is or is not a good measure for what we are trying to get at – the degree of fidelity between model outputs for a trained “scoring engine” and actual outputs from human judgment. See for example:
- http://kappa.chez-alice.fr/ and
Also worthwhile are the notes and discussion on the two Kappa calculation web pages / sites noted above.
The increasing desire for more authentic assessment, and for the assessment of higher order cognitive abilities, is leading to an increased focus on performance assessment and the measurement of problem-solving skills, among other changes, in large scale educational assessment.
Present practice in production scoring for constructed response assessment items, where student responses of one to several paragraphs are evaluated on well-defined rubrics by distributed teams of human scorers, yields in many cases results that are barely acceptable even for coarse-grained, single-dimension metrics. That is, even when scoring essays on a single, four to six point scale (as for example was done in the ASAP competition for automated essay scoring on Kaggle), human scorer inter-rater reliability is marginal (or at least, less reliable than might be expected), in the sense that inter-rater agreement rates ranged from 28 to 78%, with associated quadratic-weighted Kappas ranging from .62 to .85.
Said another way, about half the time, two raters, or human (averaged) scores and AI scoring engines, will yield the same result for a simple measure, while the rest of the time, the variation can be all over the map. Kappa doesn’t really tell us very much about this variation, which is a concern, because “better” (higher) Kappas might also mask abnormal or biased relationships, where models with slightly lower Kappas might, on examination, provide an intuitively more appealing result. And for scoring solutions that seek to use more detailed scoring rubrics, and to provide sub-scores and more nuanced feedback, while still solving for reliability and validity in overall scoring, the challenge of finding the “best” model for a given dataset will be even greater.
I have written a short paper that focuses solely on the problem of evaluating models that attempt to mimic human scores that are provided under the best of conditions (e.g. expert scorers not impacted by timing constraints), and addresses the question of how to define the “best” performing models. The aforementioned Kaggle competition chose Quadratic Weighted Kappa (QWK) as a means of measuring the conformance of scores reported by a model and scores assigned by human scorers. Other Kaggle competitions routinely use other metrics as well, while some critics of the ASAP competition in particular, and of the use of AES technology in general, have argued that other model performance metrics might be more appropriate.
[Update: at this point the paper simply illustrates why QWK is not by itself sufficient to definitively say one model is “better” than another by providing a counter example and some explanation of the problem.]
As a single descriptive statistic, QWK has inherent limits in describing the differences between two populations of results. Accordingly, this short note will present an example to illustrate the extent of these limitations. In short, I think that, at least for a two-way comparison between a set of results from human scorers and a set of results from a trained machine learning model trying to emulate human scorers, the basic “confusion matrix” that shows a two-dimensional grid with exact match results on the diagonal, and non-exact matches as outliers, provides an unbeatable visualization of just how random, or not, the results of using a model might look against a set of “expert” measures.
Future efforts will consider suggested alternatives to QWK, or additional descriptive statistics that can be used in conjunction with QWK, hopefully leading to some more usable and “better” criteria for particular use cases and suggestions for further research.
Feedback welcome! Full document is linked here: SLFCR-scmv-140919a-all
See particularly the section entitled “Flawed Experimental Design I” in Les C. Perelman’s paper at http://journalofwritingassessment.org/article.php?article=69
Despite all the various scary and crazy things people find in Google Search’s auto-complete, more often than not I get one overwhelming message.
So last night, when I started a search with “[w s v] = svd(…”
I’d not gotten very far when autocomplete said:
[w s v] = svd((repmat(sum(x.*x 1) size(x 1) 1).*x)*x’)
Interestingly, one of the first hits was on StackOverflow (meaning programmers were talking about this) at
So while others may be creeped out, my reaction remains: “you are not alone” 🙂
Advancement in technology related to most everything I work on is a constant, and it is a blessing and a curse that now, in addition to buying too many unread books, it is possible to sign up for courses I will never complete. Thankfully Udacity has begun charging for courses, so I am looking forward to their new Hadoop course starting in January (called Introduction to Hadoop and MapReduce) which doubtless I will fully complete (since I will be paying for it?). Well at least I can test that theory.
Not to mention, how can you take someone seriously who has not at least tried Hadoop, or cannot explain MapReduce?
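For anyone who wants to pass the "can explain MapReduce" test before the course starts, here is a toy sketch of the idea in plain Python. It is the classic word count, run in a single process. It does not use Hadoop or any of its APIs, and the function names are my own; it only illustrates the three steps (map, shuffle, reduce) that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # In a real cluster each document (or file split) would be mapped on a
    # different worker; here everything is simply done in a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])
```

The point of the split is that the map and reduce steps each look at one document or one key at a time, so they can run in parallel without knowing about each other; only the shuffle step needs to move data between workers.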
Still looking for a great hands-on Machine Learning class that uses LightSide from LightSideLabs, but I may end up having to succumb to one of the many MATLAB / OCTAVE based courses. EdX, and Andrew Ng, have one (each?), with old videos and lecture notes still to be found, both here at https://www.coursera.org/course/ml and http://cs229.stanford.edu/, and then again also here http://www.ml-class.org. [Update: the old ml-class link now redirects to the coursera class. No surprise.]
Meanwhile, I can still buy the books and read them. Or not. Only now I can buy them for both Kindle and in “real” form, so I can not read them twice!