The Confusion Matrix: Excellent Tool

As can be seen in the paper posted previously, I continue to find the “Confusion Matrix” an excellent tool for judging the performance of machine learning (or other) models designed to predict outcomes for cases where the true outcome can also be determined.

Even simple attempts at explanation sometimes fail (witness the Wikipedia entry). Since I find the Confusion Matrix so helpful in looking at Machine Learning model performance, as noted in the prior post, I thought I’d provide a brief example here (from the paper), along with pointers to great web resources for understanding “Quadratic-Weighted Kappa”. QWK is a single descriptive statistic often used to quantify “inter-rater reliability” in a way that is more useful, or comprehensive, than mere “accuracy”, if less descriptive by nature than these lovely visual aids.

So here are two Confusion Matrices representing output from two different theoretical machine learning models:

[Figure: Sample Confusion Matrix for Model A]

[Figure: QWK-ex-B2]

The point of these two model performance diagrams is to show that while the two models have identical “accuracy” (the exact-match rate between predicted and actual outcomes), the first model has a more balanced error distribution than the second. The second model has a “better” quadratic-weighted kappa, but it also demonstrates a consistent “over-scoring” bias. I think most people would agree that the first model is “better”, despite its (slightly) lower QWK.
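For readers who want to reproduce this kind of comparison, here is a minimal Python sketch. The two 3x3 matrices are hypothetical numbers made up for illustration, not the matrices from the paper, and the layout (rows = human score, columns = model score) and the function names are assumptions of the sketch.

```python
def accuracy(cm):
    """Exact-match rate: diagonal total divided by the grand total."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def quadratic_weighted_kappa(cm):
    """QWK for a square confusion matrix (rows = human, columns = model)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared distance between labels.
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed += weight * cm[i][j]
            # Count expected by chance if the two sets of scores were independent.
            expected += weight * row_sums[i] * col_sums[j] / total
    return 1.0 - observed / expected


def over_under(cm):
    """Number of cases the model scored above / below the human score."""
    k = len(cm)
    over = sum(cm[i][j] for i in range(k) for j in range(k) if j > i)
    under = sum(cm[i][j] for i in range(k) for j in range(k) if j < i)
    return over, under


# Hypothetical data: same human score distribution, same number of exact matches.
MODEL_A = [[20, 4, 1],
           [5, 20, 5],
           [1, 4, 20]]   # errors split evenly above and below the diagonal
MODEL_B = [[15, 10, 0],
           [0, 20, 10],
           [0, 0, 25]]   # every error is an over-score

for name, cm in (("A", MODEL_A), ("B", MODEL_B)):
    print(name, f"accuracy={accuracy(cm):.3f}",
          f"qwk={quadratic_weighted_kappa(cm):.3f}",
          f"over/under={over_under(cm)}")
```

With these made-up numbers both models match the human score exactly 75% of the time. Model B's QWK comes out at 0.800 against roughly 0.745 for Model A, yet all 20 of B's misses are over-scores while A's split 10 and 10, which is the pattern a single summary statistic hides and the matrix makes obvious.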




Lastly, here is the promised reference information for folks simply interested in getting a better grip on Quadratic-Weighted Kappa (excerpts from my paper posted earlier):


Two helpful web pages that can do quick Kappa calculations on data already prepared in the form of a Confusion Matrix can be found at and even more helpfully (this one includes quadratic-weighted kappa, and a host of other related calculations) at

Excellent and thorough definitions of Kappa, and of its relevance for comparing two sets of outcomes for inter-rater reliability, can be found in many places. These range from simple, mechanical and statistical definitions (with some implicit assertions or assumptions that might be worth examination) to detailed examinations of the various forms of Kappa, including the linear, quadratic and other weightings that acknowledge the relationship between classification labels. The detailed treatments specifically cover the “chance-corrected” aspect of the calculation, independence assumptions and other factors that give more, or less, weight to the idea that QWK (or the intraclass correlation coefficient) is or is not a good measure for what we are trying to get at: the degree of fidelity between the outputs of a trained “scoring engine” and actual outputs from human judgment. See for example:

Also worthwhile are the notes and discussion on the two Kappa calculation web pages / sites noted above.
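For quick reference, the “chance-corrected” calculation and the linear and quadratic weightings mentioned above can be written out as follows. This is the standard textbook form of weighted kappa; the symbols are notation chosen for this sketch: O_ij is the observed count in cell (i, j) of a k-by-k Confusion Matrix with labels indexed 1 to k, r_i and c_j are the row and column totals, and N is the total number of cases.

```latex
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
E_{ij} = \frac{r_i\, c_j}{N}

% Disagreement weights for the common variants:
w_{ij}^{\text{unweighted}} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}
\qquad
w_{ij}^{\text{linear}} = \frac{|i - j|}{k - 1}
\qquad
w_{ij}^{\text{quadratic}} = \frac{(i - j)^2}{(k - 1)^2}
```

The E_ij term is the chance correction: it is the count each cell would be expected to hold if the two sets of scores were independent, so kappa is 1 for perfect agreement and 0 when agreement is no better than chance.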
