TensorFlow is released: Google Machine Learning for Everyone

Google posted information about TensorFlow, the open-source release of a key set of its machine learning tools, on the Google Research blog here.

Machine learning typically involves great piles of multi-dimensional tables (or arrays) of data, and (at least for us primitive users) tremendous shovel work in massaging and pushing around these giant piles of data files. Sorting out the arcane naming schemes devised to help with this problem is almost a worse problem in itself. Given all that, the appellation of “Tensor Flow” for a tool to help with this is at first blush very promising. That is, rather than just a library of mathematical algorithm implementations, I am expecting something that can help make the machine learning work itself more manageable.

I suspect that just figuring out what this is will cost me a few days… but I have much to learn.



The Confusion Matrix: Excellent Tool

As can be seen in the paper posted previously, I continue to find the “Confusion Matrix” an excellent tool for judging the performance of machine learning (or other) models designed to predict outcomes for cases where the true outcome can also be determined.

Even simple attempts at explanation sometimes fail (witness the Wikipedia entry), and since I find this so helpful in looking at machine learning model performance, as noted in the prior post, I thought I’d provide a brief example here (from the paper). I will also provide pointers to great web resources for understanding “Quadratic-Weighted Kappa”, a single descriptive statistic that is often used to quantify inter-rater reliability in a way that is more useful, or comprehensive, than mere “accuracy”, if less descriptive by nature than these lovely visual aids.

So here are two Confusion Matrices representing output from two different theoretical machine learning models:

[Figure: Sample Confusion Matrix for Model A]

[Figure: QWK-ex-B]

The point of these two model performance diagrams was to show that while the two models have identical “accuracy” (or exact match rates between predicted and achieved output), the first model has a more balanced error distribution than the second.  The second model has a “better” quadratic weighted kappa, but also demonstrates a consistent “over-scoring” bias.  I think most people would agree that the first model is “better”, despite its (slightly) lower QWK.
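For readers who want to try this themselves, here is a minimal Python sketch of the calculation. The two 3x3 matrices are made up for illustration (they are not the matrices from the paper), rows are assumed to be human scores and columns model scores, and the function names are my own. It uses the usual quadratic weights, (i - j)^2 / (k - 1)^2, and compares observed weighted disagreement with the disagreement expected by chance from the row and column totals.

```python
def accuracy(cm):
    """Exact-match rate: diagonal count divided by total count."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def quadratic_weighted_kappa(cm):
    """QWK computed directly from a square confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed += weight * cm[i][j]
            # Chance-expected count for cell (i, j) from the marginals.
            expected += weight * row_tot[i] * col_tot[j] / n
    return 1.0 - observed / expected


def mean_signed_error(cm):
    """Average of (model score - human score); positive means over-scoring."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    return sum((j - i) * cm[i][j] for i in range(k) for j in range(k)) / n


# Hypothetical data: rows = human score, columns = model score.
balanced = [[10, 5, 0],
            [5, 10, 5],
            [0, 5, 10]]
over_scoring = [[10, 5, 0],
                [0, 5, 15],
                [0, 0, 15]]

for name, cm in (("balanced", balanced), ("over-scoring", over_scoring)):
    print(name,
          "accuracy=%.3f" % accuracy(cm),
          "qwk=%.3f" % quadratic_weighted_kappa(cm),
          "bias=%.2f" % mean_signed_error(cm))
```

With these made-up numbers both matrices have the same 60% exact-match rate, yet the over-scoring one gets the higher QWK (about 0.714 versus 0.667). The signed-error figure is one simple way to expose the bias that the single kappa value hides.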




And then lastly, the promised reference information for folks simply interested in getting a better grip on Quadratic-Weighted Kappa (excerpts from my paper posted earlier):


Two helpful web pages that can do quick Kappa calculations on data already prepared in the form of a Confusion Matrix can be found at http://www.marcovanetti.com/pages/cfmatrix and even more helpfully (this one includes quadratic-weighted kappa, and a host of other related calculations) at http://vassarstats.net/kappa.html.

Excellent and thorough definitions of Kappa, and its relevance for use in comparing two sets of outcomes for inter-rater reliability, can be found in many places. These range from simple, mechanical and statistical definitions (and some implicit assertions or assumptions that might be worth examination) to detailed examinations of various forms of Kappa (meaning the linear, quadratic and other weightings that acknowledge the relationship between classification labels). Some look specifically at the “chance-corrected” aspect of the calculation, independence assumptions and other factors that give more, or less, weight to the idea that QWK (or the intraclass correlation coefficient) is or is not a good measure for what we are trying to get at: the degree of fidelity between model outputs for a trained “scoring engine” and actual outputs from human judgment. See for example:

Also worthwhile are the notes and discussion on the two Kappa calculation web pages / sites noted above.

Scoring Long-form Constructed Response: Statistical Challenges in Model Validation

The increasing desire for more authentic assessment, and for the assessment of higher order cognitive abilities, is leading to an increased focus on performance assessment and the measurement of problem-solving skills, among other changes, in large scale educational assessment.

Present practice in production scoring for constructed response assessment items, where student responses of one to several paragraphs are evaluated on well-defined rubrics by distributed teams of human scorers, currently yields (in many cases) results that are barely acceptable even for coarse-grained, single-dimension metrics. That is, even when scoring essays on a single, four to six point scale (as for example was done in the ASAP competition for automated essay scoring on Kaggle[1]), human scorer inter-rater reliability is marginal (or at least, less reliable than might be expected), in the sense that inter-rater agreement rates ranged from 28 to 78%, with associated quadratic-weighted Kappas ranging from .62 to .85.

Said another way, about half the time, two raters, or human (averaged) scores and AI scoring engines, will yield the same result for a simple measure, while the rest of the time, the variation can be all over the map. Kappa doesn’t really tell us very much about this variation, which is a concern, because “better” (higher) Kappas might also mask abnormal or biased relationships, where models with slightly lower Kappas might, on examination, provide an intuitively more appealing result. And for scoring solutions that seek to use more detailed scoring rubrics, and to provide sub-scores and more nuanced feedback, while still solving for reliability and validity in overall scoring, the challenge of finding the “best” model for a given dataset will be even greater.

I have written a short paper that focuses solely on the problem of evaluating models that attempt to mimic human scores that are provided under the best of conditions (e.g. expert scorers not impacted by timing constraints), and addresses the question of how to define the “best” performing models.  The aforementioned Kaggle competition chose Quadratic Weighted Kappa (QWK) as a means of measuring the conformance of scores reported by a model and scores assigned by human scorers. Other Kaggle competitions routinely use other metrics as well[2], while some critics of the ASAP competition in particular, and of the use of AES technology in general, have argued that other model performance metrics might be more appropriate[3].

[Update: at this point the paper simply illustrates why QWK is not by itself sufficient to definitively say one model is “better” than another, by providing a counterexample and some explanation of the problem.]

As a single descriptive statistic, QWK has inherent limits in describing the differences between two populations of results. Accordingly, this short note will present an example to illustrate the extent of these limitations. In short, I think that, at least for a two-way comparison between a set of results from human scorers and a set of results from a trained machine learning model trying to emulate human scorers, the basic “confusion matrix” provides an unbeatable visualization of just how random, or not, the results of using a model might look against a set of “expert” measures. It shows a two-dimensional grid with exact match results on the diagonal and non-exact matches as outliers away from it.
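As a rough illustration of that grid, here is a small Python sketch that tallies paired scores into a confusion matrix. The score lists and the 1 to 4 scale are invented for the example, the function name is hypothetical, and I am assuming rows hold the human score and columns the model score.

```python
def build_confusion_matrix(human, model, min_score, max_score):
    """Count (human, model) score pairs into a square grid.

    Rows are human scores, columns are model scores, so exact
    matches land on the diagonal.
    """
    if len(human) != len(model):
        raise ValueError("score lists must be the same length")
    size = max_score - min_score + 1
    grid = [[0] * size for _ in range(size)]
    for h, m in zip(human, model):
        grid[h - min_score][m - min_score] += 1
    return grid


# Made-up scores on a 1-4 scale, for illustration only.
human_scores = [1, 2, 2, 3, 4, 4, 3, 2, 1, 3]
model_scores = [1, 2, 3, 3, 4, 3, 3, 2, 2, 3]

grid = build_confusion_matrix(human_scores, model_scores, 1, 4)
exact_matches = sum(grid[i][i] for i in range(len(grid)))

# Print the grid with the human score as the row label.
for score, row in enumerate(grid, start=1):
    print(score, row)
print("exact matches:", exact_matches, "of", len(human_scores))
```

Reading across a row shows how the model scored everything a human gave that score, so a cluster above or below the diagonal is visible at a glance.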

Future efforts will consider suggested alternatives to QWK, or additional descriptive statistics that can be used in conjunction with QWK, hopefully leading to some more usable and “better” criteria for particular use cases and suggestions for further research.

Feedback welcome!  Full document is linked here: SLFCR-scmv-140919a-all



[1] See https://www.kaggle.com/c/asap-aes

[2] See https://www.kaggle.com/wiki/Metrics

[3] See particularly the section entitled “Flawed Experimental Design I” from Les C. Perelman’s paper at http://journalofwritingassessment.org/article.php?article=69


Learning from Auto-complete in Google Search

Despite all the various scary and crazy things people find in Google Search’s auto-complete, more often than not I get one overwhelming message.

So last night, I started a search with “[w s v] = svd(…” and had not gotten very far when autocomplete said:

[w s v] = svd((repmat(sum(x.*x 1) size(x 1) 1).*x)*x’)

Which is completely crazy! Or not. It means many people got to the same point in Andrew Ng’s machine learning lecture, saw the same example on a slide (a one-line MATLAB code example) and went searching for it.

Interestingly, one of the first hits was on StackOverflow (meaning programmers were talking about this).


So while others may be creeped out, my reaction remains:  “you are not alone” 🙂

So Many Courses, So Little Time

Advancement in technology related to most everything I work on is a constant, and it is a blessing and a curse that now, in addition to buying too many unread books, it is possible to sign up for courses I will never complete.  Thankfully Udacity has begun charging for courses, so I am looking forward to their new Hadoop course starting in January (called Introduction to Hadoop and MapReduce) which doubtless I will fully complete (since I will be paying for it?).  Well at least I can test that theory.

Not to mention, how can you take someone seriously who has not at least tried Hadoop, or cannot explain MapReduce?

Still looking for a great hands-on Machine Learning class that uses LightSide from LightSideLabs, but I may end up having to succumb to one of the many MATLAB / Octave based courses. EdX and Andrew Ng have one (each?), and old videos and lecture notes are still to be found, both here at https://www.coursera.org/course/ml and http://cs229.stanford.edu/, and then again also here http://www.ml-class.org.   [Update: the old ml-class link now redirects to the coursera class. No surprise.]

Meanwhile, I can still buy the books and read them. Or not. Only now I can buy them for both Kindle and in “real” form, so I can not read them twice!

That’s progress!

iPad Educational App Tutorial: Prequel

How To Create Your Own iPad Learning App

With this post I am going to begin a 9 part tutorial on creating a simple iPad learning app.

I created my first iPad app to help me learn Chinese characters (Hanzi).  As a student of the language, I find it helpful to go after this ambitious learning project from many directions.  Language study involves speaking, listening, reading and writing, and each can be daunting — even without the challenge of learning a language that is tonal (if yours is not) or uses a completely different writing system.  As an advanced beginner, I have some basic vocabulary, and having studied reading and writing, I have a basic knowledge of strokes and characters, but this (along with pronunciation) is my weakest area. Hence this App.

My App will be a relatively simple one, something that uses a “game board” with “word tiles” and “picture tiles”.  A flash-card type prompt will require selection of a corresponding word or image, and there should be “hints” to help reinforce the learning. In fact this topic (learning characters) is somewhat daunting, so I have also written a companion book for study.

While the App is relatively simple, it is not a “Hello World” app or something a first time programmer or even a novice on this Platform will want to use to learn Xcode, to understand what UI Kit is, and to learn how to build and deploy an app to the store.  For these basics, I highly recommend a good beginning book such as “iPad Apps for Dummies” and the like.  Here is a link to a page that lists the basic books I used to get started — where the books walk you through step by step how to use Xcode, how to become an Apple Developer if you want to sell your Apps, and how to create, compile, debug and test a basic App on both your device and in the simulator.  This tutorial is not intended to replace or compete with these materials.  Finding a good, working example complete with source code and an explanation, as several others have done (see the links on this blog), is far and away the most useful sort of training I have found once I got past the basics.

Back to the basic choice of how to create an App for iPad: what sort of ‘instructional designs’ should we consider?  If you’ve spent any time at all looking at the zillions of apps out there, it is clear there are many choices and approaches to learning software and games.  For learning to recognize Hanzi, I developed a learning design that was a hybrid of “flash cards” and the familiar “memory” game.  I also wanted a design that lent itself to pairs of users who were helping each other, so that a Chinese speaker could practice English, and an English speaker could practice recognizing Chinese.  I was also deliberately laying the groundwork for apps for learning words in other languages and for speakers of other languages. More about my specific instructional design is here.

As this App was my first commercial iPad product, I think it is suitable for other newbies learning the Xcode, iOS, Objective-C platform.  And for this reason the basic application design pattern selected was a) a “View” app, or one with a single playing surface (and pop-over windows for things like help, instructions, some settings, etc.) and b) based on “UIKit” or simple user interaction controls.   Programming using UIKit and Apple’s built-in User Interface Controls is the most basic way to interact with iOS users; other approaches include “2D graphics”, as in pinball / arcade style games and “jumper” games where a game figure is moved through “levels” and around the screen (in 2D or 3D).  I have an arcade-style game in progress that uses the Cocos2D graphics package libraries, as well as the Box2D physics package (so that game elements can interact with each other the way physical objects do, without each developer having to invent the code for gravity, for bouncing off surfaces, for collision detection, and whatnot).  There are some good tutorials for these sorts of games out there; see the “blog roll” and sites I link to from this blog for good examples. Such games are a good bit more complex, so doing a simple UI Kit based app is most often where new developers start.

The app I will cover in this tutorial is already in the store — you can find it as “Chinese Words 4 Beginners” at this LINK.  I will also make arrangements to have the source code available for interested learners.  The plan for the tutorial is something like this:

  1. Structure of a UI Kit-based View app
  2. Laying out the game board
  3. Creating and using Data
  4. Creating and using images
  5. Controlling Game play
  6. Adding scoring and top scorers
  7. Adding help and demo video
  8. Internationalization
  9. Gestures, animation, application settings

If you have any questions or specific suggestions before I start, feel free to send me a note.  A link to info about this app specifically is located here.  [You will also find a low-res PDF file of the companion book for learning Chinese characters at this page.]  A link to the web site for this entire family of Apps — word recognition and learning for Japanese, Korean and Chinese for English, French, Spanish, Italian, and German speakers, and in fact every combination of any two of these languages, is located here.