Assessment in the Classroom

I am seeing more articles about different kinds of assessment, from performance-based to multi-stage, and more about getting teachers up to speed on the basics of assessment.  I am not sure I understand, or would prioritize, this “new learning” for teachers across much of K12, or at least in grades 7 to 12, where subject matter expertise in areas like STEM (or new offerings in STEM), critical thinking and problem solving (not to mention working with algorithms and data) would get my vote for more teacher-focused action. But then, a basic grounding in the framework of “academic measurement”, without the full crush of statistics / psychometrics, but with the wisdom and practice that has evolved since at least the time of Alfred Binet, would be a valuable thing where it is missing… but don’t they teach that in “education schools”?

Cambridge Assessment’s blog has a piece on the thinking, about which I am still thinking:

Meanwhile, thinking of all things UK, I am slowly digesting the coming changes to A Levels. What struck me was not so much the noise around how to communicate “scoring changes”, but rather how disappointing it was to see “critical thinking” on the list of discontinued A-Level exams.  I understand the advantage of fewer, better tests, but seeing room on the list going forward for “ancient languages”, “classical civilisation”, ancient history, government, geology, design and technology, electronics, film studies, etc., it seems a shame…  At least I was still able to buy “Thinking Skills” by John Butterworth and Geoff Thwaites (US Amazon link).  I will work hard to fit a careful read of this into my schedule before too many weeks pass.


Machine Learning for Text in the News (again): Finance

A short but interesting piece in The Economist this week, entitled “Machine-learning promises to shake up large swathes of finance”, appeared under a heading of “Unshackled algorithms” (located here).

Many of the usual observations and platitudes are contained therein, but I thought these quotes were notable:

  • Natural-language processing, where AI-based systems are unleashed on text, is starting to have a big impact in document-heavy parts of finance. In June 2016 JPMorgan Chase deployed software that can sift through 12,000 commercial-loan contracts in seconds, compared with the 360,000 hours it used to take lawyers and loan officers to review the contracts. [So maybe once again I am focused on one of the least remunerative aspects of a new technology…]
  • Perhaps the newest frontier for machine-learning is in trading, where it is used both to crunch market data and to select and trade portfolios of securities. The quantitative-investment strategies division at Goldman Sachs uses language processing driven by machine-learning to go through thousands of analysts’ reports on companies. It compiles an aggregate “sentiment score” based on the balance of positive to negative words. [Seems a bit simplistic, no?]

  • In other fields, however, machine-learning has game-changing potential. There is no reason to expect finance to be different. According to Jonathan Masci of Quantenstein, a machine-learning fund manager, years of work on rules-based approaches in computer vision—telling a computer how to recognise a nose, say— were swiftly eclipsed in 2012 by machine-learning processes that allowed computers to “learn” what a nose looked like from perusing millions of nasal pin-ups. Similarly, says Mr Masci, a machine-learning algorithm ought to beat conventional trading strategies based on rules set by humans. [This data point parallels, over the same timeframe, Elijah Mayfield’s demonstration that off-the-shelf, open source machine learning could, with days of work, produce results (for scoring essays) competitive with the capabilities of decades-old rule-based systems (e-rater, Intelligent Essay Assessor and six others). See note below.]
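As a rough illustration of the “balance of positive to negative words” idea in the Goldman Sachs quote above, here is a minimal Python sketch. The word lists, function names and sample sentences are all hypothetical, invented for illustration; nothing here reflects how the actual system works.

```python
import re

# Hypothetical word lists, for illustration only; a real system would use a
# much larger, finance-specific lexicon.
POSITIVE = {"beat", "growth", "strong", "upgrade", "outperform"}
NEGATIVE = {"miss", "weak", "decline", "downgrade", "risk"}

def sentiment_score(text):
    """Return (pos - neg) / (pos + neg), in [-1, 1]; 0.0 if no lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def aggregate_score(reports):
    """Average the per-report scores across a collection of reports."""
    scores = [sentiment_score(r) for r in reports]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up analyst sentences.
reports = [
    "Strong growth this quarter; we upgrade to outperform.",
    "Margins are weak and we see risk of further decline.",
    "Revenue beat expectations, but guidance was weak.",
]
print(aggregate_score(reports))
```

Note that pure word counting scores a phrase like “not strong” as fully positive, which is one reason the approach seems simplistic.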


I would also note that such “supervised learning” applications, which leverage NLP tools (natural-language processing tools, which are used in, but are not by themselves good examples of, AI techniques), are now a standard “first stage” of machine learning that typically evolves toward some form of neural network-based improvement, just as the “computer vision” example noted above did in subsequent iterations over the last five-plus years.

Good stuff.

For the Elijah Mayfield reference, see:

  • Mayfield, E., & Rosé, C. P. (2013). LightSIDE: Open Source Machine Learning for Text Accessible to Non-Experts. Invited chapter in the Handbook of Automated Essay Grading.
  • Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Annual National Council on Measurement in Education Meeting, March 29, 2012, pg. 1-54.

Critical Thinking Assessment

Often, in the context of large scale testing programs, “critical thinking assessment” is represented more by “information synthesis”, “reading comprehension”, “problem solving” or other exercises that require an examinee to make a claim and cite evidence and reasoning to support it.


In some contexts this is also called “Argumentative Writing”, much as the “analyze an argument” question on the GMAT was once a common “analytical writing” task. But only one program comes to mind, the CAE’s Collegiate Learning Assessment Plus (or Minus or Pro or whatever the marketing types want to call it this year), that does or did (at one point) break out “problem solving” and “analytic reasoning & evaluation” as dimensions on a rubric for a performance task, although they may have moved toward a generalized “analysis and problem solving” dimension in current exams.


In any event, the big news today is that I have discovered EXACTLY the self-paced, student-centric, topic-organized critical thinking product and platform I have long envisioned as a replacement for the beloved “SRA Reading Cards” of my youth.  A group in Chicago has created a modern, digital version of this tool, organized as a set of topics grouped by subject matter and sequenced by grade / difficulty, that (hopefully) are as interesting and “teachful” as the SRA reading card stories and articles were. Only here, students WRITE about what they read, not just answer MCQs.  And they are taught to cite evidence, make claims, explain reasoning, and even identify counter-arguments!  Great stuff.

Read more about them at



The new PISA! Still Reviewing and Reading…

So much progress, so much exemplary assessment. So much data.

This will take some number of weeks to process… and I will try to link to the best bits. So far the Economist treatment is looking pretty thorough.

And of course the PISA web site itself has a visualization tool… I think…

The interactive “problem solving” exercises, using “MicroDYN” systems and “finite-state automata” in particular, look really interesting.

The test is here;  more results and more start here.

Another Kristof Krissis – TIMSS results used and abused

The headline pretty much tells the tale:

Nicholas Kristof Is Not Smarter Than an Eighth-Grader

American kids are actually doing decently in math and interpretation, but he’s not.

By Eugene Stern

You can read the original piece here or the Slate version here, but the summative bit gets right to the point, after the why-should-we-be-surprised cherry-picking of two extreme examples to misrepresent the results of the 2011 Trends in International Mathematics and Science Study (TIMSS) assessments, as shown here (see graphic or follow link to source).

Eugene Stern wraps up with these observations:

But Kristof isn’t willing to do either. He has a narrative of American underperformance in mind, and if the overall test results don’t fit his story, he’ll just go and find some results that do. Thus for the examples in his column, Kristof literally went and picked the two questions out of 88 on which the United States did the worst, and highlighted those in the column. (He gives a third example too, a question in which the U.S. was in the middle of the pack, but the pack did poorly, so the United States’ absolute score looks bad.) Presto! Instead of a story about kids learning stuff and doing decently on a test, we have yet another hysterical screed about Americans “struggling to compete with citizens of other countries.”

Kristof gives no suggestions for what we can actually do better, by the way. But he does offer this helpful advice:

Numeracy isn’t a sign of geekiness, but a basic requirement for intelligent discussions of public policy. Without it, politicians routinely get away with using statistics, as Mark Twain supposedly observed, the way a drunk uses a lamppost: for support rather than illumination.

So do op-ed columnists, apparently.

The propensity of some in the public eye to yearn for “stories” and narratives to support their heartfelt policy positions, to the degree that they misunderstand statistics or misinterpret information, is unfortunate; but here we have the conflation of two trends: one pointed up by books like “The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools” from 1996; the other by googling “kristof disingenuous”.

In any event, if we want “trends”, 2011 data probably isn’t the place to start.  Sadly, TIMSS 2015 numbers are probably over a year away.  But meanwhile, there is no shortage of new educational assessment data from the 50 states to dig into, so perhaps policy champions can go there and try, despite the tempting sound-bites and stories, to resist making stuff up.

iPads for Assessment – update

Mobile software, EdTech and the world of assessment are all fairly dynamic, and two years is almost two generations of change in some of these realms.

Smarter Balanced / SmarterApp just released an update on their technology considerations for iPad based testing, and it includes some good information as well as links to Apple resources that are helpful too.

For high-stakes testing, using iPads in “single app mode” has many obvious advantages, and tools are available to make using the devices in this mode relatively simple and easy.  Deployment tools for getting the devices into, and out of, single app mode, including by allowing programs to enter / exit SAM directly, have been implemented (as of iOS 8.1.3?) in a way that requires minimal overhead and administrative burden.

SmarterApp’s post, entitled Guidelines for the Configuration of iPads During Smarter Balanced Testing, is available here.  In particular, having test devices managed under “supervision” by Apple’s own “Apple Configurator”, or by third party “MDM” (mobile device management) software, enables flexible control over access to features like spell-check, dictionary and auto-correction, and that control can be granular based on the specific section of an assessment. As of iOS 8.1.3, these features can be controlled on managed iPad devices via customizable profiles (see more about that here).

SmarterApp also included a link, for those already familiar with MDM, to more detailed information on creating “custom profiles” to control specific features (please note: in “Single App Mode”, features like the home button and task switching are already locked out from the user experience), and a link to an Apple document entitled “Assessment with iPad” that provides up-to-the-minute (March 2015) information and is probably a better starting point than past information on this blog, which has long since been overtaken by events.



Correlation, Causation and How to think about EMQ (Educational Measurement Quality)

How best to assess our educational assessment tools has been an ongoing question for me for some time.  Measurement is an inherently statistical activity, and unsurprisingly, figuring out how well the tools are working, and which tools work better, is therefore largely a statistical question. Which tool measured which skill, ability or knowledge best, with the least error or greatest reliability, for whom, at what ability level, and with what evidence are all difficult questions. The questions are without end.  But since defining and demonstrating the relevance of “evidence” to “construct” can be nuanced and difficult, it is nice that some topics are simple and direct, such as how different measurements of what is, ostensibly, the “same thing” compare.  This is a bit more straightforward.

That said, if it develops that a test question scored by “expert” graders shows wide disparity between the scores assigned by different graders, many explanations are possible: is this a reflection of a lack of precision in the instrument, a lack of congruence between the thing measured and the ability used, or differences in opinion and viewpoint between the two scorers on how different aspects of the sub-domain impact the overall evaluation of the subject?  If, fundamentally, two essay graders cannot agree on the score for, say, “quality of writing” even most of the time, on a relatively small score scale, I find it hard to move past this point to try to improve “scoring” when the basic measure itself seems to be in question.

So how to think about this “scoring” challenge itself requires a view of how two sets of scores by different scorers might “correlate”, and of what might constitute useful correlation and what might not.  I have previously commented on how the distributions of scores from two scoring sources might be compared, and how “comparable” sets of scores could still, by some measures, reflect or hide significant bias in measurement.
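To make the point about correlation hiding bias concrete, here is a small Python sketch using made-up scores (the data and function names are my own assumptions, not taken from any study). The second rater ranks the essays exactly as the first does, but scores every essay one point higher.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_difference(x, y):
    """Average of (y - x): a simple indicator of systematic bias."""
    return sum(b - a for a, b in zip(x, y)) / len(x)

# Hypothetical scores for eight essays on a 1-6 scale.
rater_a = [1, 2, 3, 4, 5, 2, 3, 4]
rater_b = [s + 1 for s in rater_a]  # same ordering, one point more generous

print(pearson_r(rater_a, rater_b))        # correlation between the raters
print(mean_difference(rater_a, rater_b))  # average gap between the raters
```

In this contrived case the correlation is perfect even though the two raters never assign the same score to any essay, which is why correlation alone cannot rule out bias.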

This issue came to mind during a recent reading of an otherwise excellent and well documented research report 1 on the use of “e-rater”, ETS’ tool for analyzing essays, on TOEFL (ETS’ English language proficiency test) essays. The aim was to understand how technology developed to score a certain class of educational assessment “essays” or constructed response items (primarily or entirely in terms of “quality of writing”, it is important to note) might fare when used to evaluate the strength of second-language acquisition / ESL writing skills.

Whatever one makes of the content, there are 8 tables of data comparing how two different raters score the same essays and how those measurements compare to one another; how the human and e-rater ratings compare (either individually, or as an average of the human scores, etc.); as well as how these measures compare with other measures: self-evaluation, instructor evaluations (both ESL instructors and instructors in the student’s major area), and so on.  Of the various comparisons made in the eight tables, it was interesting to see that the Pearson’s r correlations of the human essay scores with e-rater scores, or with anything else, or between anything else, were generally at the low end of the 0.23 to 0.45 range, clustered below .40 (see tables 2, 3, 4 and 5). Correlations between professors’ writing judgements and e-rater scores (table 8) were 0.15 and 0.18 (!); they were higher for human ratings of their iBT essays, but still only .15 to .33.  As the scoring engine was being used for a non-designed purpose, low levels of correlation were not a surprise. What was a surprise, however, was the summative comment by the authors, who acknowledge that:

As for considerations of criterion-related validity, correlations between essay scores and other indicators of writing ability were generally moderate, whether they were scored by human raters or e-rater. These moderate correlations are not unlike those found in other criterion-related validity studies (see, for example, Kuncel et al., 2001 for a meta-analysis of such studies of the GRE). They are also similar to or higher than those presented in Powers et al. (2000), comparing e-rater scores of GRE essays with a variety of other indicators. The correlations in that study ranged from .08 to .30 for a single human rater, from .07 to .31 for two human raters, and from .09 to .24 for e-rater. …

What was surprising to me was primarily that such low levels of correlation would be described as “generally moderate” 2 in a peer-reviewed academic journal.  It makes me hope that higher standards are employed when AES (automated essay scoring) is actually used for “high stakes” testing, and that testing companies are transparent about both their scoring methodology and the statistical underpinnings of any scoring decisions made by algorithm.

I have read that there is fairly broad agreement among US citizens about what level of overall income tax seems “fair and reasonable”, and that many surveys point to something around 25% as a consensus figure if one rate had to apply to everyone.  I am wondering what parents or teachers would think is a reasonable level of inter-rater agreement for scoring a constructed response item, say on a “science” test or a “reading” test?  If an essay is scored on a 1 to 6 point scale, or a longish task is scored with zero to three points, and two humans grade each essay, would there be an expectation that two (qualified, trained) scorers would “agree” (meaning exactly, in case you are a psychometrician or statistician) most of the time? 2/3 of the time? All of the time? Or that the correlation between any two scores for the same essay by qualified graders would be at least X? And if two scorers agreed on the score for an essay less than half the time, would this be viewed as problematic, or “close enough”?  Of course, “it depends”, but… if you said “moderate correlation”, would that suffice?
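For anyone who wants to put numbers on that question, here is a minimal Python sketch of the two simplest agreement measures: the share of essays on which two scorers agree exactly, and the share on which they are within one point of each other (“adjacent” agreement). The scores and the function name are hypothetical, made up purely for illustration.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact, adjacent) agreement rates for two lists of integer scores.

    exact:    fraction of essays where both scorers gave the same score
    adjacent: fraction of essays where the scores differ by at most one point
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical scores for ten essays on a 1 to 6 point scale.
scorer_1 = [4, 3, 5, 2, 4, 6, 3, 4, 2, 5]
scorer_2 = [4, 4, 5, 3, 3, 6, 3, 5, 2, 3]

exact, adjacent = agreement_rates(scorer_1, scorer_2)
print(f"exact agreement:    {exact:.0%}")
print(f"adjacent agreement: {adjacent:.0%}")
```

In this made-up sample the scorers agree exactly on half of the essays but are within one point on nine of ten, so “agreement” can sound quite different depending on which figure gets reported.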

My own view is that a substantial portion of the challenge in scoring “constructed response items” comes from the degree to which the rubrics used to score answers to these questions are too vague, leaving too many rules for applying the rubric unspoken or for graders to decide themselves.  [For example, scoring rubrics that call for distinctions between “adequate mastery”, “reasonable mastery” and “clear and consistent mastery” of a skill or some knowledge without defining these distinctions or gradations in an objective, concrete way.]  Particularly where there is a single, “holistic” score, but also in narrower scoring scenarios, there will always be component elements of the score that different graders consider. Unless there is common guidance on how the rubric is to be applied in a variety of scenarios, and scorers are trained in the same way with the same results, the “signal” in the score may in many instances struggle to stand out above the din and clamor of variation introduced by individual preferences, interpretations and ideas. The result is measurements with unfortunate reliability, which tarnishes the value attached to the assessment because, from first-hand experience, people will begin to see variations in performance and ability that change from test to test and do not reflect apparent differences in the demonstrated skills, knowledge and ability of the examinees themselves 3.

One last note: when thinking about “how much inter-rater agreement would be a minimum indication of useful measurement”, I found this bit 4 useful; it suggests a lower bound (quoting):

“Specifically, the quadratic-weighted kappa between automated and human scoring must be at least .70 (rounded normally) on data sets that show generally normal distributions, providing a threshold at which approximately half of the variance in human scoring is accounted for by [the automated scoring engine]…

This value was selected on the conceptual basis that it represents the “tipping point” at which signal outweighs noise in agreement. The identical criterion of .70 has been adopted for product-moment correlation with the same underlying rationale regarding proportion of variance accounted for by [the technology].”   [Emphasis / color added by ME!]

Of course, a) talking about “normal distributions” when you have a four point scale has less meaning than it might in other situations… and b) people of good will can disagree on such things (in both the details and as a matter of their own ideas about how learning, and measurement, work…)
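For reference, the two statistics in the quoted criterion can be sketched in a few lines of Python. This is my own minimal implementation of quadratic-weighted kappa from its textbook definition, not code from the cited framework; the function names and sample scores are assumptions for illustration. The “half of the variance” rationale follows from squaring the correlation: .70 squared is .49.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic-weighted kappa for two lists of integer scores.

    Disagreements are penalised by the squared distance between scores,
    relative to the disagreement expected by chance from each rater's
    own score distribution. Undefined (division by zero) if both raters
    use a single identical score for every essay.
    """
    n = max_score - min_score + 1
    total = len(a)
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / total
    return 1.0 - num / den

def variance_explained(r):
    """Proportion of variance accounted for by a correlation of r."""
    return r * r

# Hypothetical human and engine scores on a four point scale.
human = [1, 2, 2, 3, 3, 3, 4, 4]
engine = [1, 2, 3, 3, 3, 2, 4, 3]
print(quadratic_weighted_kappa(human, engine, 1, 4))

for r in (0.15, 0.30, 0.45, 0.70):
    print(r, round(variance_explained(r), 2))
```

Seen this way, a correlation of .30 accounts for about 9% of the variance, a long way from the roughly half implied by the .70 threshold.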



1) see Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335-353. or here:

2)  From wikipedia, for example (and note I am not declaring wikipedia infallible, or even authoritative, but citing it as a reflection of what some? many? at least one? people might consider reasonable):

The strength and significance of the coefficient
The following general categories indicate a quick way of interpreting a calculated r value:
0.0 to 0.2 Very weak to negligible correlation
0.2 to 0.4 Weak, low correlation (not very significant)
0.4 to 0.7 Moderate correlation
0.7 to 0.9 Strong, high correlation
0.9 to 1.0 Very strong correlation
3)  Some of this thinking owes a debt to Wayne Patience, whose 1988 work on the challenges of introducing a human scored essay into the GED [Establishing and Maintaining Score Scale Stability and Reading Reliability, presented at the annual meeting of The National Testing Network in Writing, Minneapolis, Minnesota, April 1988] speaks to the challenges of consistent grading and the range of mechanisms (training, ongoing scorer validation to protect against drift, etc.) that remain, almost 30 years later, the same considerations and control mechanisms.
4)  see Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.  [Widely available but I am happy to share a copy.]