Critical Thinking Assessment

Often, in the context of large-scale testing programs, “critical thinking assessment” is represented more by “information synthesis”, “reading comprehension”, “problem solving” or other exercises that require an examinee to make a claim and cite evidence and reasoning to support it.


In some contexts this is also called “Argumentative Writing”, much as the “Analysis of an Argument” question on the GMAT was once a common “analytical writing” task. But only one program comes to mind that does, or at one point did, break out “problem solving” and “analytic reasoning & evaluation” as dimensions on a rubric for a performance task: the CAE’s Collegiate Learning Assessment Plus (or Minus or Pro or whatever the marketing types want to call it this year). They may since have moved toward a generalized “analysis and problem solving” dimension in current exams.


In any event, the big news today is that I have discovered EXACTLY the self-paced, student-centric, topic-organized critical thinking product and platform I have long envisioned as a replacement for the beloved “SRA Reading Cards” of my youth.  A group in Chicago has created a modern, digital version of this tool, organized as a set of topics grouped by subject matter and sequenced by grade / difficulty, that (hopefully) are as interesting and “teachful” as the SRA reading card stories and articles were. Only here, students WRITE about what they read, not just answer MCQs.  And they are taught to cite evidence, make claims, explain reasoning, and even identify counter-arguments!  Great stuff.

Read more about them at



The new PISA! Still Reviewing and Reading…

So much progress, so much exemplary assessment. So much data.

This will take some number of weeks to process… and I will try to link to the best bits. So far the Economist treatment is looking pretty thorough.

And of course the PISA web site itself has a visualization tool… I think…

The interactive “problem solving” exercises, in particular those using “MicroDYN” systems and “finite-state automata”, look really interesting.

The test is here;  more results and more start here.

Another Kristof Krissis – TIMSS results used and abused

The headline pretty much tells the tale:

Nicholas Kristof Is Not Smarter Than an Eighth-Grader –

American kids are actually doing decently in math and interpretation, but he’s not.

By Eugene Stern

You can read the original piece here or the Slate version here, but the summative bit gets right to the point, after the why-should-we-be-surprised cherry-picking of two extreme examples to misrepresent the results of the 2011 Trends in International Mathematics and Science Study (TIMSS)  assessments; as shown here (see graphic or follow link to source).

Eugene Stern wraps up with these observations:

But Kristof isn’t willing to do either. He has a narrative of American underperformance in mind, and if the overall test results don’t fit his story, he’ll just go and find some results that do. Thus for the examples in his column, Kristof literally went and picked the two questions out of 88 on which the United States did the worst, and highlighted those in the column. (He gives a third example too, a question in which the U.S. was in the middle of the pack, but the pack did poorly, so the United States’ absolute score looks bad.) Presto! Instead of a story about kids learning stuff and doing decently on a test, we have yet another hysterical screed about Americans “struggling to compete with citizens of other countries.”

Kristof gives no suggestions for what we can actually do better, by the way. But he does offer this helpful advice:

Numeracy isn’t a sign of geekiness, but a basic requirement for intelligent discussions of public policy. Without it, politicians routinely get away with using statistics, as Mark Twain supposedly observed, the way a drunk uses a lamppost: for support rather than illumination.

So do op-ed columnists, apparently.

The propensity of some in the public eye to yearn for “stories” and narratives to support their heartfelt policy positions, to the degree that they misunderstand statistics or misinterpret information, is unfortunate;  but here we have the conflation of two trends: one pointed up by books like “The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools” from 1996;  the other by googling “kristof disingenuous”.

In any event, if we want “trends”, 2011 data probably isn’t the place to start.  Sadly, TIMSS 2015 numbers are probably over a year away.  But meanwhile, there is no shortage of new educational assessment data from the 50 states to dig into, so perhaps policy champions can go there and try, despite the tempting sound-bites and stories, to resist making stuff up.

iPads for Assessment – update

Mobile software, EdTech and the world of assessment are all fairly dynamic, and two years is almost two generations of change in some of these realms.

Smarter Balanced / SmarterApp just released an update on their technology considerations for iPad based testing, and it includes some good information as well as links to Apple resources that are helpful too.

For high-stakes testing, using iPads in “single app mode” has many obvious advantages, and tools are available to make using the devices in this mode relatively simple and easy.  Deployment tools for getting the devices into, and out of, single app mode, including by allowing programs to enter / exit SAM directly, have been implemented (as of iOS 8.1.3?) in a way that requires minimal overhead and administrative burden.

SmarterApp’s post, entitled Guidelines for the Configuration of iPads During Smarter Balanced Testing, is available here.  In particular, having test devices managed under “supervision” by Apple’s own “Apple Configurator”, or third-party “MDM” (mobile device management) software, enables flexible control over access to features like spell-check, dictionary and auto-correction, and that control can be granular based on the specific section of an assessment. As of iOS 8.1.3, these features can be controlled on managed iPad devices via customizable profiles; see more about that here.

SmarterApp also included a link to more detailed information on creating “custom profiles” to control specific features (please note: in “Single App Mode” features like the home button and task switching are already locked out from the user experience) for those already familiar with MDM, and a link to this Apple document entitled “Assessment with iPad”, which provides up-to-the-minute (March 2015) information and is probably a better starting point than past information on this blog, which has long since been overtaken by events.



Correlation, Causation and How to think about EMQ (Educational Measurement Quality)

How best to assess our educational assessment tools has been an ongoing question for me for some time.  Measurement is an inherently statistical activity, and unsurprisingly, figuring out how well the tools are working, and which tools work better, is therefore largely a statistical question. What tool measured what skill, ability or knowledge best, with least error or greatest reliability, for whom, at what ability level, with what evidence, etc., are difficult questions. The questions are without end.  But since defining and demonstrating the relevance of “evidence” to “construct” can be nuanced and difficult, it is nice that some topics are simple and direct, such as how different measurements of what is, ostensibly, the “same thing” compare.  This is a bit more straightforward.

That said, if it develops that a test question scored by “expert” graders finds wide disparity between the scores assigned by different graders, many explanations are possible: is this a reflection of a lack of precision in the instrument, a lack of congruence between the thing measured and the ability used, or differences in opinion and viewpoint between the two scorers on how different aspects of the sub-domain impact the overall evaluation of the subject?  If, fundamentally, two essay graders cannot agree on the score for, say, “quality of writing” even most of the time, on a relatively small score scale, I find it hard to move past this point to try to improve “scoring” when the basic measure itself seems to be in question.

So how to think about this “scoring” challenge itself requires a view of how two sets of scores by different scorers might “correlate”, and of what might constitute useful correlation and what might not.  I have previously commented on how the distributions of scores from two scoring sources might be compared, and how “comparable” sets of scores could still, by some measures, reflect or hide significant bias in measurement.
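To make the point about hidden bias concrete, here is a minimal sketch in Python. The scores are invented for illustration (they are not from any study), and the function names are my own. It shows one way two sets of scores can correlate perfectly while one scorer is systematically a full point more generous than the other.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def mean_difference(a, b):
    """Average of (b - a): a simple indicator of systematic bias between scorers."""
    return sum(y - x for x, y in zip(a, b)) / len(a)

# Hypothetical scores on a 1 to 6 scale; scorer B is always one point higher.
scorer_a = [2, 3, 1, 4, 2, 3, 5, 2, 4, 3]
scorer_b = [score + 1 for score in scorer_a]

r = pearson_r(scorer_a, scorer_b)
bias = mean_difference(scorer_a, scorer_b)
exact_matches = sum(x == y for x, y in zip(scorer_a, scorer_b))

print(f"correlation: {r:.2f}, mean difference: {bias:.2f}, exact matches: {exact_matches}")
```

In this made-up case the correlation is as high as it can be, yet the two scorers never once assign the same score. Correlation alone says nothing about whether the scores sit on the same level.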

This issue came to mind during a recent reading of an otherwise excellent and well documented research report 1 on the use of “e-Rater”, ETS’ tool for analyzing essays, on TOEFL (ETS’ English language proficiency test) essays. The report sets out to understand how technology developed to score a certain class of educational assessment “essays” or constructed response items (primarily or entirely in terms of “quality of writing”, it is important to note) might fare when used to evaluate the strength of second-language acquisition / ESL writing skills.

Whatever one makes of the content, there are 8 tables of data comparing how two different raters score the same essays and how those measurements compare to one another; how the human and e-rater ratings compare (either individually, or as an average of the human scores, etc.); and how these measures compare with other measures such as self-evaluation, instructor evaluations (both ESL instructors and instructors in the student’s major area), and so on.  Of the various comparisons made in the eight tables, it was interesting to see that the Pearson’s r correlations of the human essay scores with e-rater scores, or with anything else, or between anything else, were generally in the low end of the 0.23 to 0.45 range, clustered below .40 (see tables 2, 3, 4 and 5). Correlations between professors’ writing judgements and e-rater scores (table 8) were 0.15 and 0.18 (!), while those for human ratings of their iBT essays were higher but still only .15 to .33.  As the scoring engine was being used for a non-designed purpose, low levels of correlation were not a surprise. What was a surprise, however, was the summative comment by the authors, who acknowledge that:

As for considerations of criterion-related validity, correlations between essay scores and other indicators of writing ability were generally moderate, whether they were scored by human raters or e-rater. These moderate correlations are not unlike those found in other criterion-related validity studies (see, for example, Kuncel et al., 2001 for a meta-analysis of such studies of the GRE). They are also similar to or higher than those presented in Powers et al. (2000), comparing e-rater scores of GRE essays with a variety of other indicators. The correlations in that study ranged from .08 to .30 for a single human rater, from .07 to .31 for two human raters, and from .09 to .24 for e-rater. …

What was surprising to me was primarily that  such low levels of correlation would be described as “generally moderate” in a peer-reviewed, academic journal.  It makes me hope that higher standards are employed when AES scoring is actually used for “high stakes” testing, and that testing companies are transparent about both their scoring methodology and the statistical underpinnings of any scoring decisions made by algorithm.

I have read that there is fairly broad agreement among US citizens about what level of overall income tax seems “fair and reasonable”, and that many surveys point to something around 25% as a consensus figure if one rate had to apply to everyone.  I am wondering what parents or teachers would think is a reasonable level of inter-rater agreement for scoring a constructed response item, say on a “science” test or a “reading” test?  If an essay is scored on a 1 to 6 point scale, or a longish task is scored with zero to up to three points, and two humans grade each essay, would there be an expectation that two (qualified, trained) scorers would “agree” (meaning exactly, in case you are a psychometrician or statistician) most of the time? 2/3 of the time? all of the time? or that the correlation between any two scores for the same essay by qualified graders would be at least X? And if less than half the time two scorers would agree on the score for an essay, would this be viewed as problematic ? or “close enough”?  Of course, “it depends”, but…if you said “moderate correlation”, would that suffice?
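For anyone who wants to put numbers on these questions, here is a rough sketch in Python of the three figures at issue: exact agreement, agreement within one point, and Pearson’s r. The two score lists are invented for illustration, and the function names are mine, not those of any scoring vendor.

```python
from math import sqrt

def exact_agreement(a, b):
    """Share of essays on which both raters gave the identical score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a, b, tolerance=1):
    """Share of essays on which the raters differ by at most `tolerance` points."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

def pearson_r(a, b):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical scores from two trained raters on a 1 to 6 scale.
rater_1 = [3, 4, 2, 5, 3, 4, 1, 6, 3, 4]
rater_2 = [3, 3, 2, 4, 4, 4, 2, 5, 3, 5]

print(f"exact agreement:    {exact_agreement(rater_1, rater_2):.2f}")
print(f"adjacent agreement: {adjacent_agreement(rater_1, rater_2):.2f}")
print(f"Pearson r:          {pearson_r(rater_1, rater_2):.2f}")
```

With these made-up numbers the raters agree exactly on only 4 of 10 essays, are never more than a point apart, and still correlate at about 0.83. Which of those three figures a parent or teacher would treat as “agreement” is exactly the question.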

My own view is that a substantial portion of the challenge in scoring “constructed response items” comes from the degree to which the rubrics used to score answers for these questions are too vague, leaving too many rules for applying the rubric unspoken or for graders to decide themselves.  [For example, scoring rubrics that call for distinctions between “adequate mastery”, “reasonable mastery” and “clear and consistent mastery” of a skill or some knowledge without defining these distinctions or gradations in an objective, concrete way.]  Particularly where there is a single, “holistic” score, but also in more narrow scoring scenarios, there will always be component elements of the score that different graders consider. Unless there is common guidance on how the rubric is to be applied in a variety of scenarios, and scorers are trained in the same way with the same results, the score “signal” may in many instances struggle to stand out above the “noise”: the din and clamor of variation introduced by individual preferences, interpretations and ideas. The result is measurements with poor reliability, which tarnishes the value attached to the assessment because, from first-hand experience, people will begin to see variations in performance and ability that change from test to test and do not reflect apparent differences in the demonstrated skills, knowledge and ability of the examinees themselves 3.

One last note: when thinking about “how much inter-rater agreement would be a minimum indication of useful measurement”, I found this bit 4 useful, which suggested a lower bound (quoting):

“Specifically, the quadratic-weighted kappa between automated and human scoring must be at least .70 (rounded normally) on data sets that show generally normal distributions, providing a threshold at which approximately half of the variance in human scoring is accounted for by [the automated scoring engine]…

This value was selected on the conceptual basis that it represents the “tipping point” at which signal outweighs noise in agreement. The identical criterion of .70 has been adopted for product-moment correlation with the same underlying rationale regarding proportion of variance accounted for by [the technology].”   [Emphasis / color added by ME!]

Of course, a) talking about “normal distributions” when you have a four point scale has less meaning than it might in other situations… and b) people of good will can disagree on such things (in both the details and as a matter of their own ideas about how learning, and measurement, work…)
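For the curious, here is a small Python sketch of the two quantities in that quotation: quadratic-weighted kappa and the “proportion of variance accounted for” (the square of the correlation). It uses the standard textbook definition of weighted kappa with quadratic weights, not ETS’s own code, and the scores are invented for illustration.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights.

    Assumes integer scores in [min_score, max_score] and at least two
    distinct scores overall (otherwise the expected disagreement is zero).
    """
    k = max_score - min_score + 1
    n = len(a)
    # Observed counts: observed[i][j] = essays scored i by rater A and j by rater B.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently of each other.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected

def variance_explained(r):
    """Proportion of variance in one set of scores accounted for by the other."""
    return r * r

# Hypothetical human and engine scores on a 1 to 6 scale.
human = [3, 4, 2, 5, 3, 4, 1, 6, 3, 4]
engine = [3, 3, 2, 4, 4, 4, 2, 5, 3, 5]

print(f"quadratic-weighted kappa: {quadratic_weighted_kappa(human, engine, 1, 6):.2f}")
for r in (0.15, 0.23, 0.40, 0.45, 0.70):
    print(f"r = {r:.2f} -> variance explained = {variance_explained(r):.2f}")
```

The arithmetic behind the “tipping point” is simply that 0.70 squared is 0.49, or roughly half the variance. By the same arithmetic, a correlation of 0.40 accounts for 16% of the variance, which is the context in which I read the word “moderate”.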



1) see Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing, 27(3), 335-353. or here:

2)  From wikipedia, for example (and note I am not declaring wikipedia infallible, or even authoritative, but citing it as a reflection of what some? many? at least one? people might consider reasonable):

The strength and significance of the coefficient
The following general categories indicate a quick way of interpreting a calculated r value:
0.0 to 0.2 Very weak to negligible correlation
0.2 to 0.4 Weak, low correlation (not very significant)
0.4 to 0.7 Moderate correlation
0.7 to 0.9 Strong, high correlation
0.9 to 1.0 Very strong correlation
3)  Some of this thinking owes a debt to Wayne Patience, whose 1988 work on the challenges of introducing a human-scored essay into the GED [Establishing and Maintaining Score Scale Stability and Reading Reliability, Presented at the annual meeting of The National Testing Network in Writing, Minneapolis, Minnesota, April  1988.] speaks to the challenges of consistent grading and the range of mechanisms (training, ongoing scorer validation to protect against drift, etc.) that remain, almost 30 years later, the same considerations and control mechanisms.
4)  see Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.  [Widely available but I am happy to share a copy.]

Assessment Challenge: 2) Better tools to measure critical thinking

The limitations of selected response questions aside, many researchers have spent years determining what skills are the most important for developing students to cultivate, and additional years figuring out how to measure these qualities.  There seems to be a pretty clear (and perhaps longstanding) consensus on both these topics, which, given their (lack of) visibility in the marketplace, is pretty surprising.  The consensus answers are a) students should learn “critical thinking” and “problem solving” skills (and yes, there is much debate about how to define these things, but there is also much practical work completed on doing so); and b) these skills are best assessed in the context of a range of situations broadly described as “performance tasks”, but which can, in their simplest form, rely on written responses from examinees.


The best writing I’ve seen of late on this topic comes from the Council for Aid to Education (and associated individuals).  Their most recent work includes an item from May 2013 that called out the “multiple-choice testing” problem with some clarity, and offered at least a partial solution. This monograph is entitled “The Case for Critical-Thinking and Performance Assessment”, authored by well known and articulate voices in assessment and education policy. Even as most major test publishing companies in the US education market strive to move their instruments toward calibrating higher-order skills and knowledge, with the rationale that the “twenty-first century skills” needed for the “jobs of the future” will be a lot more cognitively demanding, K12 testing remains anchored in MCQ-land, with recent signs of retreat by the leading assessment consortia from prior ambitions for greater use of constructed response in both math and ELA.


This second piece, by Benjamin Rogers et al, goes directly at almost every aspect of the problems identified in the prior article on the failings of current standardized testing efforts. It advocates the use of performance tasks to measure critical thinking skills, rather than traditional approaches like selected-response questions used to measure “domain mastery” of specific course subject content at specific points in time.

So with these two very solid examinations of both the “problem” (high stakes multiple-choice tests) and of potential solutions (“performance tasks” with long form constructed response), an obvious stampede might be expected toward a resolution. Sadly this will not be the case anytime soon, and maybe not even several years from now, unless one or more practical, economic and high quality solutions emerge to an unstated (or under-stated) aspect of the “performance task”, “critical-thinking” assessment approach: the lack of reliable, accurate and cost-effective scoring for such assessments.

The workable approach to the horrible economics, and at times questionable psychometrics,  of trying to score performance tasks at scale in a reliable and consistent way is not presently at hand.  Progress is being made, and ideas for improvement will be the subject of a future post.


Assessment Challenge: 1) Pushback on MCQ will not abate

The fact that selected response questions do not require the same sort of thought processes as actual problem solving is fairly apparent to everyone from students to parents, teachers to administrators, and even to psychometricians themselves.  However, only one group truly appreciates the core value they provide: reliable and consistent measurement if nothing else, but provably relevant and capable in the task of educational measurement as well. And that group, psychometricians, is vanishingly small and, at times it seems, even less influential.

Recently published article


The inspiration for this thought is the recent publication of a scholarly article on how and why the “School reform” movement in US education is losing steam if not failing outright. It lays the blame for this failure on the (over)use of the all-too-familiar “multiple-choice question”-based “standardized tests”, tests which generally inspire no end of worthy objection. The objections are all too familiar: “life is not a multiple choice activity” (and other aspects of construct irrelevancy); and “teaching to the test”, which takes away from the valuable things teachers otherwise want to teach (or the tension between rote and formulaic learning versus problem-solving and critical thinking skills) and which not only undermines proper instruction but results in students being rated for the “wrong” skills. There are many others, but these are perhaps the best objections.

Standardized testing still essentially means “selected response” testing, which still largely raises hackles over its “inauthentic” nature, its essential failure to represent constructs directly and well, and (at least for me) the issue that when constructed response items are used, they are today (early 2015) most often created, validated and scored by processes that are far too generous in their willingness to extract (or claim) measurement from the noise in their output. The first two of these issues are raised well and are thoughtfully discussed in the American Scholar piece (and many others of late); the value add of the Mike Rose piece is that it ties the scourge of high-stakes multiple-guess questions directly to key elements in the push-back against educational reform.  This latter issue, about the difficulty of scoring other sorts of questions, will be the subject of a future blog post.

Meanwhile my thinking is that even re-vamped, new and improved standardized tests that continue to rely almost entirely on selected response questions, such as are reflected in the newest SBAC and PARCC sample items, will struggle for legitimacy and acceptance. That is too bad, and it also simply “kicks the can” further down the field for others to solve.