Critical Thinking Assessment

Often, in the context of large scale testing programs, “critical thinking assessment” is represented more by “information synthesis“, “reading comprehension“, “problem solving” or other exercises that require an examinee to make a claim and cite evidence and reasoning to support it.


In some contexts this is also called “Argumentative Writing” — much as the “analyze and argument” question on the GMAT was once a common “analytical writing” task, but only one program that comes to mind — the CAE’s Collegiate Learning Exam Plus (or Minus or Pro of whatever the marketing types want to call it this year) — does or did (at one point) break out “problem solving” and “analytic reasoning & evaluation” as dimensions on a rubric for a performance task, although they may have moved toward generalize “analysis and problem solving” dimension in current exams.


In any event, the big news today is that I have discovered EXACTLY the self-paced, student-centric, topic-organized critical thinking product and platform I have long envisioned that would replace the beloved “SRA Reading Cards” of my youth.  A group in Chicago has created a modern, digital version of this tool — organized as a set of subject mater-organized topics, grade / difficulty sequenced, that (hopefully) are as interesting and “teachful” as the SRA reading card stories and articles were. Only here, students WRITE about what they read, not just answer MCQs.  And they are taught to cite evidence, make claims, explain reasoning — even identify counter-arguments!  Great stuff.

Read more about them at



Status of Critical Thinking in the Workplace

Status of Critical Thinking in the Workplace –

the Most Important Skill for Business Growth

This blog post by Person is a welcome gesture highlighting Critical Thinking, both to highlight its importance to actual work and business, and to call this to the attention of Higher Ed.

Tidbits:  Specifically, when it comes to skills like critical thinking, it is consistently rated by employers as being a skill of increasing importance, and yet a recent study showed 49% of employers rate their employees’ critical thinking skills as only average or below average.

The graphic also was interesting in its display of how 2 year grads compared with 4 year grads on the measure of “critical thinking skills”: 4 year college graduates CT-skills-by-college-group-0325-0144

were more likely to be rated “excellent” for critical thinking than 2 year grads — 28 vs 4% — but for “adequate” skill level, community college graduates fared better with  73% to 63%.  Overall the recent of 2 year grads rated as deficient, but I could help but wonder if the there was a bias in the reported scores for 2 year grads (e.g, graded on a curve, as this curve looked a bit strange.

This blog will feature more references to Critical Thinking going forward, as my research into better instruction and measurement for Critical Thinking and Problem Solving skills makes progress.

Correlation, Causation and How to think about EMQ (Educational Measurement Quality)

How best to assess our educational assessment tools has been an ongoing question for me for some time.  Measurement is an inherently statistical activity, and unsurprisingly, figuring out how well the tools are working, and which tools work better, is therefore largely a statistical question — what tool measured what skill, ability  or knowledge best, with least error or greatest reliability, for whom, at what ability level, with what evidence,  etc. are difficult questions. They questions are without end.  But since defining and demonstrating the relevance of “evidence” to “construct” can be nuanced and difficult, it is nice that some topics are simple and direct —  such as how to do different measurements compare of what is, ostensibly, the “same thing”.  This is a bit more straightforward.

That said, if it develops that a test question scored by “expert” graders finds wide disparity between the scores assigned by different graders, many explanations are possible: is this a reflection of a lack of precision in the instrument, a lack of congruence between the thing measured and the ability used, differences in opinion and viewpoint by the two scorers — on how different aspects of the sub-domain impact the overall evaluation of the subject?  If fundamentally, two essay graders cannot agree to the score of say “quality of writing” even most of the time, on a relatively small score sale, I find it hard to move past this point to try to improve “scoring” when the basic measure itself seems to be in question.

So how to think about this “scoring” challenge itself requires a view of how two sets of scores by different scorers might “correlate” and what might constitute useful correlation and what might not.  I have previously commented on how the distribution of score from two scoring sources might be compared, and how “comparable” sets of scores could still, by some measures, reflect or hide significant bias in measurement.

This issue came to mind during a recent reading of an otherwise excellent and well documented research report 1 on the use of “e-Rater”, ETS’ tool for analyzing “essays”, TOEFL (ETS’ English language proficiency test) essays to understand how the technology developed to score a certain class of educational assessment “essays” or constructed response items (primarily or entirely in terms of “quality of writing”, it is important to note), might fare when used on evaluating the strength of second-language acquisition / ESL writing skills.

Whatever one makes of the content, there are 8 tables of data comparing how two different raters score the same essays and how those measurements compare to one-another; how the human an e-rater ratings compare (either individually, or as an average of the human scores, etc. ) as well as how these measures compare with other measures — self-evaluation, instructor evaluations (both ESL instructors and instructors in the student’s major area), and so on.  Of the various comparisons made in the eight tables, it was interesting to see that the person’s r correlations with the human essay scores to e-rater scores, or to anything or between anything else, were generally in the low end of the 0.23 to 0.45 ranges  clustered below .40 (see tables 2,3,4 and 5; correlations between professors writing judgements and e-rater  scores (table 8) were 0.15 and 0.18 (!) while higher for human ratings of their iBT essays but still only .15 to .33.  As the scoring engine was being used for a non-designed purpose, low levels of correlation were not a surprise. What was a surprise, however, was the summative comments by the authors that acknowledge  that:

As for considerations of criterion-related validity, correlations between essay scores and other indicators of writing ability were generally moderate, whether they were scored by human raters or e-rater. These moderate correlations are not unlike those found in other criterion-related validity studies (see, for example, Kuncel et al., 2001 for a meta-analysis of such studies of the GRE). They are also similar to or higher than those presented in Powers et al. (2000), comparing e-rater scores of GRE essays with a variety of other indicators. The correlations in that study ranged from .08 to .30 for a single human rater, from .07 to .31 for two human raters, and from .09 to .24 for e-rater. …

What was surprising to me was primarily that  such low levels of correlation would be described as “generally moderate” in a peer-reviewed, academic journal.  It makes me hope that higher standards are employed when AES scoring is actually used for “high stakes” testing, and that testing companies are transparent about both their scoring methodology and the statistical underpinnings of any scoring decisions made by algorithm.

I have read that there is fairly broad agreement among US citizens about what level of overall income tax seems “fair and reasonable”, and that many surveys point to something around 25% as a consensus figure if one rate had to apply to everyone.  I am wondering what parents or teachers would think is a reasonable level of inter-rater agreement for scoring a constructed response item, say on a “science” test or a “reading” test?  If an essay is scored on a 1 to 6 point scale, or a longish task is scored with zero to up to three points, and two humans grade each essay, would there be an expectation that two (qualified, trained) scorers would “agree” (meaning exactly, in case you are a psychometrician or statistician) most of the time? 2/3 of the time? all of the time? or that the correlation between any two scores for the same essay by qualified graders would be at least X? And if less than half the time two scorers would agree on the score for an essay, would this be viewed as problematic ? or “close enough”?  Of course, “it depends”, but…if you said “moderate correlation”, would that suffice?

My own view is that a substantial portion of the challenge in scoring “constructed response items” comes from the degree to which the rubrics used to score answers for these questions are too vague and leave too many rules for applying the rubric unspoken or for the grader to decide themselves.  [For example, scoring rubrics that call for distinctions between “adequate mastery”, “reasonable mastery” and “clear and consistent mastery” of a skill or some knowledge without defining these distinctions or gradations in an objective, concrete way].  Particularly where there is a single, “holistic” score, but also in more narrow scoring scenarios, there will always be component elements of the score that different graders consider, and unless there is common guidance on how the rubric is to be applied in a variety of scenarios, and scorers are trained in the same way with the same results, the level of “noise” in the score “signal” may in many instances struggle to stand out above the din and clamor or variation introduced by individual preferences, interpretations and ideas, etc., yielding measurements with unfortunate reliability and tarnishing the value attached to the assessment because, from first hand experience, people will begin to see variations in performance and ability that change from test to test, and do not reflect apparent differences in the demonstrated skills, knowledge and ability of the examinees themselves3.

One last note: I found useful when thinking about “how much inter-rater agreement would be a minimum indication of useful measurement”, I read this bit 4 which suggested a lower bound (quoting):

“Specifically, the quadratic-weighted kappa between automated and human scoring must be at least .70 (rounded normally) on data sets that show generally normal distributions, providing a threshold at which approximately half of the variance in human scoring is accounted for by [the automated scoring engine]…

This value was selected on the conceptual basis that it represents the “tipping point “at which signal outweighs noise in agreement. The identical criterion of .70 has been adopted for product-moment correlation with the same underlying rationale regarding proportion of variance accounted for by [the technology].”   [Emphasis / color added by ME!]

Of course, a) talking about “normal distributions ” when you have a four point scale has less meaning then it might in other situations… and b) people of good will can disagree on such things (in both the details and as a matter of their own ideas about how learning, and measurement, work…)



1) see Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability. Language Testing27(3), 335-353. or here:

2)  From wikipedia, for example (and note I am not declaring wikipedia infallible, or even authoritative, but citing it as a reflection of what some? many? at least one? people might consider reasonable):

The strength and significance of the coefficient
The following general categories indicate a quick way of interpreting a calculated r value:
0.0 to 0.2 Very weak to negligible correlation
0.2 to 0.4 Weak, low correlation (not very significant)
0.4 to 0.7 Moderate correlation
0.7 to 0.9 Strong, high correlation
0.9 to 1.0 Very strong correlation
3)  Some of this thinking owes a debt to Wayne Patience whose 1988 work on the challenges of introducing a human scored essay into the GED [Establishing and Maintaining Score Scale Stability and Reading Reliability, Presented at the annual meeting of The National Testing Network in Writing, Minneapolis, Minnesota, April  1988.] speaks to the challenges of consistent grading and the range of mechanisms — training, ongoing scorer validation to protect against drift, etc — that remain almost 30 years latter the same considerations and control mechanisms.
4)  see Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice31(1), 2-13.  [Widely available but I am happy to share a copy.]