Machine Learning for Text in the News (again): Finance

A short but interesting piece in The Economist this week, entitled “Machine-learning promises to shake up large swathes of finance,” under the heading “Unshackled algorithms” (located here).

Many of the usual observations and platitudes are contained therein, but I thought these quotes were notable:

  • Natural-language processing, where AI-based systems are unleashed on text, is starting to have a big impact in document-heavy parts of finance. In June 2016 JPMorgan Chase deployed software that can sift through 12,000 commercial-loan contracts in seconds, compared with the 360,000 hours it used to take lawyers and loan officers to review the contracts. [So maybe once again I am focused on one of the least remunerative aspects of a new technology…]
  • Perhaps the newest frontier for machine-learning is in trading, where it is used both to crunch market data and to select and trade portfolios of securities. The quantitative-investment strategies division at Goldman Sachs uses language processing driven by machine-learning to go through thousands of analysts’ reports on companies. It compiles an aggregate “sentiment score” based on the balance of positive to negative words. [Seems a bit simplistic, no?]

  • In other fields, however, machine-learning has game-changing potential. There is no reason to expect finance to be different. According to Jonathan Masci of Quantenstein, a machine-learning fund manager, years of work on rules-based approaches in computer vision—telling a computer how to recognise a nose, say—were swiftly eclipsed in 2012 by machine-learning processes that allowed computers to “learn” what a nose looked like from perusing millions of nasal pin-ups. Similarly, says Mr Masci, a machine-learning algorithm ought to beat conventional trading strategies based on rules set by humans. [The data point replicates, over the same timeframe, when Elijah Mayfield showed that off-the-shelf, open-source machine learning could, with days of work, produce results (for scoring essays) competitive with the capabilities of decades-old rule-based systems (e-Rater, Intelligent Essay Assessor and six others). See note below]
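The “sentiment score” quoted above can be made concrete with a toy sketch. The word lists and the scoring formula below are my own illustrative assumptions, not Goldman's actual method:

```python
# Toy sketch of an aggregate "sentiment score": the balance of positive
# to negative words across a batch of analyst reports. The word lists
# and the (pos - neg) / (pos + neg) formula are invented for illustration.

POSITIVE = {"growth", "beat", "strong", "upgrade", "outperform"}
NEGATIVE = {"miss", "weak", "downgrade", "risk", "underperform"}

def sentiment_score(text):
    """Score one report: (pos - neg) / (pos + neg); 0.0 if no hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def aggregate(reports):
    """Average the per-report scores into one aggregate score."""
    return sum(map(sentiment_score, reports)) / len(reports)

reports = [
    "strong quarter with earnings beat and analyst upgrade",
    "guidance miss raises downgrade risk",
]
print(aggregate(reports))  # one fully positive + one fully negative -> 0.0
```

Real pipelines would of course normalize text, weight words, and handle negation, which is partly why a raw positive/negative balance seems simplistic.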


I would also note that such “supervised learning” machine learning applications that leverage NLP (natural-language-processing tools, which are used in, but are not by themselves good examples of, AI techniques) are now a standard “first stage” of machine learning, one that typically evolves toward some form of neural network-based improvement, just as the “computer vision” example noted above did in subsequent iterations over the last five-plus years.
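That standard “first stage” can be sketched as a bag-of-words supervised classifier; here is a minimal multinomial Naive Bayes (with Laplace smoothing) on invented toy sentences, not any production pipeline:

```python
# A minimal "first stage" supervised text classifier: bag-of-words
# features with multinomial Naive Bayes and Laplace smoothing.
# Training sentences and labels are invented toy data.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns (priors, word counts, vocab)."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in docs:
        priors[label] += 1
        for w in text.lower().split():
            counts[label][w] += 1
            vocab.add(w)
    return priors, counts, vocab

def predict(text, priors, counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    def score(label):
        total = sum(counts[label].values()) + len(vocab)  # smoothed denominator
        s = math.log(priors[label] / sum(priors.values()))
        for w in text.lower().split():
            s += math.log((counts[label][w] + 1) / total)
        return s
    return max(priors, key=score)

docs = [("shares rally on earnings beat", "positive"),
        ("profit warning sends stock lower", "negative"),
        ("record revenue and raised guidance", "positive"),
        ("lawsuit and weak demand hit outlook", "negative")]
model = train(docs)
print(predict("earnings beat and raised guidance", *model))  # -> positive
```

A neural “second stage” would replace the hand-counted features with learned representations, which is exactly the evolution the computer-vision example followed.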

Good stuff.

For the Elijah Mayfield reference, see:

  • Mayfield, E., & Rosé, C. P. (2013). LightSIDE: Open Source Machine Learning for Text Accessible to Non-Experts. Invited chapter in the Handbook of Automated Essay Evaluation.
  • Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Annual National Council on Measurement in Education Meeting, March 29, 2012, pp. 1–54.

What’s goin’ on?

It happens that What’s Going On is the eleventh studio album by soul musician Marvin Gaye, released May 21, 1971.

Forty-six years later, now more than ever, I have to ask: What’s Goin’ On?



Data Science Bowl 2017 – more AI for medicine and medical images

I was interested to read the piece in the MIT Technology Review,

Million-Dollar Prize Hints at How Machine Learning May Someday Spot Cancer

A million-dollar prize certainly grabbed some headlines, but the details of the winning solution (more image annotations, e.g. more trained doctors / technicians, plus partitioning the basic problem into (a) finding nodules and (b) diagnosing cancer) are clear signposts to the future. Indeed, the future of low-dose CT scans is certainly looking stronger.  And while progress with machine learning, medical imaging, and diagnostic medicine is not always linear (or straightforward, as we read here), 3D images that capture relative tissue density and other characteristics clearly provide a highly construct-relevant feature set, one that is making advances in this area steady and promising (editorial: in a way that other work relying on indirect features and characteristics (computational linguistics, e.g. “is this argument convincing?”) is not yet keeping up with…).

Since Google’s acquisition of Kaggle, I have not taken a new look at the Google tool set for creating deep-learning networks, but introducing a “semantic data layer” based on a semantic-grammar approach to rubric construction might offer a promising path to better machine understanding of text and speech.



Critical Thinking Assessment

Often, in the context of large-scale testing programs, “critical thinking assessment” is represented more by “information synthesis”, “reading comprehension”, “problem solving” or other exercises that require an examinee to make a claim and cite evidence and reasoning to support it.


In some contexts this is also called “argumentative writing,” much as the “analyze an argument” question on the GMAT was once a common “analytical writing” task. But only one program that comes to mind, CAE’s Collegiate Learning Assessment Plus (or Minus, or Pro, or whatever the marketing types want to call it this year), does or did at one point break out “problem solving” and “analytic reasoning & evaluation” as dimensions on a rubric for a performance task, although they may have moved toward a generalized “analysis and problem solving” dimension in current exams.


In any event, the big news today is that I have discovered EXACTLY the self-paced, student-centric, topic-organized critical thinking product and platform I have long envisioned would replace the beloved “SRA Reading Cards” of my youth.  A group in Chicago has created a modern, digital version of this tool, organized as a set of subject-matter topics, grade / difficulty sequenced, that (hopefully) are as interesting and “teachful” as the SRA reading-card stories and articles were. Only here, students WRITE about what they read, not just answer MCQs.  And they are taught to cite evidence, make claims, explain reasoning, even identify counter-arguments!  Great stuff.

Read more about them at



Now that the Big Questions are Settled… Puppet or Chef?

My longstanding quest for the best tool set for rapid, flexible and powerful development (where for me “powerful” includes robust SQL database support, strong dictionary / NoSQL data support, access to powerful statistics libraries, machine-learning libraries, NLP and other libraries, and support for web apps and RESTful back-end services) has ended: I have settled on Python 3, MySQL / Aurora, PyCharm + Sublime, NLTK, numpy, flask and the rest.  I am already happy with my productivity, and have recently recognized a need to move from OS X to Linux for more of the heavy lifting.  Which, naturally, means everything is going to AWS…
AWS has improved in a hundred ways in the last three years, and when I was last certified I thought it was the best thing ever. So on January 31 I hope to be re-certified, but I have already begun to migrate my personal projects and infrastructure to the cloud… I expect this to take months as I interleave it with ongoing development and research projects.

All of which is good, and AWS goes a long way for me toward making infrastructure into software.  But there is another area I want to understand better and apply in my quest for more efficiency, and this raises the question in the title: Chef or Puppet? I won't throw in Ansible or Salt as this article does. Based on some of what I am reading, perhaps my Python penchant might argue one way, whereas my striving to use workplace-relevant tools and approaches across the board may weight my choice against what is optimal for my current daily workload.

Another factor might be how well AWS integrates with either product, which would weigh more heavily than my personal Python needs, as Ruby and other toolsets are likely to be more important to many of my future clients / employers.

I also should give a big SHOUT OUT (is that big?) to James and his team at LinuxAcademy, who continue, five or seven years in, to innovate and do a fantastic job of providing top-flight hands-on training for AWS / Linux / Azure devs, sysops and architects.  Fantastic performance for a small firm that obviously has its priorities right!

But back on Chef vs. Puppet: I will find or create a comparison and figure out whether either is going to save me cycles, make me more efficient, or just slow me (and my small team) down!




Listening to the Data – Four ways to tweak your Machine Learning models

Bias and variance, precision and recall: these are concepts that, after a few months or maybe even just a couple of weeks of crawling around in actual data, predictive models, and the study of where prediction and reality meet, begin to have an intuitive feel.  But it was nice to read recently a short piece that brings these concepts clearly into focus and frames them in terms of model behavior.  This is something I will keep handy to share where my own jabbering on the subject is likely to be less clear and certainly less concise.  The source of the article was (via re-post) the KDnuggets blog, which is an excellent resource.
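For concreteness, precision and recall fall straight out of a confusion matrix; a quick sketch, with the label vectors invented for illustration:

```python
# Precision and recall from raw label vectors:
#   precision = TP / (TP + FP)  -- of what we flagged, how much was right
#   recall    = TP / (TP + FN)  -- of what was there, how much we caught
# The actual/predicted vectors below are invented toy data.

def precision_recall(actual, predicted, positive=1):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]  # 3 TP, 1 FN, 1 FP
print(precision_recall(actual, predicted))  # (0.75, 0.75)
```

Tweaking a model's decision threshold trades one off against the other, which is where the “listening to the data” intuition comes in.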

There are, perhaps unsurprisingly, many good “nuggets” on the KDnuggets blog / web site. And this latest item does a good job of explaining what is at some point intuitive to people who work with machine-learning models regularly.  Perhaps this is particularly relevant to modeling and mining “text”, the work I have been doing in machine learning, because it certainly is spot on. But this is really a way of describing how the math models the real world, and how the data is reflected in the math, so I expect this view is likely helpful to anyone modeling data.

The somewhat “click-bait”-sounding title, “4 Reasons Your Machine Learning Model is Wrong,” is only modestly apologized for with the “(and How to Fix It)” suffix, but it makes me worry that fake-aggressive, pretend-demeaning discourse could be among the worst forms of carry-over from 2016 into 2017.

I will instead remember that genuinely aggressive, demeaning discourse is worse… and continue to appreciate the sharing that sites like this do for the larger community.

Happy New Year!



The new PISA! Still Reviewing and Reading…

So much progress, so much exemplary assessment. So much data.

This will take some number of weeks to process… and I will try to link to the best bits. So far the Economist treatment is looking pretty thorough.

And of course the PISA web site itself has a visualization tool… I think…

The interactive “problem solving” exercises, using “MicroDYN” systems and “finite-state automata” in particular, look really interesting.

The test is here; more results and more start here.