Machine Learning for Text in the News (again): Finance

The Economist ran a short but interesting piece this week entitled “Machine-learning promises to shake up large swathes of finance”, under the heading “Unshackled algorithms” (located here).

Many of the usual observations and platitudes are contained therein, but I thought these quotes were notable:

  • Natural-language processing, where AI-based systems are unleashed on text, is starting to have a big impact in document-heavy parts of finance. In June 2016 JPMorgan Chase deployed software that can sift through 12,000 commercial-loan contracts in seconds, compared with the 360,000 hours it used to take lawyers and loan officers to review the contracts. [So maybe once again I am focused on one of the least remunerative aspects of a new technology…]
  • Perhaps the newest frontier for machine-learning is in trading, where it is used both to crunch market data and to select and trade portfolios of securities. The quantitative-investment strategies division at Goldman Sachs uses language processing driven by machine-learning to go through thousands of analysts’ reports on companies. It compiles an aggregate “sentiment score” based on the balance of positive to negative words. [Seems a bit simplistic, no?]

  • In other fields, however, machine-learning has game-changing potential. There is no reason to expect finance to be different. According to Jonathan Masci of Quantenstein, a machine-learning fund manager, years of work on rules-based approaches in computer vision—telling a computer how to recognise a nose, say— were swiftly eclipsed in 2012 by machine-learning processes that allowed computers to “learn” what a nose looked like from perusing millions of nasal pin-ups. Similarly, says Mr Masci, a machine-learning algorithm ought to beat conventional trading strategies based on rules set by humans. [This data point matches, over the same timeframe, what Elijah Mayfield showed: that off-the-shelf, open source machine learning could, with days of work, produce results (for scoring essays) competitive with the capabilities of decades-old rule-based systems (e-Rater, Intelligent Essay Assessor and six others). See note below.]
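
To make my “simplistic” remark about the Goldman Sachs quote concrete, here is a minimal sketch of what a “balance of positive to negative words” score could look like. This is my own illustration, not anything described in the article: the word lists, the function names (`sentiment_score`, `aggregate_score`) and the normalization to a range of -1 to 1 are all assumptions.

```python
import re

# Hypothetical word lists, for illustration only. A real system would
# presumably use a much larger, finance-specific lexicon.
POSITIVE = {"beat", "growth", "strong", "upgrade", "outperform"}
NEGATIVE = {"miss", "decline", "weak", "downgrade", "underperform"}


def sentiment_score(text):
    """Score one report as (pos - neg) / (pos + neg), from -1.0 to 1.0.

    Returns 0.0 when no word from either list appears.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def aggregate_score(reports):
    """Average the per-report scores (assumed aggregation rule)."""
    scores = [sentiment_score(r) for r in reports]
    return sum(scores) / len(scores) if scores else 0.0
```

A counter like this knows nothing about negation (“not strong”) or context, which is roughly why the approach struck me as simplistic.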


I would also note that such “supervised learning” applications that leverage NLP tools (natural-language processing tools, which are used in, but are not by themselves good examples of, AI techniques) are now a standard “first stage” of machine learning that typically evolves toward some form of neural network-based improvement, just as the “computer vision” example noted above did in subsequent iterations over the last five-plus years.

Good stuff.

For the Elijah Mayfield reference see:

  • Mayfield, E., & Rosé, C. P. (2013). LightSIDE: Open Source Machine Learning for Text Accessible to Non-Experts. Invited chapter in the Handbook of Automated Essay Grading.
  • Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Annual National Council on Measurement in Education Meeting, March 29, 2012, pg. 1-54.

TensorFlow is released: Google Machine Learning for Everyone

Google posted information about TensorFlow, the open source release of a key set of machine learning tools, on their Google research blog here.

Machine learning typically involves great piles of multi-dimensional tables (or arrays) of data and, at least for us primitive users, tremendous shovel work in massaging and pushing around these giant piles of data files (and sorting out the arcane naming schemes devised to help with this problem is almost a worse problem itself). Given that, the appellation “Tensor Flow” for a tool to help with this is at first blush very promising. That is, rather than just a library of mathematical algorithm implementations, I am expecting something that can help make the machine learning work itself more manageable.

I suspect that just figuring out what this is will cost me a few days… but I have much to learn.



Fault Tolerance in Distributed Systems

Perhaps we are nearly at the point where saying “distributed systems” is as redundant as “software program” always has been, but for the moment I want to consider how a specific issue is heightened by the nature of modern, asynchronous systems, and that issue is “fault tolerance” generally as well as “cascading failures” specifically.

More and more such issues arise, and I was pleased to read a particularly lucid explanation of a popular and important design pattern used in many solutions: the Circuit Breaker pattern. It is on Martin Fowler’s blog, haha. I was kind of surprised by that, but only because I don’t google interesting problems in architecture and design nearly as often as I’d like.

I can’t add any value to what he’s written here, so instead I will just quote briefly:

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

He goes on to cover adding a capability to attempt an automatic reset (at some specified interval) and discusses other real-world refinements (e.g. different thresholds for different sorts of errors). A hallmark of this sort of writing is that, at least for most of its intended audience, a simple example provided in detail, plus pointers to additional flourishes and add-ons, is really all that is needed.
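
For readers who want to see the shape of the idea in code, here is a minimal Python sketch of the pattern as quoted above, including a simple timed reset. It is not Fowler's implementation (nor Hystrix's): the class name `CircuitBreaker`, the `CircuitOpenError` exception, the parameter names and the default values are all my own assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    """Wrap a function; stop calling it after repeated failures."""

    def __init__(self, func, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds before a retry
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means "closed"

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without making the protected call.
                raise CircuitOpenError("circuit open; call not attempted")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and clear the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look something like `breaker = CircuitBreaker(fetch_remote)` followed by `breaker.call(url)`, where `fetch_remote` is a hypothetical function that talks to a remote service. The monitoring alert Fowler mentions is left out; it would hook in at the point where the breaker trips.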

Courtesy of Martin Fowler

Check it out!  And if you googled this topic, doubtless you have read or seen something about Netflix’s Hystrix, which says on its GitHub landing page:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

It is a Java implementation; there are other articles linked here, and links to alternative circuit breaker implementations in Ruby, Java, Grails Plugin, C#, AspectJ, and Scala are listed at the bottom of the Fowler blog post.

Sample iOS App: internationalization (i18n), language learning & UIKit

I was pleased to see the ClustrMap graphic for this blog showing such a diverse geographic audience, which started me thinking about the importance, in many (but not all) cases, of internationalizing device-based Apps to maximize the potential audience.

As this blog has featured various bits related to iOS development, I thought I’d remind readers that I have posted in my public git repository the source code for a fully internationalized language learning app for iPad (cn4l2 in the repo).

This is a fairly straightforward and structurally simple “view” App: a “game board” type primary window with a display area (set of tiles) and the usual score counter, timer and control buttons to operate the “game”.  Chinese 4 Beginners is a game designed primarily to help speakers of Indo-European languages (specifically French, German, Spanish, Italian and English) learn to recognize Hanzi, or Chinese characters.  In particular it was meant to help me practice Hanzi recognition, and to help me work through important sub-elements of how characters are formed, and how components can provide clues to sound and meaning.  As I worked through ways to remember various characters or elements, I compiled a short book of these basic and important words and word elements (based on my experience trying to learn Chinese, and then moving to live and work in China for three years, entirely in Hangzhou and initially, in 2007, to work for Hundsun Technologies, a Shanghai-listed software firm).  I organized the basic elements I was trying to learn into sixteen sets of 18 words, created a “book” to describe the mnemonics I had found useful, and then made the book available both as a free PDF (see link at bottom of page) and in the Lulu book store.

The App shows one particular set of design trade-offs for internationalization.  In particular, rather than resizing labels or adjusting fonts dynamically as the UI adjusts to particular user languages, from German to Korean, I created language-specific versions of the main “board” and the five or so pop-over / modal screens (about, settings, instruction, username and high scorers), so that I could examine / QA the appearance of all the various game elements in the most efficient manner.  Of course, everything could have been set dynamically based on the UI language choice, but the complexity and amount of code required, even to support 8 languages, still seemed to give the advantage to this approach.  Had I been shooting for 12 languages from the beginning, or certainly 20, I’d probably have gone with the fully dynamic, program-based approach.

A more sophisticated program could also dynamically create the “word tiles” from a database at the device-specific maximized resolution; a more ambitious one would have included “sound files”, or found text-to-speech APIs to use / include (as are available natively in Android for most of these 8 languages).

But this program helped teach me iOS, Xcode, and many other basic aspects of writing an app for customer use.

It has been interesting to see and compare how the iPhone and iPad apps (they are basically the same but optimized for the specific device, with a couple of extra features in the iPad version) sell: at what price points and to whom.  There are eight titles, one each for learning to recognize words in English, Chinese, Korean, Japanese, Spanish, German, French and Italian, for speakers of these same 8 languages.  So while I can tell from the title which language is being learned, the speaker’s language can at best be inferred from the store in which the product was purchased (since all 8 programs for iPhone and iPad have all 8 language UIs).  Some English and French speakers clearly buy the “Wordz for Beginners” in their own language to teach youngsters just learning to read; but most sales are to second language learners, and the clusters of L1 and L2 language pairs are interesting and fairly constant over four years, with Koreans as language learners at the top of the list, and more intra-Asian language learners than westerners learning C/K/J and vice versa.  That said, the sales by language to learn generally follow the proportion of speakers in the OECD countries, but the ranking of countries as “language learning” buyers is a function of history, culture, iOS prevalence, and economic development and size.  So I am expecting a surge of Chinese buyers (Welcome, China Mobile!) on the one hand and, I suspect, learning of Italian to remain near the bottom of the list on the other.  Just for fun I will get the latest sales and downloads by product info from App Annie and post a graphic below.  (n.b. PE, aka Pocket Edition, denotes an iPhone app; otherwise they are iPad apps.)  And perhaps next year I will try to use data analytics and visualization to derive and show the sales of language learning products by language of the learner.

Sales Numbers Summarized by App Annie.


Open Source Solutions: Q&A / community capability: OSQA

Open Source Question and Answer (OSQA) is a free, open source, Python-based application that can be deployed to provide sophisticated community support to almost any online endeavor.

As they say on their site: “OSQA is the free, open source Q&A system you’ve been waiting for. Your OSQA site is more than just an FAQ page, it is a full-featured Q&A community. Users earn points and badges for useful participation, and everyone in the community wins.”

The system includes registration, posting with comments, and mechanisms to award participation points through interaction or community voting.  The site is not unlike the familiar Stack Exchange, but is a competitive alternative (OSQA says LinuxExchange switched to OSQA from the Stack Exchange platform).

I note that Udacity is using OSQA, and so far they’ve made lots of good choices.

Yet More Reasons for Box2D Compilation Issues

And a one-shot, elegant solution.

More reasons for Box2D projects not compiling mostly boil down to a) various C++ vs. Objective C conventions and practices and b) build settings.  If you are getting zillions of compile issues from a project, especially a tutorial or sample you pulled down from the web (which must have worked for someone, somewhere, once), the major culprits are often:

  • Wrong version of Box2D.  Cocos2d and Box2D and all the various open source and 3rd party supplied tools you might use to “leverage” your productivity bring with them varying degrees of stability and varying practices with regard to “backward compatibility”.  My initial impression is that most of the tools I am looking at or using are too new to be worried (yet?) about “installed base” and inconvenience for users (who are all, for the most part, using “free” stuff, I should note).  The wrong version may manifest itself as different APIs: different method names, different arguments, different assumptions about data types both internal and external (e.g. a sample expects the particular version of a package it was integrated with, but you have already installed a newer or different version, probably to work with some other project!). [I think there is a small point here related to how “free” open source stuff can turn out to be in some circumstances…]
  • Objective C vs. C++ conventions.  Where [ #import "constants.h" ] might appear in your brain to be congruent to [ #import <constants.h> ], look closely: they are different (the quoted form looks in the including file’s own directory first, while the angle-bracket form looks only in the configured header search paths).  It is easier to spot when you see these forms of compiler directives mixed together in the same code module, as they are in Cocos2D and elsewhere.  So this should lead to a check of the relevant “compiler directives” (such as header search paths) in your build settings.  This should be the first thing you think of when you are getting “include file not found” even as you are staring at your project folder with ALL the necessary files in place.
  • Code file naming conventions.  With .h and .m files typically used in Xcode for Objective C, C++ conventions and compiler approaches to code type detection may vary.  Remember that files containing C++ should be .mm (Objective-C++) or .cpp, and in any event it is worth learning how to find and read Xcode build file settings.  There are specific directives about how the compiler should determine source code language, including what rules to use and how to apply them.  Check it out.
  • Near the top of my list of particularly strange errors (like “include file is missing” when it clearly is not) are error messages that say “Could not validate build due to missing Icon.png file.  Icon.png must be 57×57 pixels and found as 0 x 0 pixels”.  The file was there, of course, and easily verified to be the right size.  In the end this, too, was a build file setting issue: in Xcode 4.2+ I “fixed” a project by setting “pre-compress png files” to “NO”.  Not that I’d ever heard of that setting or tried to set it myself before.

The good news is that there is, at the moment, an excellent way to get started and cure all these problems, at least for projects you create.  Check out Kobold2D.  (I link to the site and to Steffen Itterheim’s blog on my sidebar.)  Creating your project from the Kobold2D templates gives you a complete app shell with all the third party packages you might want to include and bulletproof build files that remove ALL the sorts of problems identified above.  The limitation of this approach is that those great ‘sample projects’ you are working on are usually best re-implemented in a clean project template; the possibility of “fixing up” a project that has already packaged in the wrong version of something, or has build settings mangled beyond recognition, is slim to none.  I started down the path of upgrading an old Cocos2D project per the Kobold2D advice, and quickly saw it would be more work than starting the project over and moving in the bits of code that were of interest.
I am hoping Kobold2D is good for the long run, but right now it adds enough by itself that it is worth making part of your base development environment — particularly for newcomers to the platforms or tools.