There is a massive difference between generating the input data needed by a Machine Learning model once (to prove a concept) and doing it continuously, indefinitely, at scale, and within short time periods. Years of experience working with disparate and imperfect data sets have led me to suspect that people trying to move desktop-scale ML algorithms into production are likely to massively underestimate the time and energy required to get the data needed by ML algorithms into a form where they can use it.
“Garbage In, Garbage Out” applies to Machine Learning…
I recently read a Harvard Business Review article by Thomas C. Redman in which he pointed out that without data quality, Machine Learning algorithms are doomed to fail in the real world. This resonated with me, as I’ve spent a significant part of my professional career managing, wrangling, cleaning, and otherwise working with large data feeds. I know from bitter experience that quality data isn’t an accident. Thomas’s article focuses heavily on the processes that need to be put in place for one to have any chance of success; rather than plough the same furrow, I’m going to discuss the ground-level practicalities of data quality and what it means if ‘timeliness’ is an inherent part of your data quality because your ML algorithm assumes up-to-date information.
1. “Ludic Fallacy” means your model can never be fully accurate.
Any model suffers from what Nassim Nicholas Taleb calls the “Ludic Fallacy” – you lose critical detail by turning reality into a game, or in our case a model. For example: Monopoly doesn’t have earthquakes. We simplify like this to keep the model from becoming unusably complex, and also because some events (such as Brexit!) have long-term implications that are currently unknown and open to subjective interpretation. Data left out of the model may explain all manner of otherwise inexplicable events.
All forms of data modeling have to simplify reality, and in the process some fidelity is always lost. This creates risks for Machine Learning, as real world data may be more prone to modeling issues than the training data used for the Proof of Concept (POC).
The obvious solution to this is to add more detail to the model and have more fields, tables, relationships etc. But the more detailed you make the model the harder it is to work with and understand. This also assumes that you can get more data – in a lot of scenarios ML is expected to extract value out of existing, legacy streams of data.
2. The data in your model will always be slightly wrong.
In my entire career working with data I have never once seen a situation where a large set of real world data was a 100% accurate reflection of reality. Real world data streams are always imperfect. People fat finger numbers, data is sometimes missing, hardware reboots and sends data from Jan 1 1970 until its clock is reset, specifications are interpreted creatively, the wrong zip code is used, unique records aren’t, unique identifiers change, a data item sometimes has trailing spaces – the list of ways data can be ‘off’ is more or less infinite.
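A few of these failure modes are cheap to screen for mechanically. The sketch below is illustrative only – the field names (`ts`, `zip_code`, `item_id`) and the specific rules are assumptions for the sake of example, not a complete validator:

```python
from datetime import datetime, timezone

# A rebooted device often reports the Unix epoch until its clock is set.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def validate(record, seen_ids):
    """Return a list of data-quality problems found in one record."""
    problems = []
    ts = record.get("ts")
    if ts is None:
        problems.append("missing timestamp")
    elif ts == EPOCH:
        problems.append("timestamp is the Unix epoch - clock probably unset")
    zip_code = record.get("zip_code", "")
    if zip_code != zip_code.strip():
        problems.append("whitespace padding in zip_code")
    item_id = record.get("item_id", "")
    if item_id in seen_ids:
        problems.append("supposedly unique item_id seen before")
    seen_ids.add(item_id)
    return problems
```

Checks like these catch the mechanical errors; the creatively interpreted specifications are much harder to automate away.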
3. Merging multiple data streams is inherently difficult and prone to error
The fun really starts when your model requires you to merge multiple streams from different sources, a task that is merely awkward at desktop scale but can become overwhelming when real-world volumes are involved. The single most commonly overlooked factor I’ve seen here is time – one of your streams will be ahead of or behind another. This creates all sorts of opportunities for mischief and chaos. Until you’ve dealt with these issues in the field, it can be hard to understand just how difficult life becomes when you need to merge three streams, one of which is 30 minutes behind and another stops and starts randomly.
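Assuming each stream is at least internally ordered by event time, the offline half of the merge can be sketched as a lazy k-way merge. The hard part this sketch deliberately dodges is the live case, where the fast stream has to be held back until the lagging one catches up:

```python
import heapq

def merge_streams(*streams):
    """Interleave time-ordered streams of (event_time, payload) records.

    heapq.merge is lazy, so no stream is loaded fully into memory.
    This only works offline or on complete streams: a live merge must
    also buffer records until the slowest stream's clock has passed
    them, or late data will be emitted out of order.
    """
    return heapq.merge(*streams, key=lambda rec: rec[0])
```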
The second biggest challenge will be joining streams of data that were never intended to be joined. The causes of failure are many and varied, but subtle differences in the models used to define the streams are usually a prime suspect. In one case I encountered, a ‘shipment’ was a collection of physical things you take from a warehouse, put in a box and send to a customer. For a different part of the same company, a ‘shipment’ was a contractual relationship with a customer – it could include a line item that obliged the company to provide spare parts for 7 years or provide telephone support. It also used custom part numbers created by adding a contract number to a base part number. Joining the two streams was a nightmare.
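As an illustration of the kind of shim this forces you to write – and purely hypothetically, since the real scheme was messier – suppose the contract part numbers had been built as ‘&lt;contract&gt;-&lt;base part&gt;’:

```python
def base_part_number(part_number):
    """Strip a hypothetical '<contract>-' prefix from a part number.

    Assumes contract numbers never contain a dash, so everything after
    the first dash is the base part number usable as a join key.
    """
    contract, sep, base = part_number.partition("-")
    return base if sep else part_number
```

A one-liner like this looks trivial, but every assumption baked into it (no dashes in contract numbers, prefix always present) is a fresh way for the join to silently go wrong.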
Then we have to contend with the fact that what is apparently the same data can be represented differently in different systems. Spellings, codes and accented characters can all differ between streams. In one case, I found a system had 28 different ways to spell “Taiwan.”
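The standard mitigation is a canonicalization pass – fold case and accents, then map known variants onto one spelling. The variant table below is illustrative; it is not the actual list of 28 spellings:

```python
import unicodedata

# Keys are pre-folded (lower case, accents stripped); values are canonical.
CANONICAL = {
    "taiwan": "Taiwan",
    "taiwan, province of china": "Taiwan",
    "republic of china": "Taiwan",
}

def canonicalize(name):
    """Map a free-text country name onto a canonical spelling."""
    # NFKD splits accented letters into base + combining mark, which we
    # drop, so 'Taiwán' and 'TAIWAN ' both fold to the key 'taiwan'.
    folded = unicodedata.normalize("NFKD", name.strip().lower())
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    return CANONICAL.get(folded, name.strip())
```

The table itself becomes an asset you have to maintain, because new misspellings keep arriving.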
Another issue is change management – if you are relying on other people’s second hand data, it can and will change without warning. Minor format changes could break things badly. Change management will almost never be an issue at the POC or hand joining of streams stage, but can become deeply problematic later.
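One cheap defence is to validate every inbound record against the schema you believe you agreed on, so a silent upstream change shows up as an alert rather than a subtly corrupted feed. The expected fields and types below are assumptions for illustration:

```python
# Hypothetical contract for an inbound feed: field name -> expected type.
EXPECTED = {"order_id": str, "qty": int, "shipped": str}

def check_schema(record):
    """Return a list of deviations from the expected schema."""
    issues = []
    for field, expected_type in EXPECTED.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in EXPECTED:
            issues.append(f"unexpected new field: {field}")
    return issues
```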
4. As volumes increase, life will get much harder.
While people generally understand the direct implications of higher volumes, there are indirect side effects that can be deeply problematic. The first is that development and testing cycles get much, much longer, simply because of the time required to marshal all of the data. A lot of performance problems will only be visible when working with very large subsets of the data, so fitting everything onto a laptop will no longer be an option. The economics of these very large subsets may become a gating factor, as there will be fewer test environments than developers. It’s the difference between doing valet parking and doing valet parking for oil tankers – it’s the same simple task, but increasing scale makes it much harder.
5. Taking an ML model from desktop POC to running in production implies a massive, continuous effort.
Let’s assume we’ve managed to cope with points 1 through 4. What’s next? The reality may turn out to be that we spend far more time and energy wrangling the data into a form our ML engine can use than we did on the engine itself. In a scenario where my model needs a number of data points from many sources to generate one output, we’d suddenly find ourselves on an endless treadmill of managing and wrangling these data feeds – not just once but in real time, 24 hours a day, 365 days a year.
6. What happens if time is of the essence?
All of the above rests on an unstated assumption that we can tolerate a lag of somewhere between thirty minutes and two hours between when an event happens in the outside world and when it becomes usable in a feed. But what if we need all this faster? What if we need it in less than a second? While technologies like Kafka, Spark and Kinesis are good for time ranges down to a few seconds, below that you’ll need something like VoltDB, which is capable of absorbing hundreds of thousands of messages per second while still providing the sub-millisecond responses that may be required to turn a machine learning engine into a source of income and ultimately profit.
While the conceptual foundations of Machine Learning are fairly solid, I would argue that not nearly enough thought has gone into how to go from the desktop ‘proof of concept’ level to 24/7/365 production, and arguably almost none into how to do this in a millisecond time range. I would say that as a rule of thumb, it can take ten times as much effort to go from desktop POC to production, and potentially ten times more than that to shrink timescales below a second. A lot of current high volume processing technologies are simply not designed to work in timescales of under a second, which is where products like VoltDB come in.