There are things to fill and things to empty, things to take away and things to bring back, things to pick up and things to put down, and besides all that we have pencils to sharpen, holes to dig, nails to straighten, stamps to lick, and ever so much more. Why, if you stay here, you’ll never have to think again.
The Terrible Trivium, “The Phantom Tollbooth”
I saw this graphic recently, and my immediate thought was: “We’re doing this wrong.”
Look at that graph: it says that 80% of the time spent on an ML project goes to dealing with data. Cleaning it, labeling it, augmenting it, and identifying it. Doesn’t that sound… tedious?
Like anyone who’s worked in ML, I’ve spent a fair bit of time curating datasets. I’ve personally gone through thousands, if not tens of thousands, of images of boxes, 3D scans of feet, and audio clips of people reading aloud, all to gain insight as to how to extract patterns from the data. And let me tell you, it’s not that much fun. But the dream, the dream is that once we have those insights, once we create a robust learning system, we just shovel more data into it and get wonderful results out the other side.
But that’s not the case. In reality, even after we have a learning system, data still needs curating. The data needs to be carefully screened and maintained, lest it damage the delicate insides of our fragile ML algorithm.
There are many reasons we do this. For starters, making an ML system robust to bad data (outliers, malformed or incomplete labels, etc.) is hard. Particularly in the early stages of a project, it’s easier and faster to just sanitize the data, so you can test your approach sooner and see if it’s viable. After all, why waste time making a system robust if it doesn’t actually work for what you want it to do? Once it works, it’s also often easier to just keep sanitizing data in production. After all, it works, and there’s already a process for doing so. Besides, there are other, more interesting problems to solve.
However, I suspect another, and perhaps even more important, reason we spend so much time on data is that it’s cheaper than building better systems (in the short term). AI engineering is a highly specialized and expensive skill (I should know!), but data curation, while tedious, is cheap and easy to teach. With the advent of crowdsourcing (Amazon Mechanical Turk was among the first, but now there are dozens of companies clamouring for a chance to maintain your data), entire multi-thousand-point datasets can be processed quickly (in parallel), with a high degree of reliability, and at a known, fixed cost. For an independent developer, or a small team, it certainly beats doing it yourself.
But it still feels wrong. The whole premise of robots is that they do work so humans don’t have to. Traditionally, these jobs fall into one of three categories: The Dirty, The Dangerous, and The Dull. Cleaning sewers is dirty. Disarming improvised explosive devices is dangerous. Picking up packages and putting them on a conveyor is dull. Now, with the wonders of robotics and ML technologies, we have freed people from those jobs, and now they, and countless others, get to sit at their computers for hours at a time, clicking on images.
Let’s see that data again.
I look at this and take away that productized ML of the sort analyzed here is still in its infancy. It’s still very much a research project, where all sorts of manual tweaks and steps are done in order to more quickly get to the answer to “Can we do this?” As we mature, we need to focus on building ML systems that are less picky about, and more efficient with, what they ingest. (I believe this will also be cheaper in the long term.)
FOUR AREAS OF ML WORK TO FOCUS ON:
Specifically, I think there needs to be a focus on building systems that:
- Use Raw data. That is, data in a raw, minimally processed form. This includes unlabelled data, data with wrong or missing labels or missing portions, and any outliers or corner cases that are encountered. Recent advances in unsupervised and semi-supervised learning are taking this approach. Being able to use raw data can enable a system to better
- Use In-situ and In-operational data. Many ML systems are trained on manually collected (and potentially artificially augmented or simulated) data that is expected to be representative of the distribution of data that the system will encounter during operation. But, “the world is its own best model,” and pulling data from actual operation (so-called continuous learning) can lead to a better fitting system. Along with that, we should
- Use more specific models. Many, many, MANY applications these days are using the same generic models to fit their data. These models, such as the DNN models YOLO, VGG, InceptionNet, etc., have been tested on benchmarks and have good results there. And, as general models, they should transfer to new tasks, given sufficient data. But getting, cleaning, and maintaining that data is tedious. Using a model more tailored to your use case can more efficiently leverage the data that you have, and the prior information that you know, to better accomplish your task. Of course, no matter how good the model, we’ll always need
- Human oversight for exceptions. For any learned model, there will always be exceptions. Rare outliers, changes in data distribution, and unanticipated uses can lead to systems being applied in situations they were not designed for. Keeping human oversight for these systems, such as supervised autonomy, is key.
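To make the first and last points a bit more concrete, here is a minimal sketch of one way they could fit together: the system keeps the raw samples it is confident about and hands the rest to a person. This is only an illustration, not a recipe. The names `triage` and `toy_model`, the 0.9 threshold, and the idea of using the top class probability as the confidence signal are all my own assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Triage:
    accepted: List[Tuple[float, int]]  # (sample, predicted label) handled automatically
    for_review: List[float]            # samples deferred to a human


def triage(
    samples: Sequence[float],
    predict_proba: Callable[[float], Sequence[float]],
    threshold: float = 0.9,  # assumed cut-off; would need tuning for a real system
) -> Triage:
    """Split raw samples into confident predictions and exceptions for a human."""
    accepted: List[Tuple[float, int]] = []
    for_review: List[float] = []
    for sample in samples:
        probs = predict_proba(sample)
        # Index of the most probable class.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            accepted.append((sample, label))
        else:
            for_review.append(sample)
    return Triage(accepted, for_review)


def toy_model(x: float) -> List[float]:
    """Hypothetical stand-in for a trained two-class model."""
    p = min(max(x, 0.0), 1.0)
    return [1.0 - p, p]


result = triage([0.02, 0.5, 0.97, 0.6], toy_model, threshold=0.9)
print(len(result.accepted), "handled automatically,",
      len(result.for_review), "sent to a human")
```

In this toy run the two ambiguous samples go to a person and the other two are handled without anyone clicking on anything. The accepted pairs could also be fed back as pseudo-labels for retraining, which is one simple flavor of semi-supervised learning, though how well that works depends heavily on how trustworthy the model's confidence actually is.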
Robots are here to make our lives better and free us from what we don’t want to do. But their ML systems require too much tending right now. Let’s help them be more self-sufficient.
I hope you found this article informative. Get in touch if you’re interested in learning more — firstname.lastname@example.org