You’ve been collecting a powder keg of data underneath your company’s products. Your intuition says all it needs is a timely spark, and opportunity awaits. Yet there you sit, not knowing whether to use a match, a torch, or brute-force friction to get it fired up. Should you hire a data scientist right out of the gate? Should you unleash a couple of curious engineers on your data? How do you maximize the likelihood you’ll get fireworks instead of sending resources up in flames?
This article will help you set expectations for your machine learning (ML) undertaking, and describe the planning and human assets you’ll need for success. It’s a complex field to branch into, so a proof-of-concept project is a prudent way to get a feel for what’s possible when you weave predictive modelling into your product and operations. If you’re a CTO, tech architect, product owner, or otherwise ready to structure a machine learning project, this article is for you.
First, adopt an attitude of exploration. We encourage you to think about machine learning not as a single project, but as another angle from which to analyze a problem or opportunity. As with any other analysis, articulate the objective clearly at the outset. Are you trying to increase conversions on your website? Do you want to predict equipment maintenance needs before failures happen? Maybe you’re mining your data to surface ideas for new product lines, or you’re seeking a bit of IP to help pull ahead of your competitors. You’ll learn quickly that ML algorithms won’t be able to forecast macro-level phenomena, like growth in market share. Where they’ll shine is in discrete tasks, like predicting the budget of a user on your e-commerce site, or organizing a pile of data on customer behavior into discernible segments you can leverage via targeted marketing.
Take the time to clearly lay out the requirements you have for the model or potential ML system. These might include budget and time constraints, and should specify the kinds of questions you’d like to be able to answer once your model is up and running. If your model works well, be prepared to allocate resources to its ongoing maintenance. Integrating a successful model takes planning and resources, so before diving in, think through the ways it should interface with the rest of your infrastructure. The modelling process will result in one of two outcomes: a sufficiently strong prediction of the thing you’re interested in, or a wall you bump up against for any number of reasons. Either way, you’ll be going back to the whiteboard to iterate, so expect that from the beginning.
Pick Your Power
Next, choose between supervised and unsupervised techniques. To use supervised algorithms like regression or a decision tree, your data must contain recorded outcomes (labels) for the thing you’re aiming to predict. For example, if you want to predict when a customer will churn, your training dataset should include examples of customers who have churned, as well as customers who’ve remained active. Supervised techniques allow you to train a model to identify a certain case in the data; the algorithm needs to see a sufficient number and proportion of similar cases to learn the matching characteristics.
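Before committing to a supervised model, it can help to check how many examples of each outcome you actually have. The snippet below is a minimal sketch of that check using only the Python standard library. The `labels` list and the `label_summary` helper are hypothetical, made up for illustration, and what counts as a "sufficient" share will depend on your problem and algorithm.

```python
from collections import Counter


def label_summary(labels):
    """Return {label: (count, share_of_total)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.items()}


# Hypothetical training labels: 1 = customer churned, 0 = customer still active.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

summary = label_summary(labels)
for label, (count, share) in sorted(summary.items()):
    print(f"class {label}: {count} rows ({share:.0%})")
```

If one class turns out to be very rare, that is worth knowing before any modelling starts, since the algorithm may not see enough of those cases to learn from them.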
Unsupervised learning, in contrast, does not require a declared output variable of interest to be effective. Instead, unsupervised techniques are tools for exploring data by organizing and representing it in ways that a human analyst could not easily achieve in a reasonable amount of time. If you suspect that either a supervised or unsupervised approach could be fruitful for your project, we recommend starting with a dive into unsupervised techniques before committing to a specific target for the supervised model; you might find something you hadn’t considered.
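To make the idea of unsupervised exploration concrete, here is a minimal sketch of k-means clustering, one common unsupervised technique, written in plain Python. The customer data and feature choices are hypothetical. Seeding the centroids with the first `k` points is a simplification to keep the example reproducible, and in practice you would more likely reach for a library implementation and scale your features first.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers; returns (centroids, assignments)."""
    # Simplifying assumption: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical customers: (orders per month, average order value).
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]

centroids, assignments = kmeans(customers, k=2)
print(assignments)
print(centroids)
```

No outcome column is supplied here. The algorithm groups customers purely by how similar their rows are, and it is up to you and your domain expert to decide whether the resulting segments mean anything.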
It’s important to draw a distinction between predictive ad-hoc analyses, and predictive systems, both of which can benefit from machine learning techniques. Delivering an accurate estimate of revenue next quarter to your CEO ahead of the board meeting is an ad-hoc analysis; turning that into a live model that posts an up-to-date estimate to the CEO’s internal dashboard every week is a predictive system.
Predictive machine learning systems benefit from software engineering principles like unit testing, designing for reuse and scalability, and minimizing technical debt. Having an engineering mindset from the outset will pay off in spades: in addition to analyzing a problem, an engineer aims to design and build components that form an efficient and repeatable process. Building a predictive system will take longer than performing an ad-hoc analysis, and unless you understand your proposed solution very clearly, we strongly recommend completing an ad-hoc analysis before building an operationalized predictive system.
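As one illustration of what unit testing can look like in a predictive system, the sketch below tests a single feature calculation in isolation. The `days_since_last_order` feature and its rules are hypothetical, invented for this example. The point is only that each step feeding a model can be a small function with its own tests.

```python
from datetime import date
import unittest


def days_since_last_order(today, last_order_day):
    """Hypothetical feature: whole days elapsed since a customer's last order."""
    if last_order_day > today:
        raise ValueError("last order cannot be in the future")
    return (today - last_order_day).days


class DaysSinceLastOrderTest(unittest.TestCase):
    def test_counts_whole_days(self):
        result = days_since_last_order(date(2024, 3, 10), date(2024, 3, 1))
        self.assertEqual(result, 9)

    def test_rejects_future_orders(self):
        with self.assertRaises(ValueError):
            days_since_last_order(date(2024, 3, 1), date(2024, 3, 10))
```

Saved to a file, tests like these can be run with `python -m unittest`, so a change that breaks a feature is caught before it reaches the live model.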
A question we get asked a lot is, “How much data do I need?” The best answer is the very unsatisfying, “It depends”, but generally it’s a function of the problem you’re trying to solve, which algorithm you’re using, and how relevant your data are for analyzing this particular problem. You may not know how relevant your data are until you begin modelling. If there is undeniable signal in your data, it can likely be detected with only a few thousand rows of data. While big data can sometimes overcome weak signal, if the data do not actually hold the insight you seek, increasing the amount of noisy data (more rows, same variables) won’t yield better results. Training on large datasets may require distributed computing tools and can slow down the modelling process in the short term, but if the data are useful, having more will create opportunities to strengthen the model. In this case, you’ll likely want to subsample from the larger dataset in the early phases of development for the sake of time and compute resources, then scale once you have a viable solution.
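The subsampling step mentioned above can be sketched in a few lines. This is a minimal example using Python's standard `random` module. The `subsample` helper, the 5% fraction, and the stand-in dataset are all assumptions for illustration, and a fixed seed is used so the same development sample comes back on every run.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random sample holding roughly `fraction` of rows."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = min(len(rows), max(1, round(len(rows) * fraction)))
    return rng.sample(rows, n)


# Hypothetical dataset: 10,000 row identifiers standing in for real records.
full_dataset = list(range(10_000))

dev_sample = subsample(full_dataset, fraction=0.05)
print(len(dev_sample))
```

Once an approach looks viable on the sample, the same pipeline can be pointed at the full dataset.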
Which leads to the question of time: How long should you expect a machine learning project to take? Well, every project will be different; it’s a matter of human resources, domain, compute power, accessibility and size of the data, the ratio of signal to noise, scope of the question, integration, and a host of other variables. The project will take at least weeks, and quite realistically it could take months to build an end-to-end system. Luckily, you will cross milestones of insight along the way that feed your curiosity. Having a strong project manager, ideally someone with prior experience building ML systems, can help things progress in a timely manner and keep expectations aligned.
An invaluable part of any ML project is the involvement of someone with domain expertise. That person doesn’t necessarily have to engage in the data science component of the work, but they should be available as a resource to those who are modelling the data. This can shave significant time off a project and make all the difference in terms of success. Even the most experienced data scientist runs the risk of asking the wrong questions and going down rabbit holes without the advice and intuition of someone steeped in the domain. If the data are derived from a product or may be influential in the future of a product, involve the product owner, whose understanding of the broader context will be indispensable.
The most important skill set you’ll need on the team is that of a data scientist who has experience in machine learning. At a high level, that skill set is a combination of abilities in data engineering, data preparation, exploratory data analysis and visualization, statistical modelling, ML algorithms and their application, software development, distributed computing, and distilling technical insights effectively for business decision-making. These skills can be concentrated in one person or spread across multiple contributors, but in most cases, all are necessary.
Once you’ve thought through each of these foundational questions and assembled an experienced team to execute the analysis, you may be wondering what to expect as the project unfolds, and how to measure its success. We feel strongly that success in a machine learning endeavor can come in many forms, and advise you not to expect an obvious finish line or a silver bullet solution. The best way to position the project for success is to set clear, realistic goals and structure it to be systematically executed with the guidance of someone experienced in ML techniques. You won’t get fireworks every time, but you can carefully ignite a glowing flame and continue to use its heat to power your next inquiry, and then your next one.
Are you wondering whether machine learning techniques could improve how you make decisions? We’d love to brainstorm some use cases with you.
Amber Rivera has an obsession with efficiency to thank for leading her to programming, and an awe for the unpredictable for drawing her to machine learning. An investigator at heart, she brings a passion for rooting out arbitrary decisions to her work as a data scientist.