Extract, Explore, Transform, Model: How Machine Learning Projects Unfold
In How to Win at Machine Learning we designed a blueprint for getting a machine learning project off the ground. In this follow-up, we’ll fill that blueprint in a bit, taking you through various stages of the process that you may experience as your project unfolds. Our advice for your first foray into ML is to use supervised or unsupervised techniques, so we won’t go into more complex deep learning solutions here.
Extract
If you already have a strong data architecture, extracting the data from your store should be straightforward. If not, the data are likely to come from multiple – potentially disparate – sources, so you’ll need someone who can effectively extract, clean, and structure the data to prepare it for analysis. If you are building a long-running analytics system, we recommend you invest in engineering an efficient data extraction and transformation pipeline; this does not have to happen before you start using machine learning, but it will aid ML projects and their integration.
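To make the extraction step concrete, here is a minimal sketch of pulling two disparate sources into one analysis-ready table with pandas. The sources, column names, and join logic are all hypothetical; in a real pipeline the CSVs below would be database queries or API pulls.

```python
# A minimal sketch of combining two hypothetical sources into one table.
import io
import pandas as pd

# Inlined CSVs stand in for real extraction from databases or APIs,
# so the example is self-contained.
users_csv = io.StringIO("user_id,signup_date\n1,2021-01-05\n2,2021-02-11\n")
events_csv = io.StringIO("user_id,clicks\n1,14\n2,3\n3,7\n")

users = pd.read_csv(users_csv, parse_dates=["signup_date"])
events = pd.read_csv(events_csv)

# Left-join events onto known users; the unmatched event (user_id 3) is
# dropped, a cleaning decision worth documenting in a real pipeline.
df = users.merge(events, on="user_id", how="left")
print(df.shape)  # (2, 3)
```

Even a sketch like this makes the cleaning decisions (which rows are kept, how sources are keyed together) explicit and repeatable, which is the point of investing in a pipeline.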
Explore
A good data scientist will not rush through the exploratory step of “interviewing” the data to develop their understanding and intuition. Assess how many and which data are missing, check for duplicates and outliers, evaluate the risk of data leakage if you’re using supervised techniques, and employ visualization to start developing the questions that drive your analysis. You may use statistical modelling to identify any underlying distributions and relationships in the data, which can clue you in to the most appropriate machine learning techniques. After the exploratory data analysis (EDA), review any insights with project stakeholders, and calibrate expectations and next steps based on the initial findings. Next comes the fun part: predictive modelling with machine learning!
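The missing-value, duplicate, and outlier checks above can be sketched in a few lines of pandas. The dataset here is synthetic and the column names are hypothetical; the 2.5-standard-deviation outlier screen is one crude heuristic among many, not a rule.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset; column names are hypothetical.
df = pd.DataFrame({
    "age":  [25, 31, np.nan, 47, 31, 25, 29, 33, 28, 120],
    "plan": ["free", "pro", "pro", "free", "pro",
             "pro", "free", "pro", "free", "free"],
})

# How many and which data are missing?
missing = df.isna().sum()

# Any exact duplicate rows?
dupes = df.duplicated().sum()

# A crude outlier screen: flag values more than 2.5 standard deviations
# from the mean (the threshold is a judgment call, not a rule).
z = (df["age"] - df["age"].mean()) / df["age"].std()
outliers = df.loc[z.abs() > 2.5]

print(missing["age"], dupes, len(outliers))  # 1 1 1
```

Checks like these are the starting point; visualization and domain conversations then turn the anomalies you find into the questions that drive the analysis.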
Transform (and continue exploring)
You’ll prepare the data as feedstock for the machine learning algorithms with a mix of transformations, like imputing missing data, omitting data, extracting additional variables through feature engineering, and reducing the dimensionality of the dataset. Imputation may be as simple as filling gaps with a variable’s median across the dataset, or it could require developing a more custom heuristic. It could even get algorithmic; the most appropriate tactic for imputation will vary based on context and time constraints. If you’re taking a supervised learning approach, by this step you should have a crystal-clear idea of what the target outcome is, and every row in your training dataset should be labeled with that target. Throughout EDA and data transformation, an experienced data scientist will also be assessing the applicability of different algorithms to this particular problem, and designing a process to search for the best solution.
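As a sketch of the simplest case described above, here is median imputation two ways: a one-off pandas fill, and scikit-learn's SimpleImputer, which records the learned median so the same fill value can be reused on future data. The column name and figures are hypothetical.

```python
# A minimal sketch of median imputation; data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"income": [40_000, 52_000, np.nan, 61_000]})

# One-off fill: replace gaps with the column median of this dataset.
filled = train["income"].fillna(train["income"].median())

# Pipeline-friendly fill: fit on training data, then transform any new
# data with the *training* median, avoiding leakage from unseen data.
imputer = SimpleImputer(strategy="median")
imputer.fit(train[["income"]])
new_data = pd.DataFrame({"income": [np.nan, 75_000]})
print(imputer.transform(new_data)[0, 0])  # 52000.0
```

The second form is what you want once the transformation has to run repeatedly inside a pipeline rather than once in a notebook.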
Model
In the majority of cases, you should compare the strength of a few algorithms before choosing the final model. The decision is not a trivial one, but boils down to finding an algorithm that balances strong performance with accomplishing the task in the most straightforward way. You can compare the strength of algorithms by handing your data to each of them and scoring their performance on the task with a technique called cross-validation. The metric used to score performance will vary by context, but some common ones are R-squared for regression, and accuracy, precision, and recall for classification. For classification tasks, quantifying the cost of both a false positive prediction and a false negative prediction can help you to choose the most appropriate metric.
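The compare-with-cross-validation step can be sketched with scikit-learn on a toy dataset. The two candidate algorithms and the accuracy metric here are illustrative choices; swap in precision, recall, or another metric via the `scoring` argument as the costs of false positives and false negatives dictate.

```python
# A minimal sketch of comparing candidate algorithms via cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation on accuracy.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

When two candidates score similarly, the tiebreaker is usually the simpler, more interpretable one, per the "most straightforward way" principle above.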
Almost every algorithm has a set of hyperparameters to tune. Unfortunately, most machine learning libraries will not sufficiently explain each one, nor how they interact. This is where having someone experienced with machine learning is invaluable: they will start by setting the model’s complexity based on experience and intuition, then either iterate as results come in or use grid-search techniques to automate the search. Once you’ve narrowed the field, cross-validation is also useful for optimizing the hyperparameters of the final algorithm, but take care: it can get tricky if you want to compare models and tune your final model using the same dataset (hint: nested cross-validation).
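Here is a minimal sketch of that hint: a grid search over a small hyperparameter grid (the inner loop) wrapped in an outer cross-validation (nested cross-validation), so the reported score is not biased by the tuning itself. The estimator and grid are illustrative.

```python
# A minimal sketch of nested cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: search over a small, hypothetical hyperparameter grid.
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: evaluate the whole tune-then-fit procedure on held-out folds,
# so tuning never sees the data it is scored on.
outer_scores = cross_val_score(grid, X, y, cv=5)
print(round(outer_scores.mean(), 3))
```

A plain grid search scored on the same folds it tuned on would report an optimistic number; the outer loop is what keeps the estimate honest.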
This modelling phase requires patience, research, and even innovation, but it’s when truths get uncovered. Don’t rush it, triple-check your methodology, and remain patient; within a few iterations, you’ll know whether the data provide signal for the question you’re asking. If you’re a developer getting into machine learning, this is the part that will likely give you the most trouble. Here’s our recommendation for a resource to help get you up and running quickly.
The performance requirement for a model is subjective and highly dependent upon the domain, application, and what solution it replaces. By this point in the process, you’ll understand that performance is fundamentally constrained by the relevance of the data, the way you’ve framed the task, and the human and compute resources available. Based on what you learned in the EDA and modelling steps, your definition of success for the project may have shifted. It’s impossible to give universal advice about how to evaluate a machine learning model’s performance, but here is some general guidance:
- If your ad-hoc predictive model only marginally improves upon an efficient solution already in place, it’s probably not worth the risk or the resources to replace it.
- If you get to an MVP model and find that none of the available algorithms perform particularly well on the task, you have reached a pivot point early on, successfully avoiding any misguided data collection or expectations of performance. Instead, reframe the target, bring in additional or different data, or perhaps try deep learning.
- If you build a classification model that can perform in training to at least 0.70 on your chosen scoring metric (e.g. accuracy, precision, recall, f-score), you’ve successfully tapped into some signal, and there are options for how to dig in to diagnose errors and try to squeeze more performance out of the model.
- If you’re getting 0.85 or above on your metric, you’ve successfully greenlighted yourself to start thinking about how to deploy this model into production so you can see how it does in the wild, on completely unseen data.
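To make those thresholds concrete, here is a minimal sketch of scoring a classifier's predictions with the metrics named above. The labels and predictions are hypothetical; in practice `y_true` and `y_pred` would come from your validation folds.

```python
# A minimal sketch of scoring hypothetical classification predictions.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)    # 6 of 8 correct -> 0.75
prec = precision_score(y_true, y_pred)  # 3 TP, 1 FP -> 0.75
rec = recall_score(y_true, y_pred)      # 3 TP, 1 FN -> 0.75
f1 = f1_score(y_true, y_pred)

print(acc, prec, rec, f1)
```

Which of these numbers you hold to the 0.70 or 0.85 bar depends on the relative cost of false positives versus false negatives in your application.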
Once you launch a machine learning model into production, it must be maintained by bringing in fresh data and retraining regularly, no matter the level of performance it hits in development. The frequency of retraining should track with the pace of change in the data. For example, if your model makes predictions about user engagement on a site where new users fall into the same categories as past users, you won't need to retrain as often as you would if the site were being visited by divergent user groups. Ideally, build a real-time performance monitor, and react when you start to see the accuracy of predictions slip. The rate of degradation of a machine learning model will vary widely by context, but you should plan for ongoing maintenance.
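One way such a performance monitor might look is a rolling window of recent prediction outcomes that flags when accuracy slips below a threshold. This is a sketch, not a production design; the class name, window size, and threshold are all hypothetical knobs to tune per project.

```python
# A minimal sketch of a rolling-accuracy monitor for a deployed model.
from collections import deque

class PerformanceMonitor:
    """Track accuracy over the last `window` predictions."""

    def __init__(self, window=100, threshold=0.80):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(1 if prediction == actual else 0)

    def needs_retraining(self):
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = PerformanceMonitor(window=10, threshold=0.8)
for pred, actual in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 1)]:
    monitor.record(pred, actual)
print(monitor.needs_retraining())  # True: only 2 of 5 recent predictions correct
```

The catch in practice is that `actual` often arrives with a delay (or not at all), so the monitor's window and alerting should be sized around how quickly ground truth becomes available.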
The point of data science is to develop a more intimate understanding of your data, and consistently test hypotheses to improve your products and services. Employing machine learning as a method in service of that mission may or may not yield strong predictions, but if you go in with an attitude of curiosity you’ll be sure to extract actionable insight either way.
Amber Rivera has an obsession with efficiency to thank for leading her to programming, and an awe for the unpredictable for drawing her to machine learning. An investigator at heart, she brings a passion for rooting out arbitrary decisions to her work as a data scientist.