While “machine learning” has become a buzzword that is used to describe a wide range of modelling applications, there are fundamental differences between the two that data scientists should keep in mind.
Photo credit: Matan Segev
When does traditional statistical modeling (TSM) become machine learning (ML)?1 “Machine learning” has truly become a buzzword that is applied rather liberally to a wide range of modelling applications. But, the difference is far from a question of semantics: there are fundamental differences between ML and TSM that data practitioners should keep in mind.
But, let’s start off with some commonalities between ML and TSM. In both disciplines our aim is to build a (statistical) model (to use TSM terminology) that minimises loss, that is, that achieves the smallest possible difference between observed values and the values estimated by the model. In so doing, we have to achieve a successful balance between model complexity and generalizability: pick too complex a model and you’ll achieve a great fit on the data that you used to develop it; but your predictive power will be limited on unseen data.
We have successfully achieved our goal when, depending on the application, our model is able to explain a significant proportion of the variation in our dependent variable, achieves a high classification accuracy, or correctly predicts the number of occurrences of a particular phenomenon.
Some of the techniques that we use to achieve our goal - of minimising loss - are similar between ML and TSM too. For example, both rely on a cost function (such as the mean squared error, MSE), and on a way of optimizing that cost function.
In spite of some rather superficial similarities, there are important differences that make ML and TSM distinct disciplines.
Different philosophical foundations
Yet, in spite of these rather superficial similarities, there are important differences that make ML and TSM distinct disciplines. First, their purpose differs. ML is an algorithmic approach that is mostly interested in prediction. Conversely, TSM practitioners focus on the (statistical) significance of individual features (or rather: parameters). In the latter case, we are interested in explaining why x leads to y, rather than purely in constructing a model that predicts the occurrence of y as accurately as possible. Our conclusions about relationships between these variables, in turn, depend on our ability to construct an appropriate statistical model to represent our data.
ML and TSM have fundamentally different philosophical roots. As a branch of artificial intelligence2 and a sub-field of computer science, ML’s main purpose is to “learn” from data, that is, to optimize the weights on features (or: parameters), and to apply that information to generate new predictions. Its approach is inductive: we let the data tell give us the anwers, and do not rely on ex anteexpectations of how feature x may affect phenomenon y. This requires fewer distributional assumptions about our predictor variables, and fewer assumptions about which features are most predictive of our outcome of interest. (Note however that this is not to say that ML is not interested in selecting appropriate features to include! In fact, this is a crucial part of any successful ML application).
By contrast, traditional statistical modeling attempts to use mathematical formulas to formalize the relationship between two or more variables. As such, it is a subfield of mathematics, and is usually applied in a deductive context, that is, a research design where hypotheses are formulated and subsequently put to the test.
There also are important differences in terms of what we might call the “mechanics” of ML and TSM respectively. Let’s again take the example of linear regression. In linear regression, we aim to find the line (in a bivariate set-up) or the plane in a multidimensional space (in the multivariate variant) that minimises the sum of squared errors. Essentially, we draw the best line of fit through a number of observations, as illustrated in the figure below.In addition, a linear model in a machine learning setup would use a different optimisation algorithms such as gradient descent (GD). While TSM applications of linear regression can (and do) use GD, it becomes particularly relevant in an ML context. When we have large numbers of predictors (which usually is the case in ML problems, where we may have thousands of features), we need to use an optimizer that is computationally cheap enough to process large amounts of data. Gradient descent saves a lot of time on calculations compared to calculating parameters analytically (note that some optimization problems may not even have a closed-form solution due to their complexity!). We can significantly reduce our computation time in ML by using GD in mini-batch form, or in particular, in the stochastic variant, where we sample from data and change the model parameters just a little bit after each sampling.
This brings me to yet another difference between ML and TSM: scale.
A discussion of ML is not complete without mentioning “big data”. And indeed, a data scientist will not quickly grab for the ML toolbox unless they are dealing with masses of data, that are generated at great speed, and that are hugely diverse (i.e. unstructured) (for a discussion of the use of big data in political science, see my earlier blog post). When we are dealing with such “big data”, the standard TSM toolbox no longer applies: our data includes thousands of features, and we need an algorithmic approach to make sense of these masses of data.
The importance of assumptions
Finally, in TSM, our modelling strategy depends crucially on a set of assumptions. These include homoscedasticity of error terms, normality of our data-generating distributions, and linearity of functional dependencies. By contrast, ML can be seen as a “distribution-free” approach. We make few assumptions about the way the data is distributed and instead allow the training algorithm tell which models best approximate the data-generating process that underlies our data.3
Your choice between traditional statistics and machine learning should really be informed by the data problem that your facing.
In sum, there are important differences between machine learning and traditional statistical modeling. And, your choice between either should really be informed by the data problem that your facing. For example, TSM is preferable by far if you have a relatively small dataset that consists of structured data, and if your purpose is to identify if and to what extent a (limited) set of variables affect a phenomenon of interest. In other words: use the TSM toolbox when you are interested in taking a deductive approach to your research problem, i.e. when you start from theory and hypotheses, and use empirical data to confirm or reject your theoretical propositions.
By contrast, ML is your go-to strategy if you are dealing with large volumes of unstructured data, and your goal is to predict the extent or occurrence of a phenomenon as accurately as possible. For example: use ML if you are interested in predicting buyer behaviour in a webshop, churn, or employee retention.
This article was also published on the Oxford University Politics Blog.
Further reading and resources
- Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- Google’s Machine learning crash course
1. I use terms from either discipline rather liberally in this post. For example, whereas ML speaks of “weights”, TSM usually refers to “coefficients”. The same goes for “features” vs. “independent variables”. I also do not venture into a discussion of supervised vs. unsupervised learning, and/or deep learning.
2. In turn, a key difference between AI and ML is that the latter does not try to imitate “intelligent” behaviour. Rather, ML is intended to complement and outperform human tasks using the strengths of computers.
3. ML is of course not altogether assumption-free. For one, in ML we assume the training samples the we draw from our distribution are i.i.d. (independently and identically distributed).