I came across two nice written pieces recently.
- Julia Rohrer discusses the importance of deciding on a research question, and a clear estimand, first, and only then choosing a model to estimate.
- John B. Carlin and Margarita Moreno-Betancur point out that there is more to good modeling than getting the best possible fit, and that it’s important to be clear on whether you are doing a descriptive, predictive, or causal analysis.
These and other closely related points have been made for a long time, and I’ve made related points before. I think it’s an underappreciated and somewhat counterintuitive point, so I like to harp on this a little.
When people seek to answer questions from data by using modeling, often they follow a process like this
- Try some set of models and set of features (and maybe engineered features) in the models.
- Pick the model that has the best fit on a held-out test set.
- Interpret the details of the chosen fitted model as descriptions of the underlying data.
For example, if we were trying to model the effect of education on income, maybe we’d build a model and anything else we can measure and think of that might affect income, like parents’ income and education, zip code, industry, and so on. Maybe we’d choose a linear model, and then we might interpret the coefficient on the education variable as the (causal?) effect of education on income.
This seems reasonable, but it’s a bad idea
There are two problems here. One that I will not talk about so much in this post is that we need to think very carefully about our assumptions before we interpret an estimate as causal. Building a model that predicts well just doesn’t cut it.
But a problem that is broader than just causal inference is that estimating effects (explaining) and predicting outcomes are two related but distinct tasks. Doing a good job at one doesn’t necessarily mean you’re doing a good job at the other. A model that does a really good job of predicting income isn’t necessarily good at estimating the effect of education on income, and vice-versa.
The intuitive reason why is that a data set only contains so much information, and we want to channel that information as fully as possible into answering the question we are trying to answer. Methods like targeted maximum likelihood estimation, which explicitly modify a fitted model’s predictions in order to better estimate a particular estimand, demonstrate this point particularly clearly.