In another post, I described how I fit a model to predict how well Buffer—a digital subscription-based firm that publicly releases much of its financial data—retains its customers. I used a methodology developed by Daniel McCarthy, Peter Fader, and Bruce Hardie (paper available for free download here) to fit the model. In this post, I’ll fill in some of the details and derivations that weren’t covered in the paper to explain why the model works and where it comes from.
Survival Analysis
To predict how many customers will be retained and how many will churn each month, McCarthy, Fader, and Hardie advocate using a technique from survival analysis. Survival analysis is a set of techniques used to predict how long something will “survive:” for example, how long a tire will last until it wears out, how long someone will live in a city, how old a person will be when they get married, how long a disease will persist before it is cured, or—in this case—how long a customer will stay with a firm before canceling their contract. In this post I’ll take some time to talk about survival analysis and the model that the authors advocate using. Some of the terms (like “survival”) are a bit misleading in the context of customer retention, because they describe predictions for survival times from disease. But these models have many uses in other contexts, and I’ll explain what the terms mean in this context. Also, throughout this post I’ll mostly borrow the notation used by McCarthy, Fader, and Hardie.
Probably the most basic model in survival analysis1 is the Cox proportional hazards model. This model starts from the assumption that each customer \(j\) has a baseline probability of churning \(\lambda_R^{(j)}(t)\)2 at any given time \(t\). This probability can be increased or decreased depending on the values of certain relevant characteristics \(X_{R}\). (In McCarthy et al.’s model, these are characteristics that apply to the whole customer base, not just individual customers; I’ll elaborate later.) These covariates are assumed to remain constant over time in the Cox proportional hazards model. With the baseline probability of churning along with these characteristics, the model estimates a probability of churning at a time \(t\) for individual \(j\), \(\lambda_R^{(j)}(t \mid X_R)\).
In the Cox proportional hazards model, the assumed relationship between these variables is as follows:
\(\log \big(\lambda_R^{(j)}(t \mid X_R) \big) = \log \big(\lambda_R^{(j)}(t) \big) + \sum_{k=1}^p X_{Rk} \beta_{Rk} = \log \big(\lambda_R^{(j)}(t) \big) + X_{R}\beta_R \)which yields
\(\lambda_R^{(j)}(t \mid X_R) =\lambda_R^{(j)}(t) \cdot e^{\sum_{k=1}^p X_{Rk} \beta_{Rk}} = \lambda_R^{(j)}(t)\exp(X_{R}\beta_R) \)where the variables have the following meanings in this context:
- \(j\) is a number assigned to each individual customer.
- \(t\) is a variable keeping track of time.
- \(X_R\) is a vector of covariates (with a total of \(p\) covariates that are measured).
- \(\lambda_R^{(j)}(t \mid X_R)\) represents the probability that customer \(j\) will churn at time \(t\). This is called the hazard function.
- \(\beta_R\) is a vector of coefficients that correspond to the variables in \(X_R\).
- \(\lambda_R^{(j)}(t)\) is the baseline probability of customer \(j\) churning at time \(t\), meaning the probability that a customer will churn when all of the covariates in \(X_R\) equal 0.
The proportional hazard function gets its name from the fact that the hazard is proportional to the baseline hazard \(\lambda_R^{(j)}(t)\), with the constant of proportionality being the covariate term \(\exp(X_{R}\beta_R)\).
The survival function is the probability that a customer has stayed with the company until at least time \(m’\) (in other words, it is the probability that a customer will not churn until after time \(m’\)). It turns out that if we assume that the baseline hazard \(\lambda_R^{(j)}(t)\) doesn’t change for cohorts of customers acquired at different times, we can obtain the survival function for a customer \(j\) who starts their subscription at time \(m\) from an integral involving the hazard function:
\(S_R^{(j)}(m’ – m \mid m, X_R) = \exp ( – \int_{0}^{m’ – m} \lambda_R^{(j)}(t \mid X_R) dt) \)(This represents the probability that a customer will maintain their subscription until at least time \(m’\) given that they were acquired at time \(m\)–that is, the probability that their subscription duration will be at least \(m’ – m\)). In the case of the Cox proportional hazard model, this works out to be
\(S_R^{(j)}(m’ – m \mid m, X_R) = \exp \big( – \int_{0}^{m’ – m} \lambda_R^{(j)}(t) \exp(X_{R}\beta_R) dt \big) = \exp \big(\exp(X_{R}\beta_R) \int_{0}^{m’ – m} -\lambda_R^{(j)}(t) dt \big) \)(Since the covariates in \(X_R\) are assumed to be constant over time, the term involving them can be pulled out of the integral.) At this point we can’t proceed without specifying the baseline function \(\lambda_R^{(j)}(t)\). The simplest assumption is that it is constant and identical for all customers–the risk of a customer canceling their contract is always the same, at any time for any customer, all else in \(X_R\) being equal. If \(\lambda_R^{(j)}(t) = \lambda_R\) (a constant), then we have
\(S_R^{(j)}(m’ – m \mid m, X_R) = \exp \big[- \lambda_R (m’ – m) \exp(X_{R}\beta_R) \big] = \exp \big[- \lambda_R (m’ – m) \big] \exp \big[\exp(X_{R}\beta_R) \big] \)One way to interpret this function is that we have a baseline survival function \(\exp \big[- \lambda_R (m’ – m) \big]\) which predicts the probability that a customer acquired at time \(m\) will be retained until at least time \(m’\) when the covariates all equal zero, and this probability is multiplied by a factor \( \exp \big[\exp(X_{R}\beta_R) \big] \) which can either increase or decrease the probability the customer will be retained for that long in a proportional way (that is, it extends or contracts the expected length of the customer relationship). Because the baseline survival function is a simple exponential function, this is called a proportional hazard model with an exponential baseline.
For customer retention, there is a lot of evidence that the risk of a customer churning usually isn’t constant over time. An alternative model that allows for the possibility of a non-constant baseline hazard \(\lambda_R^{(j)}(t)\) instead uses a Weibull distribution as a baseline. In this form, the baseline hazard function is
\(\lambda_R^{(j)}(t) = c_R \lambda_R^{(j)} t^{c_R-1} \)where \(c_R\) and \(\lambda_R^{(j)}\) are constants. Note that if \(c_R=1\), then the \(t\) term goes away and \(\lambda_R^{(j)}(t) =\lambda_R^{(j)}\) is constant, so in this case the model reduces to an exponential baseline. If \(c_R>1\), then the hazard increases over time–that is, customers are more likely to churn the longer they have been with the firm, all else being equal. If \(c_R<1\), then customers are less likely to churn the longer they have been with the firm.
With a Weibull baseline, the survival function looks as follows:
\(S_{R}^{(j)}(m’ – m \mid m, X_R) = \exp \big[ – \int_{0}^{m’ – m} c_R \lambda_R^{(j)} t^{c_R-1} dt \exp(X_{R}\beta_R) \big] = \exp \big[ – c_R \lambda_R^{(j)} \frac{{(m’ – m)}^{c_R}}{c_R} \exp(X_{R}\beta_R) \big] \) \(= \exp \big[ -\lambda_R^{(j)} {(m’ – m)}^{c_R} \big] \exp \big[ \exp(X_{R}\beta_R) \big] \)Again, it is clear that if \(c_R=1\) we have the proportional hazard model with an exponential baseline.
The covariates the authors use in \(X_R\) are dummies for which quarter the data comes from (to account for seasonality) as well as other time-varying covariates (such as whether the monthly data was collected during an economic recession). Because these covariates \(X_R\) vary over time, the model with a Weibull baseline is
\(S_R^{(j)}(m’ \mid m, X_R(t)) = \exp \big[ – \int_0^{m’ – m} c_R \lambda_R^{(j)} t^{c_R-1} \exp \big(X_R(t)\beta_R \big) dt \big]\)where \(X_{i}(t)\) represents the covariates for individual \(i\) at time \(t\), which must be included in the integral. Again, this introduces a complication, because the integral depends on the the form of the time-varying function \(X_R(t)\beta_R\). To resolve this issue, the authors use an approximation.
In this methodology, the authors work with data where churn is only observed once per month (at the end of every month), not continuously3. Because of that, they calculate their model under the assumption that \(X_R(t)\beta_R\) is piecewise constant–it retains the same value every month for the entire month. Then the above formula can be written
\(S_R^{(j)}(m’ \mid m, X_R(t)) = \exp \big[ – \sum_{i = m + 1}^{m’} \int_{i – m – 1}^{i – m} c_R \lambda_R^{(j)} t^{c_R-1} \exp \big(X_R(i)\beta_R \big) dt \big] = \exp \big[ – \sum_{i = m + 1}^{m’} \big( \exp \big(X_R(i)\beta_R \big) \int_{i – m – 1}^{i – m} c_R \lambda_R^{(j)} t^{c_R-1} dt \big) \big] \) \(= \exp \big[\sum_{i = m + 1}^{m’} -\lambda_R^{(j)} \big( {(i – m)}^{c_R} – (i – m – 1) ^{c_R} \big) \exp(X_{R}(i)\beta_R) \big] = \exp \big[-\lambda_R^{(j)} \sum_{i = m + 1}^{m’} \big({(i – m)}^{c_R} – (i – m – 1) ^{c_R} \big) \exp(X_{R}(i)\beta_R) \big] \)where the \(\exp \big(X_R(i)\beta_R \big)\) term can be taken out of the integral because at any given \(i\) it is fixed. At this point, since this formula is starting to get a bit unwieldy (and is about to get a little more complicated!), the authors use equation \((13)\) to simplify the expression:
\(B_R(m, m’) = \sum_{i = m + 1}^{m’} \big({(i – m)}^{c_R} – (i – m – 1) ^{c_R} \big) \exp(X_{R}(i)\beta_R) \) \( S_R^{(j)}(m’ \mid m, X_R(t)) = \exp \big[-\lambda_R^{(j)} B_R(m, m’) \big] \)The last expression above is the same as the one above equation \((13)\) in the paper. At this point, we could assume \(\lambda_R^{(j)}\) is the same for all customers; that is, \(\lambda_R^{(j)} = \lambda_R\) (a constant) and simply fit the data to this model, the parameters to fit being \(\lambda_R\), \(c_R\), and \(\beta_R\).
However, past research shows that there tends to be heterogeneity in the baseline probability of churning across customers, and that models that do not allow for such heterogeneity tend to perform poorly. An analyst who is working with a firm on a model to predict customer churn may have access to customer-level characteristics \(Z_j\) which might affect customer \(j\)‘s baseline hazard. In that case, the analyst could fit a model for each individual’s baseline hazard based on their individual characteristics:
\(\lambda_R^{(j)} = \lambda_R \exp(Z_j \gamma)\)where \(\lambda_R\) is a constant and \(\gamma\) is a vector of coefficients corresponding to the features in \(Z_j\). Then the model would become
\( S_R^{(j)}(m’ \mid m, X_R(t)) = \exp \big[-\lambda_R \exp(Z_j \gamma) B_R(m, m’) \big] \) \(= \exp \big[-\lambda_R \exp(Z_j \gamma) \sum_{i = m + 1}^{m’} \big({(i – m)}^{c_R} – (i – m – 1) ^{c_R} \big) \exp(X_{R}(i)\beta_R) \big]\) \(= \exp \big[-\lambda_R \sum_{i = m + 1}^{m’} \big({(i – m)}^{c_R} – (i – m – 1) ^{c_R} \big) \exp(X_{R}(i)\beta_R +Z_j \gamma) \big]\)But in this methodology we do not have access to customer-level data, just the aggregated data provided publicly by the firm. Accordingly, instead of modeling individual-level baseline hazards, we can account for heterogeneity in baseline hazard by assuming the population of customers have a baseline hazard that fits a certain distribution and then estimate the parameters describing that distribution. Then we would obtain the final aggregate retention model across all customers by integrating the model above across the distribution for \(\lambda_R^{(j)}\) (which can only take on positive values):
\(S_R(m’ \mid m, X_R(t)) = \int_0^\infty \exp \big[-\lambda_R B_R(m, m’) \big] \cdot f(\lambda_R) d \lambda_R\)where \(f(\lambda_R)\) is the probability density function of \(\lambda_R\). McCarthy, Fader, and Hardie choose the gamma distribution; that is, they assume \(\lambda_R^{(j)}\) follows a gamma distribution with parameters \((r_R, \alpha_R)\). In that case, this integral works out to
\(S_R(m’ \mid m, X_R(t)) = \bigg[ \frac{\alpha_R}{\alpha_R +B_R(m, m’)} \bigg]^{r_R} \)which is equation \((14)\) in the paper.
(Note: in my model, I did not see enough seasonal variation in Buffer’s retention rates to justify including seasonal dummies. I did include a dummy for the one-time surge in the acquisition and retention rates during two months in late 2013 to avoid having these observations skew the rest of my model. I assume that these outlier months were caused by something like a short-term promotion, although unfortunately after looking into public statements from Buffer I was unable to learn the cause. Going forward, a good next step would be to find out what caused the change in numbers, in case the cause comes back in the future, so i am able to create a more robust model.)
Residual Lifetime
As I will describe in a future post, I used my retention model to estimate the residual lifetime for different cohorts of customers. McCarthy, Fader, and Hardie provide the common formula for residual lifetime for a customer who was acquired in month \(m\) and still has their subscription active as of month \(M\):
\(\big[S_R(M – m \mid m, X_R(t)) \big]^{-1} \sum_{i=0}^\infty S_R(M + i – m \mid m, X_R(t))\)I computed this quantity for several cohorts by adding the first 1200 terms of this series (which means looking forward 1200 months, or 100 years. I found that additional terms afterward didn’t have much of an effect on expected residual lifetime, which is unsurprising–100 years is a plenty long time to look into the future!).
- For more detail on proportional hazard models, I found these notes very useful.
- Throughout this post and the paper, the subscripts R’s refer to “retention.”
- Actually, in the authors’ model churn is only observed quarterly, and from this they estimate monthly churn numbers. Luckily, in my data from Buffer, I can directly observe monthly churn numbers, so I don’t have to deal with this additional complication.