Swan Intelligence

a practitioners view on data science

Predicting Churn

In mobile and web analytics churn describes the users who stop using a product. In a paid subscription service it's just the number of users cancelling the subscription. For free products churn is often approximated by the number of users who don't come back for a certain amount of time. 1

Especially when you grow fast, it is important to understand your churn. Because churn lags behind growth, but will really kick in after a phase of strong growth. It is the reason why a constant inflow of new users will not make a user base grow beyond a certain size.

For example assume that 30% of your users churn each month and you have 1 million new users per month. Then you will never reach 3.5 million users.

Churn is a combination of different types of users stopping to use a product after different usages. Cara might have been addicted to Candy Crush for a long time and could only abandoned it for her love of connecting dots. Carl on the other hand really likes candy and would never crush it. That's why he only shortly checked out the app in disgust.

It makes intuitive sense that long term users will be more loyal than a user who just installed your app because he liked the emojis in your new Facebook ad. But how much more loyal are your long term users? That depends on your product. Your historic usage data will tell you. Follow the cohort of users that started using your product in a given month and observe how many of them are still around in the following months. This leads to a curves like this:

curve

The different cohorts in this chart all show similar behaviour. The curves are all pretty close. And most of the difference is a result of the difference in the first month. The slopes of the curves in the later months are very similar. The slope represents the fraction of users churning in a given month. The main takeaway from this is:

The monthly churn of a cohort in a given month is similar to the monthly churn of the previous monthly cohort in the previous month.

Technically these curves do not only represent churn. If users get more likely to skip a month over time, this can lead to a downwards slope without churn. But in most cases this is rare enough for us regard these as churn curves. Of course the above only hold true, if there are no big changes to the product. If a new version introduces a bug, the churn curves can get much steeper from one month to the other.

If you know of a factor that has a significant impact on the likelihood of users to churn, it might make sense to split by this factor. At Clue for example we see different churn curves depending on the home country of a user. To predict our churn, we look at the monthly churn curves per country and this gives us a very good prediction of our actual churn.

Let's look at the simplest case in detail. We only use monthly churn curves and don't split by any other variable. Let's count the time in month since the product launch. Let $m$ denote the month a cohort signed up in and $C_m(t)$ be the number of users left at time $t$. We can now predict the churn we will see in August ($t=t_a$) already at the end of June ($t=t_j$). We split all the active users in June $A_{t_j}$ into cohorts $C_m(t_j)$ by the month they signed up in.

\[ A_{t_j} = \sum_{m=0}^{t_j} C_m(t_j) \]

The churn factor $f_{m}(t)$ we will see for these cohorts can be approximated by the previous cohorts churn in the previous month.

\[ f_{m}(t) \approx f_{m-1}(t-1) = 1 - \frac{C_{m-1}(t_j)}{C_{m-1}(t_j-1)} \]

The churn $c(t_a)$ in August can then be calculated by the sum of the churn of the cohorts.

\[ c(t_a) \approx \sum_{m=0}^{t_j} C_m(t_j) \cdot f_{m}(t) = \sum_{m=0}^{t_j} C_m(t_j) \cdot \big ( 1 - \frac{C_{m-1}(t_j)}{C_{m-1}(t_j-1)} \big ) \]

You can use this method to predict further ahead. But this leads to more uncertainty for two reasons. It will get more likely to see a substantial change in the shapes of the churn curves. And secondly you also have to factor in new users for the next month, which are not known yet.


  1. This amount of time should be defined depending on the user behavior. It should be so long that a user not coming back for this time has only a low chance of ever coming back. It should be so short that you can soon decide whether to count a user as churned or not. It's a trade off. 

Comments