Select Page

Create Better Data Science Projects With Business Impact: Churn Prediction with R

Getting a job isn’t easy, you need to set yourself apart.

It’s not enough to just show that you know how to use a tool like scikit-learn or ggplot2, that’s a given.

So the question is, what can you do?

One of my favorite strategies is showing a business impact.

What do I mean by “business impact?”

At the end of the day, you’re going to directly or indirectly work on something that is geared towards the business side of things.

That might be reducing cost, increasing revenue, improving customer experience, and so on.

In this post I’ll show you step-by-step how to build a customer churn predictive model that shows a significant business impact.

Here’s a quick outline:

  • Business Background
  • Logistic Regression
  • Preparing The Data
  • Fitting The Model
  • Making Predictions
  • Business Impact

Business Background

At the beginning of any real-world data science project you need to start by asking a series of questions.

Here’s a few good ones to start with:

  • What’s the problem you’re trying to solve?
  • What’s your potential solution?
  • How will you evaluate your model?

Continuing with the customer churn idea from before, let’s assume that you work in the telecommunications industry.

One day your boss comes to you and asks, “how can we improve the business using our existing data?”

business data

Let’s take a look at answering each question in detail to come up with a strategic approach for tackling the boss’s question.

What’s the problem you’re trying to solve?

After looking at your existing data, you’ve identified that it’s about five times more expensive to acquire new customers than retain existing ones.

Now the more focused question becomes, “how can we increase customer retention rate to decrease cost?”

What’s your potential solution?

Since we have access to existing customer data, we can try and build a machine learning model to predict customer churn.

To avoid over complicating things with something too abstract to explain to management, we’ll use a logistic regression model.

How will you evaluate your model?

We’ll use a series of machine learning evaluation metrics (ROC, AUC, sensitivity, specificity) as well as business oriented metrics (cost savings).

ROC curve

Logistic Regression

Now that we’re familiar with the business background, and we’ve scoped our problem, let’s look at our potential solution.

There’s numerous machine learning models we could attempt to use. All have pros and cons.

To keep things simple, let’s just stick to logistic regression.

Logistic regression is a linear classifier. Since we’re trying to predict “churn” or “didn’t churn” a classification model is exactly what we need.

decision boundary

This is a great model because it’s easier to interpret than a lot of other models, like Random Forest. What I mean by “interpret” is that we can more easily see the relationship between the features and the output.

The downside to logistic regression is that it has a bias towards linear fits. If the decision boundary isn’t linear, it might not perform as well as a model like Random Forest.

Basically we’re trading off interpretability for flexibility. This is always a consideration when implementing machine learning models.

Preparing The Data

The next step is to prepare the data.

This workflow will vary from project to project, but for our example, I’m going to use the following workflow:

  1. Import the data
  2. Take a quick look
  3. Clean the data
  4. Split the data

I’ve left out a full exploratory phase in the workflow. In a future post I’ll go over that aspect because I think it’s equally, if not more important, than the modeling phase.

1) Import the data

Let’s start by importing the data. You can head on over to our github page to grab a copy.

We’ll be taking advantage of the Tidyverse library. This is a must-have library for data scientists using R. Be sure to check out their documentation.

library(tidyverse)

# setting the working directory
path_loc <- "C:/Users/Jonathan/Desktop/post"
setwd(path_loc)

# reading in the data
df <- read_csv("Telco Data.csv")

I like to set the path to my working directory at the beginning. Be sure to modify your path_loc variable to the working directory where your code and dataset will be located.

Since this is a csv file, I used the read_csv function to read the data into a dataframe df.

2) Take a quick look

After you import the data you’ll want to take a quick look at it. I think of a “quick look” as just that, as opposed to exploring the data.

I like to start by looking at the dimensions and the names of the feature columns.

# dimensions of the data
dim_desc(df)

# names of the data
names(df)

 

> # dimensions of the data
> dim_desc(df)
[1] "[7,043 x 21]"
> 
> # names of the data
> names(df)
 [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"          "Dependents"       "tenure"          
 [7] "PhoneService"     "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
[13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"         "PaperlessBilling" "PaymentMethod"   
[19] "MonthlyCharges"   "TotalCharges"     "Churn"  

 

We can see there’s 21 features and 7,043 rows. There’s various features such as TotalCharges and tenure . The output that we’re going to try and predict is “churn”.

Another great function is glimpse. This allows us to take a quick look at our features in more detail.

# taking a look at the data
glimpse(df)

 

> glimpse(df)
Observations: 7,043
Variables: 21
$ customerID        "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU", "9305-CDSKC", "1452-K...
$ gender            "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "...
$ SeniorCitizen     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
$ Partner           "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No...
$ Dependents        "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No"...
$ tenure            1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71, 10, 21, 1, 12, 1, 58, 4...
$ PhoneService      "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ MultipleLines     "No phone service", "No", "No", "No phone service", "No", "Yes", "Yes", "No phone service",...
$ InternetService   "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber optic", "Fiber optic", "DSL", "Fiber opti...
$ OnlineSecurity    "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No internet servic...
$ OnlineBackup      "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No internet service"...
$ DeviceProtection  "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service"...
$ TechSupport       "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No", "No internet service", ...
$ StreamingTV       "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No internet service",...
$ StreamingMovies   "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service", ...
$ Contract          "Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month", "Month-to-mon...
$ PaperlessBilling  "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes",...
$ PaymentMethod     "Electronic check", "Mailed check", "Mailed check", "Bank transfer (automatic)", "Electroni...
$ MonthlyCharges    29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15, 49.95, 18.95, 100.35...
$ TotalCharges      29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.40, 301.90, 3046.05, 3487.95, 587.45,...
$ Churn             "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "N...

 

3) Clean the data

We’ll need to do some minor cleaning before we can start using a logistic regression model.

Lets start by mutating the character variables to factors.

# changing character variables to factors
df <- df %>% mutate_if(is.character, as.factor)

 

In this code we’re using dplyr’s mutate_if function to change character variables to factor types.

The %>% is known as a pipe operator. It comes from the magrittr library which is also part of the Tidyverse.

The basic idea of this operator is that it makes your code more readable. You can learn more about the pipe operator here.

Lets also change the SeniorCitizen variable from an integer to a factor.


# changing SeniorCitizen variable to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)

 

Next, we’ll check for missing values. We can do this with the map function from the purrr library. If you haven’t already guessed, this library is also part of the Tidyverse.

# looking for missing values
df %>% map(~ sum(is.na(.)))

 

> df %>% map(~ sum(is.na(.)))
$`customerID`
[1] 0

$gender
[1] 0

$SeniorCitizen
[1] 0

$Partner
[1] 0

$Dependents
[1] 0

$tenure
[1] 0

$PhoneService
[1] 0

$MultipleLines
[1] 0

$InternetService
[1] 0

$OnlineSecurity
[1] 0

$OnlineBackup
[1] 0

$DeviceProtection
[1] 0

$TechSupport
[1] 0

$StreamingTV
[1] 0

$StreamingMovies
[1] 0

$Contract
[1] 0

$PaperlessBilling
[1] 0

$PaymentMethod
[1] 0

$MonthlyCharges
[1] 0

$TotalCharges
[1] 11

$Churn
[1] 0

 

We can see that TotalCharges has 11 missing values. To replace these missing values we’ll just do a simple replacement with the median.

# imputing with the median
df <- df %>% 
  mutate(TotalCharges = replace(TotalCharges,
                                is.na(TotalCharges),
                                median(TotalCharges, na.rm = T)))

 

For a more technically rigorous approach to replacing missing values, check out this source.

The last thing we’ll do is remove the CustomerID feature. This is a unique identifier for each customer, so it’s probably not going to add any helpful information to our model.

# removing customerID; doesn't add any value to the model
df <- df %>% select(-customerID)

 

If you want to test this, you can always just leave this feature in and see how the results compare.

4) Split the data

To make sure that we’re not overfitting our model, we’ll split the data into a training set and a test set. This is known as cross validation.

We’ll train the model on the training set, and then test out its performance on the test set. We’ll randomly select 75% of the data to be our training set, and 25% to be our test set. I encourage you to try different splits to see how your results compare (maybe 80%/20% and 60%/40% for your train/test sets).

To create the splits we’ll use the Caret package. Lets start by importing Caret and setting a seed so that our results are reproducible.

library(caret)

# selecting random seed to reproduce results
set.seed(5)

 

Next, we’ll use the createDataPartition function to select 75% of our data to use in the training set. This will select a random sample of 75% of the rows.

# sampling 75% of the rows
inTrain <- createDataPartition(y = df$Churn, p=0.75, list=FALSE)

 

Finally, we’ll create our train and test dataframes using the sample of rows from above.

# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]

There’s other methods such as k-fold cross-validation, so be sure to read up on those as well.

For information on how to implement k-fold cross-validation in Caret, type in help(“createFolds”) in to the R console.

Fitting The Model

Now that we’ve split the data in to training and test sets, it’s time to fit the model.

To implement a logistic regression model, we’ll use the generalized linear models (GLM) function, glm.

There’s different types of GLMs, which includes logistic regression. To specify that we want to perform a binary logistic regression, we’ll use the argument family=binomial.

Here’s the full code for fitting the logistic regression model:

# fitting the model
fit <- glm(Churn~., data=train, family=binomial)

In the next section we’ll make predictions and evaluate our model.

Making Predictions

Now that we’ve fit our model, it’s time to see how it performs.

To do this, we’ll make predictions using the test data set. We’ll pass in the fit model from the previous section. To predict the probabilities we’ll specify type=”response”.

# making predictions
churn.probs <- predict(fit, test, type="response")
head(churn.probs)
> head(churn.probs)
         1          2          3          4          5          6 
0.32756804 0.77302887 0.56592677 0.20112771 0.05152568 0.15085976 

We’ll want to convert these probabilities to a binary response since the Churn variable we’re predicting is “Yes” or “No”.

I’m not sure how R is coding the response, but I can quickly check this using the contrasts function.

# Looking at the response encoding
contrasts(df$Churn)
> contrasts(df$Churn)
    Yes
No    0
Yes   1

Looking at the result, we can see that Yes is being encoded using 1.

Now that we know the response encoding, we can convert our predicted results to Yes and No responses.

We’ll set the response threshold at 0.5, so if a predicted probability is above 0.5, we’ll convert this response to Yes.

The final step is to convert the character responses to factor types. This is so that the encoding is correct for the logistic regression model.

# converting probabilities to "Yes" and "No" 
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)

We’ll take a closer look at the threshold later, so don’t worry about why we set it at 0.5. For now, we’ll just use this value for purposes of this example.

An important part of making predictions is evaluating and validating the model.

Let’s take a detailed look at some evaluation metrics, and finish up with a more rigorous approach to model validation, k-fold cross-validation.

Evaluating The Model

After we’ve made predictions, it’s time to evaluate our model.

A great tool to quickly do this is using the confusionMatrix function from Caret.

We’ll feed in our glm.pred array of predictions, as well as the actual results from test$Churn.

Finally, we’ll specify the positive class as “Yes” using positive=”Yes”.

# creating a confusion matrix
confusionMatrix(glm.pred, test$Churn, positive = "Yes")
> confusionMatrix(glm.pred, test$Churn, positive = "Yes")
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1165  205
       Yes  128  262
                                          
               Accuracy : 0.8108          
                 95% CI : (0.7917, 0.8288)
    No Information Rate : 0.7347          
    P-Value [Acc > NIR] : 4.239e-14       
                                          
                  Kappa : 0.4877          
 Mcnemar's Test P-Value : 3.117e-05       
                                          
            Sensitivity : 0.5610          
            Specificity : 0.9010          
         Pos Pred Value : 0.6718          
         Neg Pred Value : 0.8504          
             Prevalence : 0.2653          
         Detection Rate : 0.1489          
   Detection Prevalence : 0.2216          
      Balanced Accuracy : 0.7310          
                                          
       'Positive' Class : Yes     

This function produces both a confusion matrix, as well as other statistics of interest.

A confusion matrix displays how many correct and incorrect predictions were made for each class.

Here’s a quick look at the confusion matrix from our model:

We can see that the model correctly predicted “No” 1165 times, and incorrectly predicted “No” when the actual response was “Yes” 205 times.

Likewise, the model correctly predicted “Yes” when the actual answer was “Yes” 262 times. At the same time, it incorrectly predicted “Yes” when the actual response was “No” 128 times.

Our overall accuracy is 81%. A simple baseline model is to predict the majority class, which in our case is “No”.

If we just predicted the majority class, our accuracy would be 73%. There’s 1760 observations in the test set, and 1293 are “No”. If you divide the 1293 by 1760 that’s how you arrive at 73% accuracy.

Some other useful metrics are sensitivity and specificity. Since our classes are slightly imbalanced (~73%=“No”, ~27%=“Yes”) these metrics can be more useful.

The sensitivity indicates the “True Positive” rate. This is a measure of how accurately we predicted the positive class, which in our case is “Yes”.

The confusionMatrix function reports this directly, but if you want to calculate it yourself, divide the “True Positives” by the sum of the “True Positives” and “False Negatives”. Here’s a visual to make that clearer:

sensitivity

Another useful metric is the specificity, which indicates the “True Negative” rate. This is a measure of how accurately we predicted the negative class. Here’s a visual explaining specificity:

specificity

Another useful metric is the Area Under the Receiver Operating Characteristic (ROC) Curve, also known as AUC.

Initially implemented during World War II for analyzing radar signals, the ROC is a plot of the True Positive Rate vs. the False Positive Rate.

Going back to our original model, I stated that I would use 0.5 as the threshold for making “yes” (or positive) predictions. I didn’t really have a good justification for picking 0.5 though.

The ROC is a great tool because it plots the TPR vs. the FPR as the threshold is varied. We can make a plot of the ROC curve using the ROCR library. Here’s the full R code:

library(ROCR)
# need to create prediction object from ROCR
pr <- prediction(churn.probs, test$Churn)

# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

ROC

As mentioned before, another useful measure is the area under the ROC curve, known as the AUC.

The AUC can take on any value between 0 and 1, with 1 being the best. This is a convenient way to boil down the ROC to a single number for evaluating a model.
Here’s the R code for evaluating our model:

# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
> auc
[1] 0.8481338

Our model has an AUC of 0.85, which is pretty good. If we were to just make random guesses, our ROC would be a 45 degree line. This would correspond to an AUC of 0.5.

We’re outperforming random guessing, so we know that our model is at least adding some value!

K-fold Cross Validation

Now that we’ve trained, tested, and evaluated our model, let’s evaluate the model a little more rigorously.

We split our dataset in to training and test datasets, which is a great way to try and prevent overfitting.

An even better approach is using K-fold Cross Validation.

k-fold cross validation

In this model validation technique, we randomly partition the data in to test and training sets by specifying a certain number of “folds”. In the example above, the number of folds is k=4.

After we run the model on each fold, we average the evaluation metric from each one. So if we ran the model four times using ROC, we would average each of the four ROC values together. This is a great way to try and prevent overfitting a model.

A common number of folds to use is 10, so that’s what we’ll use with our data. We’ll also repeat the process 3 times, just to add a little bit more technical rigor to our approach. Here’s the full R-code:

# setting a seed for reproduceability
set.seed(10)

# changing the positive class to "Yes"
df$Churn <- as.character(df$Churn)
df$Churn[df$Churn == "No"] <- "Y"
df$Churn[df$Churn == "Yes"] <- "N"
df$Churn[df$Churn == "Y"] <- "Yes"
df$Churn[df$Churn == "N"] <- "No"

# train control
fitControl <- trainControl(## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated 3 times
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary)

# logistic regression model
logreg <- train(Churn ~., df,
                method = "glm",
                family = "binomial",
                trControl = fitControl,
                metric = "ROC")
logreg
> logreg
Generalized Linear Model 

7043 samples
  19 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 6338, 6339, 6338, 6339, 6339, 6339, ... 
Resampling results:

  ROC        Sens       Spec   
  0.8455546  0.5500297  0.89602

We start by setting up k-fold CV settings using the trainControl function. All of the inputs are pretty straightforward. As I mentioned previously, we’re using 10 folds, repeated 3 times.

Next we train our model. As we did earlier, we’re using the “binomial” family from the “glm” method. To evaluate our model, we’ll use “ROC”. The model is actually reporting AUC, but the way we specify that in the train function is with metric=”ROC”.

You’ll also notice that I changed the positive class to “Yes” right before the code for the trainControl function. I did this so that I could compare the sensitivity and specificity with our previous results. This is just a minor quirk in the function, so don’t get hung up on this.

The results are similar to what we got previously.

As before, our AUC is 0.85. This is reported in the output as ROC, but it’s actually the AUC.

The true positive rate (sensitivity) is 0.55 and the true negative rate (specificity) is 0.90.

Business Impact

So far we’ve used k-fold cross validation and logistic regression to predict customer churn.

We’ve looked at useful evaluation metrics such as AUC, sensitivity, and specificity.

That’s all well and good, but now what?

If we went to the CEO and presented these results alone, he or she would say “so what?”

Our actual goal with developing this model is to show a business impact. In our case, this would be cost savings.

Let me walk through how we would show a cost savings, step-by-step.

To start, let’s make some assumptions about some various costs.

Let’s assume that customer acquisition cost in the telecom industry is approximately $300. If we make a prediction that a customer won’t churn, but they actually do (false negative, FN), then we’ll have to go out and spend $300 to acquire a replacement for that customer.

Now let’s assume that it’s 5 times more expensive to acquire a new customer rather than retain an existing one. If we predict that a customer will churn, we’ll need to spend $60 to retain that customer.

Sometimes we’ll correctly predict that a customer will churn (true positive, TP), and sometimes we’ll incorrectly predict that a customer will churn (false positive, FP). In both cases, we’ll spend $60 to retain the customer.

Finally, there’s the scenario where we correctly predict that a customer won’t churn (true negative, TN). In this scenario we won’t spend any money. These are happy customers that we’ve correctly identified as happy.

Here’s a quick summary of the costs:

  • FN (predict that a customer won’t churn, but they actually do): $300
  • TP (predict that a customer would churn, when they actually would): $60
  • FP (predict that a customer would churn, when they actually wouldn’t): $60
  • TN (predict that a customer won’t churn, when they actually wouldn’t): $0

If we multiply the number of each prediction type (FN, TP, FP, and TN) by the cost associated with each and sum them up, then we can calculate the total cost associated with our model. Here’s what that equation looks like:

Cost = FN($300) + TP($60) + FP($60) + TN($0)

As an example, let’s assume that we make the following number of predictions of each:

  • FN = 10
  • TP = 5
  • FP = 5
  • TN = 5

Then our total cost would be $3600.

10*($300) + 5*($60) + 5*($60) + 5*($0) = $3600

Now let’s apply this cost evaluation to our model.

We’ll start by fitting the model, and making predictions in the form of probabilities.

# fitting the logistic regression model
fit <- glm(Churn~., data=train, family=binomial)

# making predictions
churn.probs <- predict(fit, test, type="response")

Next, let’s create a threshold vector and a cost vector.

# threshold vector
thresh <- seq(0.1,1.0, length = 10)

#cost vector
cost = rep(0,length(thresh))

The threshold vector will hold the threshold for each model. Before we were just using 0.5 as the threshold, but let’s look at thresholds in increments of 0.1 between 0 and 1 (e.g., 0.1, 0.2, 0.3…0.9, 1.0).

The cost vector will just be a vector of length 10 with zeros to start. We’ll fill this in as we loop through the function, evaluating each threshold as we go. To evaluate the cost, we’ll use the same cost equation we just went over.

Now we’ll create a for loop to make predictions using the various thresholds, and evaluate the cost of each as we go.

# cost as a function of threshold
for (i in 1:length(thresh)){
  
  glm.pred = rep("No", length(churn.probs))
  glm.pred[churn.probs > thresh[i]] = "Yes"
  glm.pred <- as.factor(glm.pred)
  x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
  TN <- x$table[1]/1760
  FP <- x$table[2]/1760
  FN <- x$table[3]/1760
  TP <- x$table[4]/1760
  cost[i] = FN*300 + TP*60 + FP*60 + TN*0
}

Rather than use the total number of each outcome for TN, FP, FN, and TP, I used a percentage instead. That’s why I pulled each value out using the confusion matrix and divided by 1760.

There’s 1760 observations in our test set, so that’s where that number comes from. By doing it this way I’m calculating the cost per customer.

Now let’s assume that our company is currently using what we’ll call a “simple model” which just defaults to a threshold of 0.5. We can go ahead and fit that model, make predictions, and calculate the cost. We’ll call this model cost_simple.

# simple model - assume threshold is 0.5
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)

x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost_simple = FN*300 + TP*60 + FP*60 + TN*0

Finally, we can put all of the results in a dataframe and plot them.

# putting results in a dataframe for plotting
dat <- data.frame(
  model = c(rep("optimized",10),"simple"),
  cost_per_customer = c(cost,cost_simple),
  threshold = c(thresh,0.5)
)

# plotting
ggplot(dat, aes(x = threshold, y = cost_per_customer, group = model, colour = model)) +
  geom_line() +
  geom_point()

cost optimization

Looking at the results, we can see that the minimum cost per customer is about $40.00 at a threshold of 0.2.

The “simple” model that our company is currently implementing costs about $48.00 per customer, at a threshold of 0.50.

If we assume that we have a customer base of approximately 500,000 the switching from the simple model to the optimized model produces a cost savings of $4MM.

# cost savings of optimized model (threshold = 0.2) compared to baseline model (threshold = 0.5)
savings_per_customer = cost_simple - min(cost)

total_savings = 500000*savings_per_customer

total_savings
> total_savings
[1] 4107955

This type of cost savings could be a significant business impact depending on the size of the business.

Conclusion

In this post we developed a machine learning model to predict customer churn.

Specifically, we walked through each of the following steps:

  • Business Background
  • Logistic Regression
  • Preparing The Data
  • Fitting The Model
  • Making Predictions
  • Business Impact

We concluded by developing an optimized logistic regression model for our customer churn problem.

Assuming the company is using a logistic regression model with a default threshold of 0.5, we were able to identify that the optimum threshold is actually 0.2.

This reduced cost per customer from $48.00 to $40.00. With a customer base of approximately 500,000 this would produce a yearly cost savings of $4MM. This was a significant business impact!

Although this was a purely hypothetical example, it’s very similar to what you will encounter in the real-world.

Being able to effectively identify a problem and show a real business impact will set you apart from other data scientists in the job market.

It’s one thing to be able to implement a machine learning model and evaluate it. If you can do all of that and show results in the form of a business impact as we did in this post, then you’ll be well on your way to landing a job!

Get Access to the top Data Science books in 2018.

Create Better Data Science Projects With Business Impact: Churn Prediction with R

Getting a job isn’t easy, you need to set yourself apart.

It’s not enough to just show that you know how to use a tool like scikit-learn or ggplot2, that’s a given.

So the question is, what can you do?

One of my favorite strategies is showing a business impact.

What do I mean by “business impact?”

At the end of the day, you’re going to directly or indirectly work on something that is geared towards the business side of things.

That might be reducing cost, increasing revenue, improving customer experience, and so on.

In this post I’ll show you step-by-step how to build a customer churn predictive model that shows a significant business impact.

Here’s a quick outline:

  • Business Background
  • Logistic Regression
  • Preparing The Data
  • Fitting The Model
  • Making Predictions
  • Business Impact

Business Background

At the beginning of any real-world data science project you need to start by asking a series of questions.

Here’s a few good ones to start with:

  • What’s the problem you’re trying to solve?
  • What’s your potential solution?
  • How will you evaluate your model?

Continuing with the customer churn idea from before, let’s assume that you work in the telecommunications industry.

One day your boss comes to you and asks, “how can we improve the business using our existing data?”

business data

Let’s take a look at answering each question in detail to come up with a strategic approach for tackling the boss’s question.

What’s the problem you’re trying to solve?

After looking at your existing data, you’ve identified that it’s about five times more expensive to acquire new customers than retain existing ones.

Now the more focused question becomes, “how can we increase customer retention rate to decrease cost?”

What’s your potential solution?

Since we have access to existing customer data, we can try and build a machine learning model to predict customer churn.

To avoid over complicating things with something too abstract to explain to management, we’ll use a logistic regression model.

How will you evaluate your model?

We’ll use a series of machine learning evaluation metrics (ROC, AUC, sensitivity, specificity) as well as business oriented metrics (cost savings).

ROC curve

Logistic Regression

Now that we’re familiar with the business background, and we’ve scoped our problem, let’s look at our potential solution.

There’s numerous machine learning models we could attempt to use. All have pros and cons.

To keep things simple, let’s just stick to logistic regression.

Logistic regression is a linear classifier. Since we’re trying to predict “churn” or “didn’t churn” a classification model is exactly what we need.

decision boundary

This is a great model because it’s easier to interpret than a lot of other models, like Random Forest. What I mean by “interpret” is that we can more easily see the relationship between the features and the output.

The downside to logistic regression is that it has a bias towards linear fits. If the decision boundary isn’t linear, it might not perform as well as a model like Random Forest.

Basically we’re trading off interpretability for flexibility. This is always a consideration when implementing machine learning models.

Preparing The Data

The next step is to prepare the data.

This workflow will vary from project to project, but for our example, I’m going to use the following workflow:

  1. Import the data
  2. Take a quick look
  3. Clean the data
  4. Split the data

I’ve left out a full exploratory phase in the workflow. In a future post I’ll go over that aspect because I think it’s equally, if not more important, than the modeling phase.

1) Import the data

Let’s start by importing the data. You can head on over to our github page to grab a copy.

We’ll be taking advantage of the Tidyverse library. This is a must-have library for data scientists using R. Be sure to check out their documentation.

library(tidyverse)

# setting the working directory
path_loc <- "C:/Users/Jonathan/Desktop/post"
setwd(path_loc)

# reading in the data
df <- read_csv("Telco Data.csv")

I like to set the path to my working directory at the beginning. Be sure to modify your path_loc variable to the working directory where your code and dataset will be located.

Since this is a csv file, I used the read_csv function to read the data into a dataframe df.

2) Take a quick look

After you import the data you’ll want to take a quick look at it. I think of a “quick look” as just that, as opposed to exploring the data.

I like to start by looking at the dimensions and the names of the feature columns.

# dimensions of the data
dim_desc(df)

# names of the data
names(df)

 

> # dimensions of the data
> dim_desc(df)
[1] "[7,043 x 21]"
> 
> # names of the data
> names(df)
 [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"          "Dependents"       "tenure"          
 [7] "PhoneService"     "MultipleLines"    "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
[13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"         "PaperlessBilling" "PaymentMethod"   
[19] "MonthlyCharges"   "TotalCharges"     "Churn"  

 

We can see there’s 21 features and 7,043 rows. There’s various features such as TotalCharges and tenure . The output that we’re going to try and predict is “churn”.

Another great function is glimpse. This allows us to take a quick look at our features in more detail.

# taking a look at the data
glimpse(df)

 

> glimpse(df)
Observations: 7,043
Variables: 21
$ customerID        "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU", "9305-CDSKC", "1452-K...
$ gender            "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "...
$ SeniorCitizen     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
$ Partner           "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No...
$ Dependents        "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No"...
$ tenure            1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71, 10, 21, 1, 12, 1, 58, 4...
$ PhoneService      "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ MultipleLines     "No phone service", "No", "No", "No phone service", "No", "Yes", "Yes", "No phone service",...
$ InternetService   "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber optic", "Fiber optic", "DSL", "Fiber opti...
$ OnlineSecurity    "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No internet servic...
$ OnlineBackup      "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No internet service"...
$ DeviceProtection  "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service"...
$ TechSupport       "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No", "No internet service", ...
$ StreamingTV       "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No internet service",...
$ StreamingMovies   "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service", ...
$ Contract          "Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month", "Month-to-mon...
$ PaperlessBilling  "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes",...
$ PaymentMethod     "Electronic check", "Mailed check", "Mailed check", "Bank transfer (automatic)", "Electroni...
$ MonthlyCharges    29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15, 49.95, 18.95, 100.35...
$ TotalCharges      29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.40, 301.90, 3046.05, 3487.95, 587.45,...
$ Churn             "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "N...

 

3) Clean the data

We’ll need to do some minor cleaning before we can start using a logistic regression model.

Lets start by mutating the character variables to factors.

# changing character variables to factors
df <- df %>% mutate_if(is.character, as.factor)

 

In this code we’re using dplyr’s mutate_if function to change character variables to factor types.

The %>% is known as a pipe operator. It comes from the magrittr library which is also part of the Tidyverse.

The basic idea of this operator is that it makes your code more readable. You can learn more about the pipe operator here.

Lets also change the SeniorCitizen variable from an integer to a factor.


# changing SeniorCitizen variable to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)

 

Next, we’ll check for missing values. We can do this with the map function from the purrr library. If you haven’t already guessed, this library is also part of the Tidyverse.

# looking for missing values
df %>% map(~ sum(is.na(.)))

 

> df %>% map(~ sum(is.na(.)))
$`customerID`
[1] 0

$gender
[1] 0

$SeniorCitizen
[1] 0

$Partner
[1] 0

$Dependents
[1] 0

$tenure
[1] 0

$PhoneService
[1] 0

$MultipleLines
[1] 0

$InternetService
[1] 0

$OnlineSecurity
[1] 0

$OnlineBackup
[1] 0

$DeviceProtection
[1] 0

$TechSupport
[1] 0

$StreamingTV
[1] 0

$StreamingMovies
[1] 0

$Contract
[1] 0

$PaperlessBilling
[1] 0

$PaymentMethod
[1] 0

$MonthlyCharges
[1] 0

$TotalCharges
[1] 11

$Churn
[1] 0

 

We can see that TotalCharges has 11 missing values. To replace these missing values we’ll just do a simple replacement with the median.

# imputing with the median
df <- df %>% 
  mutate(TotalCharges = replace(TotalCharges,
                                is.na(TotalCharges),
                                median(TotalCharges, na.rm = T)))

 

For a more technically rigorous approach to replacing missing values, check out this source.

The last thing we’ll do is remove the CustomerID feature. This is a unique identifier for each customer, so it’s probably not going to add any helpful information to our model.

# removing customerID; doesn't add any value to the model
df <- df %>% select(-customerID)

 

If you want to test this, you can always just leave this feature in and see how the results compare.

4) Split the data

To make sure that we’re not overfitting our model, we’ll split the data into a training set and a test set. This is known as cross validation.

We’ll train the model on the training set, and then test out its performance on the test set. We’ll randomly select 75% of the data to be our training set, and 25% to be our test set. I encourage you to try different splits to see how your results compare (maybe 80%/20% and 60%/40% for your train/test sets).

To create the splits we’ll use the Caret package. Lets start by importing Caret and setting a seed so that our results are reproducible.

library(caret)

# selecting random seed to reproduce results
set.seed(5)

 

Next, we’ll use the createDataPartition function to select 75% of our data to use in the training set. This will select a random sample of 75% of the rows.

# sampling 75% of the rows
inTrain <- createDataPartition(y = df$Churn, p=0.75, list=FALSE)

 

Finally, we’ll create our train and test dataframes using the sample of rows from above.

# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]

There’s other methods such as k-fold cross-validation, so be sure to read up on those as well.

For information on how to implement k-fold cross-validation in Caret, type in help(“createFolds”) in to the R console.

Fitting The Model

Now that we’ve split the data in to training and test sets, it’s time to fit the model.

To implement a logistic regression model, we’ll use the generalized linear models (GLM) function, glm.

There’s different types of GLMs, which includes logistic regression. To specify that we want to perform a binary logistic regression, we’ll use the argument family=binomial.

Here’s the full code for fitting the logistic regression model:

# fitting the model
fit <- glm(Churn~., data=train, family=binomial)

In the next section we’ll make predictions and evaluate our model.

Making Predictions

Now that we’ve fit our model, it’s time to see how it performs.

To do this, we’ll make predictions using the test data set. We’ll pass in the fit model from the previous section. To predict the probabilities we’ll specify type=”response”.

# making predictions
churn.probs <- predict(fit, test, type="response")
head(churn.probs)
> head(churn.probs)
         1          2          3          4          5          6 
0.32756804 0.77302887 0.56592677 0.20112771 0.05152568 0.15085976 

We’ll want to convert these probabilities to a binary response since the Churn variable we’re predicting is “Yes” or “No”.

I’m not sure how R is coding the response, but I can quickly check this using the contrasts function.

# Looking at the response encoding
contrasts(df$Churn)
> contrasts(df$Churn)
    Yes
No    0
Yes   1

Looking at the result, we can see that Yes is being encoded using 1.

Now that we know the response encoding, we can convert our predicted results to Yes and No responses.

We’ll set the response threshold at 0.5, so if a predicted probability is above 0.5, we’ll convert this response to Yes.

The final step is to convert the character responses to factor types. This is so that the encoding is correct for the logistic regression model.

# converting probabilities to "Yes" and "No" 
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)

We’ll take a closer look at the threshold later, so don’t worry about why we set it at 0.5. For now, we’ll just use this value for purposes of this example.

An important part of making predictions is evaluating and validating the model.

Let’s take a detailed look at some evaluation metrics, and finish up with a more rigorous approach to model validation, k-fold cross-validation.

Evaluating The Model

After we’ve made predictions, it’s time to evaluate our model.

A great tool to quickly do this is using the confusionMatrix function from Caret.

We’ll feed in our glm.pred array of predictions, as well as the actual results from test$Churn.

Finally, we’ll specify the positive class as “Yes” using positive=”Yes”.

# creating a confusion matrix
confusionMatrix(glm.pred, test$Churn, positive = "Yes")
> confusionMatrix(glm.pred, test$Churn, positive = "Yes")
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1165  205
       Yes  128  262
                                          
               Accuracy : 0.8108          
                 95% CI : (0.7917, 0.8288)
    No Information Rate : 0.7347          
    P-Value [Acc > NIR] : 4.239e-14       
                                          
                  Kappa : 0.4877          
 Mcnemar's Test P-Value : 3.117e-05       
                                          
            Sensitivity : 0.5610          
            Specificity : 0.9010          
         Pos Pred Value : 0.6718          
         Neg Pred Value : 0.8504          
             Prevalence : 0.2653          
         Detection Rate : 0.1489          
   Detection Prevalence : 0.2216          
      Balanced Accuracy : 0.7310          
                                          
       'Positive' Class : Yes     

This function produces both a confusion matrix, as well as other statistics of interest.

A confusion matrix displays how many correct and incorrect predictions were made for each class.

Here’s a quick look at the confusion matrix from our model:

We can see that the model correctly predicted “No” 1165 times, and incorrectly predicted “No” when the actual response was “Yes” 205 times.

Likewise, the model correctly predicted “Yes” when the actual answer was “Yes” 262 times. At the same time, it incorrectly predicted “Yes” when the actual response was “No” 128 times.

Our overall accuracy is 81%. A simple baseline model is to predict the majority class, which in our case is “No”.

If we just predicted the majority class, our accuracy would be 73%. There’s 1760 observations in the test set, and 1293 are “No”. If you divide the 1293 by 1760 that’s how you arrive at 73% accuracy.

Some other useful metrics are sensitivity and specificity. Since our classes are slightly imbalanced (~73%=“No”, ~27%=“Yes”) these metrics can be more useful.

The sensitivity indicates the “True Positive” rate. This is a measure of how accurately we predicted the positive class, which in our case is “Yes”.

The confusionMatrix function reports this directly, but if you want to calculate it yourself, divide the “True Positives” by the sum of the “True Positives” and “False Negatives”. Here’s a visual to make that clearer:

sensitivity

Another useful metric is the specificity, which indicates the “True Negative” rate. This is a measure of how accurately we predicted the negative class. Here’s a visual explaining specificity:

specificity

Another useful metric is the Area Under the Receiver Operating Characteristic (ROC) Curve, also known as AUC.

Initially implemented during World War II for analyzing radar signals, the ROC is a plot of the True Positive Rate vs. the False Positive Rate.

Going back to our original model, I stated that I would use 0.5 as the threshold for making “yes” (or positive) predictions. I didn’t really have a good justification for picking 0.5 though.

The ROC is a great tool because it plots the TPR vs. the FPR as the threshold is varied. We can make a plot of the ROC curve using the ROCR library. Here’s the full R code:

library(ROCR)
# need to create prediction object from ROCR
pr <- prediction(churn.probs, test$Churn)

# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

ROC

As mentioned before, another useful measure is the area under the ROC curve, known as the AUC.

The AUC can take on any value between 0 and 1, with 1 being the best. This is a convenient way to boil down the ROC to a single number for evaluating a model.
Here’s the R code for evaluating our model:

# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
> auc
[1] 0.8481338

Our model has an AUC of 0.85, which is pretty good. If we were to just make random guesses, our ROC would be a 45 degree line. This would correspond to an AUC of 0.5.

We’re outperforming random guessing, so we know that our model is at least adding some value!

K-fold Cross Validation

Now that we’ve trained, tested, and evaluated our model, let’s evaluate the model a little more rigorously.

We split our dataset in to training and test datasets, which is a great way to try and prevent overfitting.

An even better approach is using K-fold Cross Validation.

k-fold cross validation

In this model validation technique, we randomly partition the data in to test and training sets by specifying a certain number of “folds”. In the example above, the number of folds is k=4.

After we run the model on each fold, we average the evaluation metric from each one. So if we ran the model four times using ROC, we would average each of the four ROC values together. This is a great way to try and prevent overfitting a model.

A common number of folds to use is 10, so that’s what we’ll use with our data. We’ll also repeat the process 3 times, just to add a little bit more technical rigor to our approach. Here’s the full R-code:

# setting a seed for reproduceability
set.seed(10)

# changing the positive class to "Yes"
df$Churn <- as.character(df$Churn)
df$Churn[df$Churn == "No"] <- "Y"
df$Churn[df$Churn == "Yes"] <- "N"
df$Churn[df$Churn == "Y"] <- "Yes"
df$Churn[df$Churn == "N"] <- "No"

# train control
fitControl <- trainControl(## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated 3 times
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary)

# logistic regression model
logreg <- train(Churn ~., df,
                method = "glm",
                family = "binomial",
                trControl = fitControl,
                metric = "ROC")
logreg
> logreg
Generalized Linear Model 

7043 samples
  19 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 6338, 6339, 6338, 6339, 6339, 6339, ... 
Resampling results:

  ROC        Sens       Spec   
  0.8455546  0.5500297  0.89602

We start by setting up k-fold CV settings using the trainControl function. All of the inputs are pretty straightforward. As I mentioned previously, we’re using 10 folds, repeated 3 times.

Next we train our model. As we did earlier, we’re using the “binomial” family from the “glm” method. To evaluate our model, we’ll use “ROC”. The model is actually reporting AUC, but the way we specify that in the train function is with metric=”ROC”.

You’ll also notice that I changed the positive class to “Yes” right before the code for the trainControl function. I did this so that I could compare the sensitivity and specificity with our previous results. This is just a minor quirk in the function, so don’t get hung up on this.

The results are similar to what we got previously.

As before, our AUC is 0.85. This is reported in the output as ROC, but it’s actually the AUC.

The true positive rate (sensitivity) is 0.55 and the true negative rate (specificity) is 0.90.

Business Impact

So far we’ve used k-fold cross validation and logistic regression to predict customer churn.

We’ve looked at useful evaluation metrics such as AUC, sensitivity, and specificity.

That’s all well and good, but now what?

If we went to the CEO and presented these results alone, he or she would say “so what?”

Our actual goal with developing this model is to show a business impact. In our case, this would be cost savings.

Let me walk through how we would show a cost savings, step-by-step.

To start, let’s make some assumptions about some various costs.

Let’s assume that customer acquisition cost in the telecom industry is approximately $300. If we make a prediction that a customer won’t churn, but they actually do (false negative, FN), then we’ll have to go out and spend $300 to acquire a replacement for that customer.

Now let’s assume that it’s 5 times more expensive to acquire a new customer rather than retain an existing one. If we predict that a customer will churn, we’ll need to spend $60 to retain that customer.

Sometimes we’ll correctly predict that a customer will churn (true positive, TP), and sometimes we’ll incorrectly predict that a customer will churn (false positive, FP). In both cases, we’ll spend $60 to retain the customer.

Finally, there’s the scenario where we correctly predict that a customer won’t churn (true negative, TN). In this scenario we won’t spend any money. These are happy customers that we’ve correctly identified as happy.

Here’s a quick summary of the costs:

  • FN (predict that a customer won’t churn, but they actually do): $300
  • TP (predict that a customer would churn, when they actually would): $60
  • FP (predict that a customer would churn, when they actually wouldn’t): $60
  • TN (predict that a customer won’t churn, when they actually wouldn’t): $0

If we multiply the number of each prediction type (FN, TP, FP, and TN) by the cost associated with each and sum them up, then we can calculate the total cost associated with our model. Here’s what that equation looks like:

Cost = FN($300) + TP($60) + FP($60) + TN($0)

As an example, let’s assume that we make the following number of predictions of each:

  • FN = 10
  • TP = 5
  • FP = 5
  • TN = 5

Then our total cost would be $3600.

10*($300) + 5*($60) + 5*($60) + 5*($0) = $3600

Now let’s apply this cost evaluation to our model.

We’ll start by fitting the model, and making predictions in the form of probabilities.

# fitting the logistic regression model
fit <- glm(Churn~., data=train, family=binomial)

# making predictions
churn.probs <- predict(fit, test, type="response")

Next, let’s create a threshold vector and a cost vector.

# threshold vector
thresh <- seq(0.1,1.0, length = 10)

#cost vector
cost = rep(0,length(thresh))

The threshold vector will hold the threshold for each model. Before we were just using 0.5 as the threshold, but let’s look at thresholds in increments of 0.1 between 0 and 1 (e.g., 0.1, 0.2, 0.3…0.9, 1.0).

The cost vector will just be a vector of length 10 with zeros to start. We’ll fill this in as we loop through the function, evaluating each threshold as we go. To evaluate the cost, we’ll use the same cost equation we just went over.

Now we’ll create a for loop to make predictions using the various thresholds, and evaluate the cost of each as we go.

# cost as a function of threshold
for (i in 1:length(thresh)){
  
  glm.pred = rep("No", length(churn.probs))
  glm.pred[churn.probs > thresh[i]] = "Yes"
  glm.pred <- as.factor(glm.pred)
  x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
  TN <- x$table[1]/1760
  FP <- x$table[2]/1760
  FN <- x$table[3]/1760
  TP <- x$table[4]/1760
  cost[i] = FN*300 + TP*60 + FP*60 + TN*0
}

Rather than use the total number of each outcome for TN, FP, FN, and TP, I used a percentage instead. That’s why I pulled each value out using the confusion matrix and divided by 1760.

There’s 1760 observations in our test set, so that’s where that number comes from. By doing it this way I’m calculating the cost per customer.

Now let’s assume that our company is currently using what we’ll call a “simple model” which just defaults to a threshold of 0.5. We can go ahead and fit that model, make predictions, and calculate the cost. We’ll call this model cost_simple.

# simple model - assume threshold is 0.5
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)

x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost_simple = FN*300 + TP*60 + FP*60 + TN*0

Finally, we can put all of the results in a dataframe and plot them.

# putting results in a dataframe for plotting
dat <- data.frame(
  model = c(rep("optimized",10),"simple"),
  cost_per_customer = c(cost,cost_simple),
  threshold = c(thresh,0.5)
)

# plotting
ggplot(dat, aes(x = threshold, y = cost_per_customer, group = model, colour = model)) +
  geom_line() +
  geom_point()

cost optimization

Looking at the results, we can see that the minimum cost per customer is about $40.00 at a threshold of 0.2.

The “simple” model that our company is currently implementing costs about $48.00 per customer, at a threshold of 0.50.

If we assume that we have a customer base of approximately 500,000 the switching from the simple model to the optimized model produces a cost savings of $4MM.

# cost savings of optimized model (threshold = 0.2) compared to baseline model (threshold = 0.5)
savings_per_customer = cost_simple - min(cost)

total_savings = 500000*savings_per_customer

total_savings
> total_savings
[1] 4107955

This type of cost savings could be a significant business impact depending on the size of the business.

Conclusion

In this post we developed a machine learning model to predict customer churn.

Specifically, we walked through each of the following steps:

  • Business Background
  • Logistic Regression
  • Preparing The Data
  • Fitting The Model
  • Making Predictions
  • Business Impact

We concluded by developing an optimized logistic regression model for our customer churn problem.

Assuming the company is using a logistic regression model with a default threshold of 0.5, we were able to identify that the optimum threshold is actually 0.2.

This reduced cost per customer from $48.00 to $40.00. With a customer base of approximately 500,000 this would produce a yearly cost savings of $4MM. This was a significant business impact!

Although this was a purely hypothetical example, it’s very similar to what you will encounter in the real-world.

Being able to effectively identify a problem and show a real business impact will set you apart from other data scientists in the job market.

It’s one thing to be able to implement a machine learning model and evaluate it. If you can do all of that and show results in the form of a business impact as we did in this post, then you’ll be well on your way to landing a job!

Get Access to the top Data Science books in 2018.
Free Bonus: Top 12 Data Science Books for 2018

Free Bonus: Top 12 Data Science Books for 2018

You have Successfully Subscribed!

Top 12 Data Science Books for 2018

Top 12 Data Science Books for 2018

Sign up with your email address to get instant access for free!

You have Successfully Subscribed!

Churn Prediction R Code

Churn Prediction R Code

Click to get instant access to the FREE Customer Churn Prediction R Code!

You have Successfully Subscribed!