# Create Better Data Science Projects With Business Impact: Churn Prediction with R

**Getting a job isn’t easy, you need to set yourself apart.**

It’s not enough to just show that you know how to use a tool like **scikit-learn** or **ggplot2**, that’s a given.

So the question is, what can you do?

One of my favorite strategies is **showing a business impact**.

What do I mean by **“business impact?”**

At the end of the day, you’re going to directly or indirectly work on something that is geared towards the business side of things.

That might be **reducing cost**, **increasing revenue**, **improving customer experience**, and so on.

In this post I’ll show you step-by-step how to build a customer churn** predictive model** that shows a **significant business impact**.

Here’s a quick outline:

- Business Background
- Logistic Regression
- Preparing The Data
- Fitting The Model
- Making Predictions
- Business Impact

## Business Background

At the beginning of any real-world data science project you need to start by asking a series of questions.

Here’s a few good ones to start with:

- What’s the problem you’re trying to solve?
- What’s your potential solution?
- How will you evaluate your model?

Continuing with the **customer churn **idea from before, let’s assume that you work in the telecommunications industry.

One day your boss comes to you and asks, **“how can we improve the business using our existing data?”**

Let’s take a look at answering each question in detail to come up with a strategic approach for tackling the boss’s question.

*What’s the problem you’re trying to solve?*

*What’s the problem you’re trying to solve?*

After looking at your existing data, you’ve identified that it’s about five times more expensive to acquire new customers than retain existing ones.

Now the more focused question becomes, **“how can we increase customer retention rate to decrease cost?”**

**What’s your potential solution?**

**What’s your potential solution?**

Since we have access to existing customer data, we can try and build a machine learning model to predict **customer churn**.

To avoid over complicating things with something too abstract to explain to management, we’ll use a logistic regression model.

**How will you evaluate your model?**

**How will you evaluate your model?**

We’ll use a series of machine learning **evaluation metrics** (ROC, AUC, sensitivity, specificity) as well as **business oriented metrics** (cost savings).

## Logistic Regression

Now that we’re familiar with the business background, and we’ve scoped our problem, let’s look at our potential solution.

There’s numerous machine learning models we could attempt to use. All have pros and cons.

To keep things simple, let’s just stick to logistic regression.

Logistic regression is a linear classifier. Since we’re trying to predict “churn” or “didn’t churn” a classification model is exactly what we need.

This is a great model because it’s easier to interpret than a lot of other models, like Random Forest. What I mean by “interpret” is that we can more easily see the relationship between the features and the output.

The downside to logistic regression is that it has a bias towards linear fits. If the decision boundary isn’t linear, it might not perform as well as a model like Random Forest.

Basically we’re trading off interpretability for flexibility. This is always a consideration when implementing machine learning models.

## Preparing The Data

The next step is to prepare the data.

This workflow will vary from project to project, but for our example, I’m going to use the following workflow:

- Import the data
- Take a quick look
- Clean the data
- Split the data

I’ve left out a full exploratory phase in the workflow. In a future post I’ll go over that aspect because I think it’s equally, if not more important, than the modeling phase.

**1) Import the data**

Let’s start by importing the data. You can head on over to our github page to grab a copy.

We’ll be taking advantage of the Tidyverse library. This is a must-have library for data scientists using R. Be sure to check out their documentation.

```
library(tidyverse)
# setting the working directory
path_loc <- "C:/Users/Jonathan/Desktop/post"
setwd(path_loc)
# reading in the data
df <- read_csv("Telco Data.csv")
```

I like to set the path to my working directory at the beginning. Be sure to modify your `path_loc`

variable to the working directory where your code and dataset will be located.

Since this is a csv file, I used the `read_csv`

function to read the data into a dataframe `df`

.

**2) Take a quick look**

After you import the data you’ll want to take a quick look at it. I think of a “quick look” as just that, as opposed to exploring the data.

I like to start by looking at the dimensions and the names of the feature columns.

```
# dimensions of the data
dim_desc(df)
# names of the data
names(df)
```

```
> # dimensions of the data
> dim_desc(df)
[1] "[7,043 x 21]"
>
> # names of the data
> names(df)
[1] "customerID" "gender" "SeniorCitizen" "Partner" "Dependents" "tenure"
[7] "PhoneService" "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection"
[13] "TechSupport" "StreamingTV" "StreamingMovies" "Contract" "PaperlessBilling" "PaymentMethod"
[19] "MonthlyCharges" "TotalCharges" "Churn"
```

We can see there’s 21 features and 7,043 rows. There’s various features such as ` TotalCharges `

and `tenure `

. The output that we’re going to try and predict is “churn”.

Another great function is `glimpse`

. This allows us to take a quick look at our features in more detail.

```
# taking a look at the data
glimpse(df)
```

```
> glimpse(df)
Observations: 7,043
Variables: 21
$ customerID "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU", "9305-CDSKC", "1452-K...
$ gender "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "...
$ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
$ Partner "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No...
$ Dependents "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No"...
$ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71, 10, 21, 1, 12, 1, 58, 4...
$ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ MultipleLines "No phone service", "No", "No", "No phone service", "No", "Yes", "Yes", "No phone service",...
$ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber optic", "Fiber optic", "DSL", "Fiber opti...
$ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No internet servic...
$ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No internet service"...
$ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service"...
$ TechSupport "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No", "No internet service", ...
$ StreamingTV "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No internet service",...
$ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service", ...
$ Contract "Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month", "Month-to-mon...
$ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes",...
$ PaymentMethod "Electronic check", "Mailed check", "Mailed check", "Bank transfer (automatic)", "Electroni...
$ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15, 49.95, 18.95, 100.35...
$ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.40, 301.90, 3046.05, 3487.95, 587.45,...
$ Churn "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "N...
```

**3) Clean the data**

We’ll need to do some minor cleaning before we can start using a logistic regression model.

Lets start by mutating the character variables to factors.

```
# changing character variables to factors
df <- df %>% mutate_if(is.character, as.factor)
```

In this code we’re using dplyr’s `mutate_if`

function to change character variables to factor types.

The `%>%`

is known as a pipe operator. It comes from the magrittr library which is also part of the Tidyverse.

The basic idea of this operator is that it makes your code more readable. You can learn more about the pipe operator here.

Lets also change the `SeniorCitizen`

variable from an integer to a factor.

```
# changing SeniorCitizen variable to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
```

Next, we’ll check for missing values. We can do this with the `map`

function from the purrr library. If you haven’t already guessed, this library is also part of the Tidyverse.

```
# looking for missing values
df %>% map(~ sum(is.na(.)))
```

```
> df %>% map(~ sum(is.na(.)))
$`customerID`
[1] 0
$gender
[1] 0
$SeniorCitizen
[1] 0
$Partner
[1] 0
$Dependents
[1] 0
$tenure
[1] 0
$PhoneService
[1] 0
$MultipleLines
[1] 0
$InternetService
[1] 0
$OnlineSecurity
[1] 0
$OnlineBackup
[1] 0
$DeviceProtection
[1] 0
$TechSupport
[1] 0
$StreamingTV
[1] 0
$StreamingMovies
[1] 0
$Contract
[1] 0
$PaperlessBilling
[1] 0
$PaymentMethod
[1] 0
$MonthlyCharges
[1] 0
$TotalCharges
[1] 11
$Churn
[1] 0
```

We can see that `TotalCharges`

has 11 missing values. To replace these missing values we’ll just do a simple replacement with the median.

```
# imputing with the median
df <- df %>%
mutate(TotalCharges = replace(TotalCharges,
is.na(TotalCharges),
median(TotalCharges, na.rm = T)))
```

For a more technically rigorous approach to replacing missing values, check out this source.

The last thing we’ll do is remove the `CustomerID`

feature. This is a unique identifier for each customer, so it’s probably not going to add any helpful information to our model.

```
# removing customerID; doesn't add any value to the model
df <- df %>% select(-customerID)
```

If you want to test this, you can always just leave this feature in and see how the results compare.

**4) Split the data**

To make sure that we’re not overfitting our model, we’ll split the data into a training set and a test set. This is known as cross validation.

We’ll train the model on the training set, and then test out its performance on the test set. We’ll randomly select 75% of the data to be our training set, and 25% to be our test set. I encourage you to try different splits to see how your results compare (maybe 80%/20% and 60%/40% for your train/test sets).

To create the splits we’ll use the Caret package. Lets start by importing Caret and setting a seed so that our results are reproducible.

```
library(caret)
# selecting random seed to reproduce results
set.seed(5)
```

Next, we’ll use the `createDataPartition`

function to select 75% of our data to use in the training set. This will select a random sample of 75% of the rows.

```
# sampling 75% of the rows
inTrain <- createDataPartition(y = df$Churn, p=0.75, list=FALSE)
```

Finally, we’ll create our train and test dataframes using the sample of rows from above.

```
# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]
```

There’s other methods such as k-fold cross-validation, so be sure to read up on those as well.

For information on how to implement k-fold cross-validation in Caret, type in `help(“createFolds”)`

in to the R console.

## Fitting The Model

Now that we’ve split the data in to training and test sets, it’s time to fit the model.

To implement a logistic regression model, we’ll use the generalized linear models (GLM) function, `glm`

.

There’s different types of GLMs, which includes logistic regression. To specify that we want to perform a binary logistic regression, we’ll use the argument `family=binomial`

.

Here’s the full code for fitting the logistic regression model:

```
# fitting the model
fit <- glm(Churn~., data=train, family=binomial)
```

In the next section we’ll make predictions and evaluate our model.

## Making Predictions

Now that we’ve fit our model, it’s time to see how it performs.

To do this, we’ll make predictions using the `test`

data set. We’ll pass in the `fit`

model from the previous section. To predict the probabilities we’ll specify `type=”response”`

.

```
# making predictions
churn.probs <- predict(fit, test, type="response")
head(churn.probs)
```

```
> head(churn.probs)
1 2 3 4 5 6
0.32756804 0.77302887 0.56592677 0.20112771 0.05152568 0.15085976
```

We’ll want to convert these probabilities to a binary response since the `Churn`

variable we’re predicting is “Yes” or “No”.

I’m not sure how R is coding the response, but I can quickly check this using the `contrasts`

function.

```
# Looking at the response encoding
contrasts(df$Churn)
```

```
> contrasts(df$Churn)
Yes
No 0
Yes 1
```

Looking at the result, we can see that `Yes`

is being encoded using `1`

.

Now that we know the response encoding, we can convert our predicted results to `Yes`

and `No`

responses.

We’ll set the response threshold at `0.5`

, so if a predicted probability is above `0.5`

, we’ll convert this response to `Yes`

.

The final step is to convert the character responses to factor types. This is so that the encoding is correct for the logistic regression model.

```
# converting probabilities to "Yes" and "No"
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)
```

We’ll take a closer look at the threshold later, so don’t worry about why we set it at `0.5`

. For now, we’ll just use this value for purposes of this example.

An important part of making predictions is evaluating and validating the model.

Let’s take a detailed look at some evaluation metrics, and finish up with a more rigorous approach to model validation, k-fold cross-validation.

**Evaluating The Model**

After we’ve made predictions, it’s time to evaluate our model.

A great tool to quickly do this is using the `confusionMatrix`

function from Caret.

We’ll feed in our `glm.pred`

array of predictions, as well as the actual results from `test$Churn`

.

Finally, we’ll specify the positive class as “Yes” using `positive=”Yes”`

.

```
# creating a confusion matrix
confusionMatrix(glm.pred, test$Churn, positive = "Yes")
```

```
> confusionMatrix(glm.pred, test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1165 205
Yes 128 262
Accuracy : 0.8108
95% CI : (0.7917, 0.8288)
No Information Rate : 0.7347
P-Value [Acc > NIR] : 4.239e-14
Kappa : 0.4877
Mcnemar's Test P-Value : 3.117e-05
Sensitivity : 0.5610
Specificity : 0.9010
Pos Pred Value : 0.6718
Neg Pred Value : 0.8504
Prevalence : 0.2653
Detection Rate : 0.1489
Detection Prevalence : 0.2216
Balanced Accuracy : 0.7310
'Positive' Class : Yes
```

This function produces both a confusion matrix, as well as other statistics of interest.

A confusion matrix displays how many correct and incorrect predictions were made for each class.

Here’s a quick look at the confusion matrix from our model:

We can see that the model correctly predicted “No” 1165 times, and incorrectly predicted “No” when the actual response was “Yes” 205 times.

Likewise, the model correctly predicted “Yes” when the actual answer was “Yes” 262 times. At the same time, it incorrectly predicted “Yes” when the actual response was “No” 128 times.

Our overall accuracy is 81%. A simple baseline model is to predict the majority class, which in our case is “No”.

If we just predicted the majority class, our accuracy would be 73%. There’s 1760 observations in the test set, and 1293 are “No”. If you divide the 1293 by 1760 that’s how you arrive at 73% accuracy.

Some other useful metrics are **sensitivity** and **specificity**. Since our classes are slightly imbalanced (~73%=“No”, ~27%=“Yes”) these metrics can be more useful.

The **sensitivity** indicates the **“True Positive”** rate. This is a measure of how accurately we predicted the positive class, which in our case is “Yes”.

The `confusionMatrix`

function reports this directly, but if you want to calculate it yourself, divide the “True Positives” by the sum of the “True Positives” and “False Negatives”. Here’s a visual to make that clearer:

Another useful metric is the **specificity**, which indicates the **“True Negative”** rate. This is a measure of how accurately we predicted the negative class. Here’s a visual explaining specificity:

Another useful metric is the Area Under the Receiver Operating Characteristic (ROC) Curve, also known as **AUC**.

Initially implemented during World War II for analyzing radar signals, the ROC is a plot of the **True Positive Rate vs. the False Positive Rate**.

Going back to our original model, I stated that I would use 0.5 as the threshold for making “yes” (or positive) predictions. I didn’t really have a good justification for picking 0.5 though.

The ROC is a great tool because it plots the TPR vs. the FPR as the threshold is varied. We can make a plot of the ROC curve using the ROCR library. Here’s the full R code:

```
library(ROCR)
# need to create prediction object from ROCR
pr <- prediction(churn.probs, test$Churn)
# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
```

As mentioned before, another useful measure is the area under the ROC curve, known as the AUC.

The AUC can take on any value between 0 and 1, with 1 being the best. This is a convenient way to boil down the ROC to a single number for evaluating a model.

Here’s the R code for evaluating our model:

```
# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

```
> auc
[1] 0.8481338
```

Our model has an AUC of 0.85, which is pretty good. If we were to just make random guesses, our ROC would be a 45 degree line. This would correspond to an AUC of 0.5.

We’re outperforming random guessing, so we know that our model is at least adding some value!

**K-fold Cross Validation**

Now that we’ve trained, tested, and evaluated our model, let’s evaluate the model a little more rigorously.

We split our dataset in to training and test datasets, which is a great way to try and prevent overfitting.

An even better approach is using K-fold Cross Validation.

In this model validation technique, we randomly partition the data in to test and training sets by specifying a certain number of “folds”. In the example above, the number of folds is k=4.

After we run the model on each fold, we average the evaluation metric from each one. So if we ran the model four times using ROC, we would average each of the four ROC values together. This is a great way to try and prevent overfitting a model.

A common number of folds to use is 10, so that’s what we’ll use with our data. We’ll also repeat the process 3 times, just to add a little bit more technical rigor to our approach. Here’s the full R-code:

```
# setting a seed for reproduceability
set.seed(10)
# changing the positive class to "Yes"
df$Churn <- as.character(df$Churn)
df$Churn[df$Churn == "No"] <- "Y"
df$Churn[df$Churn == "Yes"] <- "N"
df$Churn[df$Churn == "Y"] <- "Yes"
df$Churn[df$Churn == "N"] <- "No"
# train control
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated 3 times
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# logistic regression model
logreg <- train(Churn ~., df,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "ROC")
logreg
```

```
> logreg
Generalized Linear Model
7043 samples
19 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 6338, 6339, 6338, 6339, 6339, 6339, ...
Resampling results:
ROC Sens Spec
0.8455546 0.5500297 0.89602
```

We start by setting up k-fold CV settings using the `trainControl`

function. All of the inputs are pretty straightforward. As I mentioned previously, we’re using 10 folds, repeated 3 times.

Next we train our model. As we did earlier, we’re using the “binomial” family from the “glm” method. To evaluate our model, we’ll use “ROC”. The model is actually reporting AUC, but the way we specify that in the `train`

function is with `metric=”ROC”`

.

You’ll also notice that I changed the positive class to “Yes” right before the code for the `trainControl`

function. I did this so that I could compare the sensitivity and specificity with our previous results. This is just a minor quirk in the function, so don’t get hung up on this.

The results are similar to what we got previously.

As before, our **AUC is 0.85**. This is reported in the output as ROC, but it’s actually the AUC.

The **true positive rate** (sensitivity) is **0.55** and the **true negative rate** (specificity) is **0.90**.

## Business Impact

So far we’ve used **k-fold cross validation** and **logistic regression** to predict customer churn.

We’ve looked at useful evaluation metrics such as **AUC**, **sensitivity**, and **specificity**.

That’s all well and good, but now what?

If we went to the CEO and presented these results alone, he or she would say **“so what?”**

Our actual goal with developing this model is to show a **business impact**. In our case, this would be **cost savings**.

Let me walk through how we would show a cost savings, step-by-step.

To start, let’s make some assumptions about some various costs.

Let’s assume that **customer acquisition cost** in the telecom industry is approximately **$300**. If we make a prediction that a customer won’t churn, but they actually do (false negative, FN), then we’ll have to go out and spend $300 to acquire a replacement for that customer.

Now let’s assume that it’s 5 times more expensive to acquire a new customer rather than retain an existing one. If we predict that a customer will churn, we’ll need to spend **$60 to retain that customer**.

Sometimes we’ll correctly predict that a customer will churn (true positive, TP), and sometimes we’ll incorrectly predict that a customer will churn (false positive, FP). In both cases, we’ll spend $60 to retain the customer.

Finally, there’s the scenario where we correctly predict that a customer won’t churn (true negative, TN). In this scenario we won’t spend any money. These are happy customers that we’ve correctly identified as happy.

Here’s a quick summary of the costs:

**FN**(predict that a customer won’t churn, but they actually do):**$300****TP**(predict that a customer would churn, when they actually would):**$60****FP**(predict that a customer would churn, when they actually wouldn’t):**$60****TN**(predict that a customer won’t churn, when they actually wouldn’t):**$0**

If we multiply the number of each prediction type (FN, TP, FP, and TN) by the cost associated with each and sum them up, then we can calculate the total cost associated with our model. Here’s what that equation looks like:

Cost = FN($300) + TP($60) + FP($60) + TN($0)

As an example, let’s assume that we make the following number of predictions of each:

- FN = 10
- TP = 5
- FP = 5
- TN = 5

Then our total cost would be **$3600**.

10*($300) + 5*($60) + 5*($60) + 5*($0) = **$3600**

Now let’s apply this cost evaluation to our model.

We’ll start by fitting the model, and making predictions in the form of probabilities.

```
# fitting the logistic regression model
fit <- glm(Churn~., data=train, family=binomial)
# making predictions
churn.probs <- predict(fit, test, type="response")
```

Next, let’s create a threshold vector and a cost vector.

```
# threshold vector
thresh <- seq(0.1,1.0, length = 10)
#cost vector
cost = rep(0,length(thresh))
```

The threshold vector will hold the threshold for each model. Before we were just using 0.5 as the threshold, but let’s look at thresholds in increments of 0.1 between 0 and 1 (e.g., 0.1, 0.2, 0.3…0.9, 1.0).

The cost vector will just be a vector of length 10 with zeros to start. We’ll fill this in as we loop through the function, evaluating each threshold as we go. To evaluate the cost, we’ll use the same cost equation we just went over.

Now we’ll create a for loop to make predictions using the various thresholds, and evaluate the cost of each as we go.

```
# cost as a function of threshold
for (i in 1:length(thresh)){
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > thresh[i]] = "Yes"
glm.pred <- as.factor(glm.pred)
x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost[i] = FN*300 + TP*60 + FP*60 + TN*0
}
```

Rather than use the total number of each outcome for TN, FP, FN, and TP, I used a percentage instead. That’s why I pulled each value out using the confusion matrix and divided by 1760.

There’s 1760 observations in our test set, so that’s where that number comes from. By doing it this way I’m calculating the **cost per customer**.

Now let’s assume that our company is currently using what we’ll call a **“simple model”** which just defaults to a **threshold of 0.5**. We can go ahead and fit that model, make predictions, and calculate the cost. We’ll call this model `cost_simple`

.

```
# simple model - assume threshold is 0.5
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)
x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost_simple = FN*300 + TP*60 + FP*60 + TN*0
```

Finally, we can put all of the results in a dataframe and plot them.

```
# putting results in a dataframe for plotting
dat <- data.frame(
model = c(rep("optimized",10),"simple"),
cost_per_customer = c(cost,cost_simple),
threshold = c(thresh,0.5)
)
# plotting
ggplot(dat, aes(x = threshold, y = cost_per_customer, group = model, colour = model)) +
geom_line() +
geom_point()
```

Looking at the results, we can see that the **minimum cost per customer** is about **$40.00** at a threshold of **0.2**.

The **“simple” model** that our company is currently implementing costs about **$48.00** per customer, at a threshold of **0.50**.

If we assume that we have a customer base of approximately **500,000** the switching from the simple model to the optimized model produces a **cost savings of $4MM**.

```
# cost savings of optimized model (threshold = 0.2) compared to baseline model (threshold = 0.5)
savings_per_customer = cost_simple - min(cost)
total_savings = 500000*savings_per_customer
total_savings
```

```
> total_savings
[1] 4107955
```

This type of cost savings could be a significant business impact depending on the size of the business.

## Conclusion

In this post we developed a machine learning model to predict **customer churn**.

Specifically, we walked through each of the following steps:

- Business Background
- Logistic Regression
- Preparing The Data
- Fitting The Model
- Making Predictions
- Business Impact

We concluded by developing an **optimized logistic regression model** for our customer churn problem.

Assuming the company is using a logistic regression model with a **default threshold of 0.5**, we were able to identify that the **optimum threshold is actually 0.2**.

This reduced cost per customer from **$48.00** to **$40.00**. With a customer base of approximately **500,000** this would produce a yearly **cost savings of $4MM**. This was a significant business impact!

Although this was a purely hypothetical example, it’s very similar to what you will encounter in the real-world.

Being able to effectively identify a problem and show a real **business impact** will set you apart from other data scientists in the job market.

It’s one thing to be able to implement a machine learning model and evaluate it. If you can do all of that and show results in the form of a business impact as we did in this post, then you’ll be well on your way to landing a job!

# Create Better Data Science Projects With Business Impact: Churn Prediction with R

**Getting a job isn’t easy, you need to set yourself apart.**

It’s not enough to just show that you know how to use a tool like **scikit-learn** or **ggplot2**, that’s a given.

So the question is, what can you do?

One of my favorite strategies is **showing a business impact**.

What do I mean by **“business impact?”**

At the end of the day, you’re going to directly or indirectly work on something that is geared towards the business side of things.

That might be **reducing cost**, **increasing revenue**, **improving customer experience**, and so on.

In this post I’ll show you step-by-step how to build a customer churn** predictive model** that shows a **significant business impact**.

Here’s a quick outline:

- Business Background
- Logistic Regression
- Preparing The Data
- Fitting The Model
- Making Predictions
- Business Impact

## Business Background

At the beginning of any real-world data science project you need to start by asking a series of questions.

Here’s a few good ones to start with:

- What’s the problem you’re trying to solve?
- What’s your potential solution?
- How will you evaluate your model?

Continuing with the **customer churn **idea from before, let’s assume that you work in the telecommunications industry.

One day your boss comes to you and asks, **“how can we improve the business using our existing data?”**

Let’s take a look at answering each question in detail to come up with a strategic approach for tackling the boss’s question.

*What’s the problem you’re trying to solve?*

*What’s the problem you’re trying to solve?*

After looking at your existing data, you’ve identified that it’s about five times more expensive to acquire new customers than retain existing ones.

Now the more focused question becomes, **“how can we increase customer retention rate to decrease cost?”**

**What’s your potential solution?**

**What’s your potential solution?**

Since we have access to existing customer data, we can try and build a machine learning model to predict **customer churn**.

To avoid over complicating things with something too abstract to explain to management, we’ll use a logistic regression model.

**How will you evaluate your model?**

**How will you evaluate your model?**

We’ll use a series of machine learning **evaluation metrics** (ROC, AUC, sensitivity, specificity) as well as **business oriented metrics** (cost savings).

## Logistic Regression

Now that we’re familiar with the business background, and we’ve scoped our problem, let’s look at our potential solution.

There’s numerous machine learning models we could attempt to use. All have pros and cons.

To keep things simple, let’s just stick to logistic regression.

Logistic regression is a linear classifier. Since we’re trying to predict “churn” or “didn’t churn” a classification model is exactly what we need.

This is a great model because it’s easier to interpret than a lot of other models, like Random Forest. What I mean by “interpret” is that we can more easily see the relationship between the features and the output.

The downside to logistic regression is that it has a bias towards linear fits. If the decision boundary isn’t linear, it might not perform as well as a model like Random Forest.

Basically we’re trading off interpretability for flexibility. This is always a consideration when implementing machine learning models.

## Preparing The Data

The next step is to prepare the data.

This workflow will vary from project to project, but for our example, I’m going to use the following workflow:

- Import the data
- Take a quick look
- Clean the data
- Split the data

I’ve left out a full exploratory phase in the workflow. In a future post I’ll go over that aspect because I think it’s equally, if not more important, than the modeling phase.

**1) Import the data**

Let’s start by importing the data. You can head on over to our github page to grab a copy.

We’ll be taking advantage of the Tidyverse library. This is a must-have library for data scientists using R. Be sure to check out their documentation.

```
library(tidyverse)
# setting the working directory
path_loc <- "C:/Users/Jonathan/Desktop/post"
setwd(path_loc)
# reading in the data
df <- read_csv("Telco Data.csv")
```

I like to set the path to my working directory at the beginning. Be sure to modify your `path_loc`

variable to the working directory where your code and dataset will be located.

Since this is a csv file, I used the `read_csv`

function to read the data into a dataframe `df`

.

**2) Take a quick look**

After you import the data you’ll want to take a quick look at it. I think of a “quick look” as just that, as opposed to exploring the data.

I like to start by looking at the dimensions and the names of the feature columns.

```
# dimensions of the data
dim_desc(df)
# names of the data
names(df)
```

```
> # dimensions of the data
> dim_desc(df)
[1] "[7,043 x 21]"
>
> # names of the data
> names(df)
[1] "customerID" "gender" "SeniorCitizen" "Partner" "Dependents" "tenure"
[7] "PhoneService" "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection"
[13] "TechSupport" "StreamingTV" "StreamingMovies" "Contract" "PaperlessBilling" "PaymentMethod"
[19] "MonthlyCharges" "TotalCharges" "Churn"
```

We can see there’s 21 features and 7,043 rows. There’s various features such as ` TotalCharges `

and `tenure `

. The output that we’re going to try and predict is “churn”.

Another great function is `glimpse`

. This allows us to take a quick look at our features in more detail.

```
# taking a look at the data
glimpse(df)
```

```
> glimpse(df)
Observations: 7,043
Variables: 21
$ customerID "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW", "9237-HQITU", "9305-CDSKC", "1452-K...
$ gender "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "...
$ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
$ Partner "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No...
$ Dependents "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "No"...
$ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71, 10, 21, 1, 12, 1, 58, 4...
$ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ MultipleLines "No phone service", "No", "No", "No phone service", "No", "Yes", "Yes", "No phone service",...
$ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber optic", "Fiber optic", "DSL", "Fiber opti...
$ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes", "No internet servic...
$ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No internet service"...
$ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service"...
$ TechSupport "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No", "No internet service", ...
$ StreamingTV "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No internet service",...
$ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No internet service", ...
$ Contract "Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month", "Month-to-mon...
$ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes",...
$ PaymentMethod "Electronic check", "Mailed check", "Mailed check", "Bank transfer (automatic)", "Electroni...
$ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15, 49.95, 18.95, 100.35...
$ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949.40, 301.90, 3046.05, 3487.95, 587.45,...
$ Churn "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "N...
```

**3) Clean the data**

We’ll need to do some minor cleaning before we can start using a logistic regression model.

Lets start by mutating the character variables to factors.

```
# changing character variables to factors
df <- df %>% mutate_if(is.character, as.factor)
```

In this code we’re using dplyr’s `mutate_if`

function to change character variables to factor types.

The `%>%`

is known as a pipe operator. It comes from the magrittr library which is also part of the Tidyverse.

The basic idea of this operator is that it makes your code more readable. You can learn more about the pipe operator here.

Lets also change the `SeniorCitizen`

variable from an integer to a factor.

```
# changing SeniorCitizen variable to factor
df$SeniorCitizen <- as.factor(df$SeniorCitizen)
```

Next, we’ll check for missing values. We can do this with the `map`

function from the purrr library. If you haven’t already guessed, this library is also part of the Tidyverse.

```
# looking for missing values
df %>% map(~ sum(is.na(.)))
```

```
> df %>% map(~ sum(is.na(.)))
$`customerID`
[1] 0
$gender
[1] 0
$SeniorCitizen
[1] 0
$Partner
[1] 0
$Dependents
[1] 0
$tenure
[1] 0
$PhoneService
[1] 0
$MultipleLines
[1] 0
$InternetService
[1] 0
$OnlineSecurity
[1] 0
$OnlineBackup
[1] 0
$DeviceProtection
[1] 0
$TechSupport
[1] 0
$StreamingTV
[1] 0
$StreamingMovies
[1] 0
$Contract
[1] 0
$PaperlessBilling
[1] 0
$PaymentMethod
[1] 0
$MonthlyCharges
[1] 0
$TotalCharges
[1] 11
$Churn
[1] 0
```

We can see that `TotalCharges`

has 11 missing values. To replace these missing values we’ll just do a simple replacement with the median.

```
# imputing with the median
df <- df %>%
mutate(TotalCharges = replace(TotalCharges,
is.na(TotalCharges),
median(TotalCharges, na.rm = T)))
```

For a more technically rigorous approach to replacing missing values, check out this source.

The last thing we’ll do is remove the `CustomerID`

feature. This is a unique identifier for each customer, so it’s probably not going to add any helpful information to our model.

```
# removing customerID; doesn't add any value to the model
df <- df %>% select(-customerID)
```

If you want to test this, you can always just leave this feature in and see how the results compare.

**4) Split the data**

To make sure that we’re not overfitting our model, we’ll split the data into a training set and a test set. This is known as cross validation.

We’ll train the model on the training set, and then test out its performance on the test set. We’ll randomly select 75% of the data to be our training set, and 25% to be our test set. I encourage you to try different splits to see how your results compare (maybe 80%/20% and 60%/40% for your train/test sets).

To create the splits we’ll use the Caret package. Lets start by importing Caret and setting a seed so that our results are reproducible.

```
library(caret)
# selecting random seed to reproduce results
set.seed(5)
```

Next, we’ll use the `createDataPartition`

function to select 75% of our data to use in the training set. This will select a random sample of 75% of the rows.

```
# sampling 75% of the rows
inTrain <- createDataPartition(y = df$Churn, p=0.75, list=FALSE)
```

Finally, we’ll create our train and test dataframes using the sample of rows from above.

```
# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]
```

There’s other methods such as k-fold cross-validation, so be sure to read up on those as well.

For information on how to implement k-fold cross-validation in Caret, type in `help(“createFolds”)`

in to the R console.

## Fitting The Model

Now that we’ve split the data in to training and test sets, it’s time to fit the model.

To implement a logistic regression model, we’ll use the generalized linear models (GLM) function, `glm`

.

There’s different types of GLMs, which includes logistic regression. To specify that we want to perform a binary logistic regression, we’ll use the argument `family=binomial`

.

Here’s the full code for fitting the logistic regression model:

```
# fitting the model
fit <- glm(Churn~., data=train, family=binomial)
```

In the next section we’ll make predictions and evaluate our model.

## Making Predictions

Now that we’ve fit our model, it’s time to see how it performs.

To do this, we’ll make predictions using the `test`

data set. We’ll pass in the `fit`

model from the previous section. To predict the probabilities we’ll specify `type=”response”`

.

```
# making predictions
churn.probs <- predict(fit, test, type="response")
head(churn.probs)
```

```
> head(churn.probs)
1 2 3 4 5 6
0.32756804 0.77302887 0.56592677 0.20112771 0.05152568 0.15085976
```

We’ll want to convert these probabilities to a binary response since the `Churn`

variable we’re predicting is “Yes” or “No”.

I’m not sure how R is coding the response, but I can quickly check this using the `contrasts`

function.

```
# Looking at the response encoding
contrasts(df$Churn)
```

```
> contrasts(df$Churn)
Yes
No 0
Yes 1
```

Looking at the result, we can see that `Yes`

is being encoded using `1`

.

Now that we know the response encoding, we can convert our predicted results to `Yes`

and `No`

responses.

We’ll set the response threshold at `0.5`

, so if a predicted probability is above `0.5`

, we’ll convert this response to `Yes`

.

The final step is to convert the character responses to factor types. This is so that the encoding is correct for the logistic regression model.

```
# converting probabilities to "Yes" and "No"
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)
```

We’ll take a closer look at the threshold later, so don’t worry about why we set it at `0.5`

. For now, we’ll just use this value for purposes of this example.

An important part of making predictions is evaluating and validating the model.

Let’s take a detailed look at some evaluation metrics, and finish up with a more rigorous approach to model validation, k-fold cross-validation.

**Evaluating The Model**

After we’ve made predictions, it’s time to evaluate our model.

A great tool to quickly do this is using the `confusionMatrix`

function from Caret.

We’ll feed in our `glm.pred`

array of predictions, as well as the actual results from `test$Churn`

.

Finally, we’ll specify the positive class as “Yes” using `positive=”Yes”`

.

```
# creating a confusion matrix
confusionMatrix(glm.pred, test$Churn, positive = "Yes")
```

```
> confusionMatrix(glm.pred, test$Churn, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 1165 205
Yes 128 262
Accuracy : 0.8108
95% CI : (0.7917, 0.8288)
No Information Rate : 0.7347
P-Value [Acc > NIR] : 4.239e-14
Kappa : 0.4877
Mcnemar's Test P-Value : 3.117e-05
Sensitivity : 0.5610
Specificity : 0.9010
Pos Pred Value : 0.6718
Neg Pred Value : 0.8504
Prevalence : 0.2653
Detection Rate : 0.1489
Detection Prevalence : 0.2216
Balanced Accuracy : 0.7310
'Positive' Class : Yes
```

This function produces both a confusion matrix, as well as other statistics of interest.

A confusion matrix displays how many correct and incorrect predictions were made for each class.

Here’s a quick look at the confusion matrix from our model:

We can see that the model correctly predicted “No” 1165 times, and incorrectly predicted “No” when the actual response was “Yes” 205 times.

Likewise, the model correctly predicted “Yes” when the actual answer was “Yes” 262 times. At the same time, it incorrectly predicted “Yes” when the actual response was “No” 128 times.

Our overall accuracy is 81%. A simple baseline model is to predict the majority class, which in our case is “No”.

If we just predicted the majority class, our accuracy would be 73%. There’s 1760 observations in the test set, and 1293 are “No”. If you divide the 1293 by 1760 that’s how you arrive at 73% accuracy.

Some other useful metrics are **sensitivity** and **specificity**. Since our classes are slightly imbalanced (~73%=“No”, ~27%=“Yes”) these metrics can be more useful.

The **sensitivity** indicates the **“True Positive”** rate. This is a measure of how accurately we predicted the positive class, which in our case is “Yes”.

The `confusionMatrix`

function reports this directly, but if you want to calculate it yourself, divide the “True Positives” by the sum of the “True Positives” and “False Negatives”. Here’s a visual to make that clearer:

Another useful metric is the **specificity**, which indicates the **“True Negative”** rate. This is a measure of how accurately we predicted the negative class. Here’s a visual explaining specificity:

Another useful metric is the Area Under the Receiver Operating Characteristic (ROC) Curve, also known as **AUC**.

Initially implemented during World War II for analyzing radar signals, the ROC is a plot of the **True Positive Rate vs. the False Positive Rate**.

Going back to our original model, I stated that I would use 0.5 as the threshold for making “yes” (or positive) predictions. I didn’t really have a good justification for picking 0.5 though.

The ROC is a great tool because it plots the TPR vs. the FPR as the threshold is varied. We can make a plot of the ROC curve using the ROCR library. Here’s the full R code:

```
library(ROCR)
# need to create prediction object from ROCR
pr <- prediction(churn.probs, test$Churn)
# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
```

As mentioned before, another useful measure is the area under the ROC curve, known as the AUC.

The AUC can take on any value between 0 and 1, with 1 being the best. This is a convenient way to boil down the ROC to a single number for evaluating a model.

Here’s the R code for evaluating our model:

```
# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

```
> auc
[1] 0.8481338
```

Our model has an AUC of 0.85, which is pretty good. If we were to just make random guesses, our ROC would be a 45 degree line. This would correspond to an AUC of 0.5.

We’re outperforming random guessing, so we know that our model is at least adding some value!

**K-fold Cross Validation**

Now that we’ve trained, tested, and evaluated our model, let’s evaluate the model a little more rigorously.

We split our dataset in to training and test datasets, which is a great way to try and prevent overfitting.

An even better approach is using K-fold Cross Validation.

In this model validation technique, we randomly partition the data in to test and training sets by specifying a certain number of “folds”. In the example above, the number of folds is k=4.

After we run the model on each fold, we average the evaluation metric from each one. So if we ran the model four times using ROC, we would average each of the four ROC values together. This is a great way to try and prevent overfitting a model.

A common number of folds to use is 10, so that’s what we’ll use with our data. We’ll also repeat the process 3 times, just to add a little bit more technical rigor to our approach. Here’s the full R-code:

```
# setting a seed for reproduceability
set.seed(10)
# changing the positive class to "Yes"
df$Churn <- as.character(df$Churn)
df$Churn[df$Churn == "No"] <- "Y"
df$Churn[df$Churn == "Yes"] <- "N"
df$Churn[df$Churn == "Y"] <- "Yes"
df$Churn[df$Churn == "N"] <- "No"
# train control
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated 3 times
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# logistic regression model
logreg <- train(Churn ~., df,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "ROC")
logreg
```

```
> logreg
Generalized Linear Model
7043 samples
19 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 6338, 6339, 6338, 6339, 6339, 6339, ...
Resampling results:
ROC Sens Spec
0.8455546 0.5500297 0.89602
```

We start by setting up k-fold CV settings using the `trainControl`

function. All of the inputs are pretty straightforward. As I mentioned previously, we’re using 10 folds, repeated 3 times.

Next we train our model. As we did earlier, we’re using the “binomial” family from the “glm” method. To evaluate our model, we’ll use “ROC”. The model is actually reporting AUC, but the way we specify that in the `train`

function is with `metric=”ROC”`

.

You’ll also notice that I changed the positive class to “Yes” right before the code for the `trainControl`

function. I did this so that I could compare the sensitivity and specificity with our previous results. This is just a minor quirk in the function, so don’t get hung up on this.

The results are similar to what we got previously.

As before, our **AUC is 0.85**. This is reported in the output as ROC, but it’s actually the AUC.

The **true positive rate** (sensitivity) is **0.55** and the **true negative rate** (specificity) is **0.90**.

## Business Impact

So far we’ve used **k-fold cross validation** and **logistic regression** to predict customer churn.

We’ve looked at useful evaluation metrics such as **AUC**, **sensitivity**, and **specificity**.

That’s all well and good, but now what?

If we went to the CEO and presented these results alone, he or she would say **“so what?”**

Our actual goal with developing this model is to show a **business impact**. In our case, this would be **cost savings**.

Let me walk through how we would show a cost savings, step-by-step.

To start, let’s make some assumptions about some various costs.

Let’s assume that **customer acquisition cost** in the telecom industry is approximately **$300**. If we make a prediction that a customer won’t churn, but they actually do (false negative, FN), then we’ll have to go out and spend $300 to acquire a replacement for that customer.

Now let’s assume that it’s 5 times more expensive to acquire a new customer rather than retain an existing one. If we predict that a customer will churn, we’ll need to spend **$60 to retain that customer**.

Sometimes we’ll correctly predict that a customer will churn (true positive, TP), and sometimes we’ll incorrectly predict that a customer will churn (false positive, FP). In both cases, we’ll spend $60 to retain the customer.

Finally, there’s the scenario where we correctly predict that a customer won’t churn (true negative, TN). In this scenario we won’t spend any money. These are happy customers that we’ve correctly identified as happy.

Here’s a quick summary of the costs:

**FN**(predict that a customer won’t churn, but they actually do):**$300****TP**(predict that a customer would churn, when they actually would):**$60****FP**(predict that a customer would churn, when they actually wouldn’t):**$60****TN**(predict that a customer won’t churn, when they actually wouldn’t):**$0**

If we multiply the number of each prediction type (FN, TP, FP, and TN) by the cost associated with each and sum them up, then we can calculate the total cost associated with our model. Here’s what that equation looks like:

Cost = FN($300) + TP($60) + FP($60) + TN($0)

As an example, let’s assume that we make the following number of predictions of each:

- FN = 10
- TP = 5
- FP = 5
- TN = 5

Then our total cost would be **$3600**.

10*($300) + 5*($60) + 5*($60) + 5*($0) = **$3600**

Now let’s apply this cost evaluation to our model.

We’ll start by fitting the model, and making predictions in the form of probabilities.

```
# fitting the logistic regression model
fit <- glm(Churn~., data=train, family=binomial)
# making predictions
churn.probs <- predict(fit, test, type="response")
```

Next, let’s create a threshold vector and a cost vector.

```
# threshold vector
thresh <- seq(0.1,1.0, length = 10)
#cost vector
cost = rep(0,length(thresh))
```

The threshold vector will hold the threshold for each model. Before we were just using 0.5 as the threshold, but let’s look at thresholds in increments of 0.1 between 0 and 1 (e.g., 0.1, 0.2, 0.3…0.9, 1.0).

The cost vector will just be a vector of length 10 with zeros to start. We’ll fill this in as we loop through the function, evaluating each threshold as we go. To evaluate the cost, we’ll use the same cost equation we just went over.

Now we’ll create a for loop to make predictions using the various thresholds, and evaluate the cost of each as we go.

```
# cost as a function of threshold
for (i in 1:length(thresh)){
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > thresh[i]] = "Yes"
glm.pred <- as.factor(glm.pred)
x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost[i] = FN*300 + TP*60 + FP*60 + TN*0
}
```

Rather than use the total number of each outcome for TN, FP, FN, and TP, I used a percentage instead. That’s why I pulled each value out using the confusion matrix and divided by 1760.

There’s 1760 observations in our test set, so that’s where that number comes from. By doing it this way I’m calculating the **cost per customer**.

Now let’s assume that our company is currently using what we’ll call a **“simple model”** which just defaults to a **threshold of 0.5**. We can go ahead and fit that model, make predictions, and calculate the cost. We’ll call this model `cost_simple`

.

```
# simple model - assume threshold is 0.5
glm.pred = rep("No", length(churn.probs))
glm.pred[churn.probs > 0.5] = "Yes"
glm.pred <- as.factor(glm.pred)
x <- confusionMatrix(glm.pred, test$Churn, positive = "Yes")
TN <- x$table[1]/1760
FP <- x$table[2]/1760
FN <- x$table[3]/1760
TP <- x$table[4]/1760
cost_simple = FN*300 + TP*60 + FP*60 + TN*0
```

Finally, we can put all of the results in a dataframe and plot them.

```
# putting results in a dataframe for plotting
dat <- data.frame(
model = c(rep("optimized",10),"simple"),
cost_per_customer = c(cost,cost_simple),
threshold = c(thresh,0.5)
)
# plotting
ggplot(dat, aes(x = threshold, y = cost_per_customer, group = model, colour = model)) +
geom_line() +
geom_point()
```

Looking at the results, we can see that the **minimum cost per customer** is about **$40.00** at a threshold of **0.2**.

The **“simple” model** that our company is currently implementing costs about **$48.00** per customer, at a threshold of **0.50**.

If we assume that we have a customer base of approximately **500,000** the switching from the simple model to the optimized model produces a **cost savings of $4MM**.

```
# cost savings of optimized model (threshold = 0.2) compared to baseline model (threshold = 0.5)
savings_per_customer = cost_simple - min(cost)
total_savings = 500000*savings_per_customer
total_savings
```

```
> total_savings
[1] 4107955
```

This type of cost savings could be a significant business impact depending on the size of the business.

## Conclusion

In this post we developed a machine learning model to predict **customer churn**.

Specifically, we walked through each of the following steps:

- Business Background
- Logistic Regression
- Preparing The Data
- Fitting The Model
- Making Predictions
- Business Impact

We concluded by developing an **optimized logistic regression model** for our customer churn problem.

Assuming the company is using a logistic regression model with a **default threshold of 0.5**, we were able to identify that the **optimum threshold is actually 0.2**.

This reduced cost per customer from **$48.00** to **$40.00**. With a customer base of approximately **500,000** this would produce a yearly **cost savings of $4MM**. This was a significant business impact!

Although this was a purely hypothetical example, it’s very similar to what you will encounter in the real-world.

Being able to effectively identify a problem and show a real **business impact** will set you apart from other data scientists in the job market.

It’s one thing to be able to implement a machine learning model and evaluate it. If you can do all of that and show results in the form of a business impact as we did in this post, then you’ll be well on your way to landing a job!