Sunday, 2 February 2020

Churn prediction for bank customers









Definition


Project Overview

This project is part of the Udacity Data Scientist Nanodegree Program.

In this project, I try to identify which bank customers left the bank (closed their accounts) and which remained customers. I also try to understand the customers' attributes and see whether age or gender affects the decision to close an account.

Problem Statement

I want to decrease the number of customers who leave the bank, but first I have to understand their attributes to find out whether anything in particular leads them to make this decision.

I will clean the data, do some exploratory analysis, and then build a classification model.

My goal for this project is to create a model that helps predict which customers will close their accounts.



Metrics

Accuracy is a common metric for binary classifiers. It is calculated as:

Accuracy = (True Positives + True Negatives) / All Predictions

For that reason, I used accuracy to evaluate my models in this project, and I chose the best model based on the highest accuracy.
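To make the formula concrete, here is a minimal sketch of the calculation. The `accuracy` helper is a hypothetical name introduced for illustration. The worked example assumes a trivial model that labels every customer as retained, using the class counts reported in the data exploration section (2037 closed, 7963 retained).

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (true positives + true negatives) / all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Assumed baseline: predict "retained" (0) for every customer.
# All 7963 retained customers become true negatives,
# and all 2037 customers who left become false negatives.
baseline = accuracy(tp=0, tn=7963, fp=0, fn=2037)
print(round(baseline, 4))  # 0.7963
```

Because the classes are imbalanced, this baseline is a useful reference point when reading the accuracy of any trained model.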


Analysis


Data Exploration / Preparation

In this project, I used the publicly available Churn Modelling dataset, downloaded from Kaggle. The dataset consists of 10000 rows and 14 columns.

Here is the explanation of each variable in the data: 

- RowNumber: Row Numbers from 1 to 10000.
- CustomerId: Unique Ids for bank customer identification.
- Surname: Customer's last name.
- CreditScore: Credit score of the customer.
- Geography: The country the customer belongs to.
- Gender: Male or Female.
- Age: Age of the customer.
- Tenure: Number of years for which the customer has been with the bank.
- Balance: Bank balance of the customer.
- NumOfProducts: Number of bank products the customer is utilising.
- HasCrCard: Binary Flag for whether the customer holds a credit card with the bank or not.
- IsActiveMember: Binary Flag for whether the customer is an active member with the bank or not.
- EstimatedSalary: Estimated salary of the customer in Dollars.
- Exited: Binary flag, 1 if the customer closed their account with the bank and 0 if the customer is retained.

The data was clean and had no missing values. I dropped the RowNumber column because the CustomerId column serves the same purpose, and I dropped the Surname column because it would not help with my analysis.
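The cleaning step described above can be sketched with pandas as follows. The rows are made-up toy values standing in for the Kaggle file, and only a few of the 14 columns are included.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the values are invented.
df = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "CreditScore": [600, 700, 650],
    "Exited": [1, 0, 0],
})

# Count missing values per column to confirm the data is clean.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop the columns that carry no information for the analysis.
df = df.drop(columns=["RowNumber", "Surname"])
print(list(df.columns))
```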




From the summary statistics, the mean age is 39, the maximum is 92, and the minimum is 18.

For Tenure, the mean is 5 years, the maximum is 10, and the minimum is 0, which means less than one year.

For Balance, the mean is 76485 dollars, the maximum is 250898 dollars, and the minimum is 0 dollars.

2037 customers closed their accounts and 7963 remained customers.
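Statistics like these can be obtained with `describe()` and `value_counts()` in pandas. The sketch below uses a few invented rows, so its numbers do not match the figures quoted above.

```python
import pandas as pd

# Invented sample rows; the real dataset has 10000 rows.
df = pd.DataFrame({
    "Age": [25, 40, 52, 33],
    "Tenure": [0, 5, 10, 3],
    "Balance": [0.0, 120000.0, 95000.0, 0.0],
    "Exited": [0, 1, 0, 0],
})

# Mean, minimum, and maximum of the numeric columns.
summary = df[["Age", "Tenure", "Balance"]].describe().loc[["mean", "min", "max"]]
print(summary)

# Number of retained (0) and churned (1) customers.
churn_counts = df["Exited"].value_counts()
print(churn_counts)
```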

Data Visualization

- Which country were most customers from?

Most customers were from France, and most of the customers who closed their accounts were from France and Germany.

- Were most customers male or female?

Male customers outnumbered female customers by 18%, and most of the customers who closed their accounts were female.

- Were most customers active members or not?

51.51% of the customers were active and 48.49% were not, and most of the customers who closed their accounts were not active.

- How many products did customers have?

Most customers had one or two products, and most of the customers who closed their accounts had one product. All of the customers with four products closed their accounts.

- What are the ages of the bank customers?

Most customers were 30 to 40 years old.

- What is the balance of customers who closed their accounts?

Most customers who closed their accounts had a balance between 100000 and 150000 dollars.
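The comparisons above are based on counts. A complementary view is the churn rate within each group, sketched below with invented rows. The `churn_rate_by` helper is a hypothetical name, not part of the original analysis.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Spain", "Germany", "France"],
    "NumOfProducts": [1, 2, 1, 2, 4, 1],
    "Exited": [0, 0, 1, 0, 1, 1],
})


def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of customers with Exited == 1 within each value of `column`."""
    return frame.groupby(column)["Exited"].mean()


print(churn_rate_by(df, "Geography"))
print(churn_rate_by(df, "NumOfProducts"))
```

Rates can tell a different story than counts: a country with the most customers may also have the most closed accounts simply because of its size.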

Methodology

Data preprocessing

For preprocessing, I had to re-encode the categorical features. For the Gender column, I converted Female and Male to 0 and 1, and I used one-hot encoding to convert the Geography column.
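A minimal sketch of this encoding with pandas is shown here. It assumes `map` for the Gender column and `pd.get_dummies` for the one-hot encoding, which may differ from the exact functions used in the original notebook.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Geography": ["France", "Germany", "Spain"],
})

# Binary encoding: Female -> 0, Male -> 1.
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})

# One-hot encoding: one indicator column per country.
df = pd.get_dummies(df, columns=["Geography"])
print(list(df.columns))
```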


Implementation / Evaluation

To build the models, I used two classification techniques: naive Bayes and random forest. I also decided to drop the CustomerId column before building the models because I think it would not help with prediction.

The random forest model achieved an accuracy of 0.9857 on the training data and 0.8597 on the testing data. The naive Bayes model achieved 0.7823 on the training data and 0.7937 on the testing data.
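The training and evaluation loop can be sketched with scikit-learn as follows. Several details are assumptions, since the post does not state them: the Gaussian variant of naive Bayes, the 80/20 split, the hyperparameters, and the synthetic data from `make_classification` used in place of the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed bank data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record accuracy on both splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

Reporting both training and testing accuracy makes it easier to spot a model that fits the training data much better than unseen data, which can be a sign of overfitting.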


Reflection

Based on the accuracy, I decided to use the random forest model. I think it is the better choice because it can handle data sets with high dimensionality and can identify the most significant variables among thousands of input variables.


I also used the feature_importances_ attribute to determine the five features that provide the most predictive power. NumOfProducts is the most predictive feature, followed by Balance. I think there may be competitors offering better services and offers to customers with high balances and more products.
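Ranking features by importance might look like the sketch below. It uses scikit-learn's `feature_importances_` attribute on a fitted random forest. The data is synthetic and the feature names are placeholders, so the ranking says nothing about the bank dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and keep the five largest.
importances = pd.Series(model.feature_importances_, index=feature_names)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)
```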



Conclusion

In this project, I analyzed the data and built a model to predict which bank customers left the bank (closed their accounts) and which remained customers.

First, I explored and prepared the data to see what I had to change before starting the analysis. Then I did some exploratory analysis and visualization of the data. After that, I built two models and chose the one with the highest accuracy.

From that analysis, I found that most customers who closed their accounts were from France and Germany. I think the bank should focus on marketing in those countries, check whether there are competitors, and try to provide better services. It should also try to find out why the customers with three and four products closed their accounts.

Improvements


To increase the accuracy, I tried to improve the model using GridSearchCV, varying parameters such as max_depth and n_estimators. The resulting accuracy was 0.8657.

The accuracy did not improve much, perhaps because the random forest is a slow learner. Having more data is also always helpful for improving model results.
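A grid search of this kind can be sketched as follows. The grid values, the 3-fold cross-validation, and the synthetic data are assumptions for illustration; the post names only the max_depth and n_estimators parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed bank data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed grid; the values tried in the original project are not given.
param_grid = {"max_depth": [4, 8, None], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
# score() evaluates the refitted best model on held-out data.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```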

For more details on the analysis, data, and code, see the GitHub repository here.
