Sunday, February 2, 2020

Churn prediction for bank customers









Definition


Project Overview

This project is part of the Udacity Data Scientist Nanodegree program.

In this project, I will try to identify which bank customers left the bank (closed their accounts) and which are still customers. I will also examine the customers' attributes to see whether factors such as age or gender influence the decision to close an account.

Problem Statement

I want to decrease the number of customers who leave the bank, but first I have to understand their attributes to see whether anything in particular drives this decision.

I will clean the data, then do some exploratory analysis, and after that build a classification model.

My goal for this project is to create a model that helps us predict which customers will close their accounts.



Metrics

Accuracy is a common metric for binary classifiers, calculated as: 

Accuracy = (True Positives + True Negatives) / All Predictions

For that reason, I used accuracy to evaluate my models in this project, and I chose the best model based on the highest accuracy.
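As a sketch, this ratio can be checked against scikit-learn's built-in metric; the labels below are made up for illustration, not from the churn data:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Unpack the confusion matrix and apply the formula above
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

# accuracy_score computes the same ratio directly
assert manual_accuracy == accuracy_score(y_true, y_pred)
print(manual_accuracy)  # 0.75
```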


Analysis


Data Exploration / Preparation 

In this project, I used publicly available Churn Modelling data downloaded from Kaggle. The dataset consists of 10,000 rows and 14 columns.

Here is the explanation of each variable in the data: 

- RowNumber: Row Numbers from 1 to 10000.
- CustomerId: Unique Ids for bank customer identification.
- Surname: Customer's last name.
- CreditScore: Credit score of the customer.
- Geography: The country from which the customer belongs.
- Gender: Male or Female.
- Age: Age of the customer.
- Tenure: Number of years for which the customer has been with the bank.
- Balance: Bank balance of the customer.
- NumOfProducts: Number of bank products the customer is utilising.
- HasCrCard: Binary Flag for whether the customer holds a credit card with the bank or not.
- IsActiveMember: Binary Flag for whether the customer is an active member with the bank or not.
- EstimatedSalary: Estimated salary of the customer in Dollars.
- Exited: Binary flag; 1 if the customer closed their account with the bank and 0 if the customer is retained.

The data was clean and had no missing values. I dropped the RowNumber column because the CustomerId column serves the same purpose, and I also dropped the Surname column because it will not help with my analysis.
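A minimal sketch of this cleaning step on a toy frame (the column names follow the Kaggle dataset, but the rows here are invented):

```python
import pandas as pd

# Toy stand-in for the Kaggle Churn_Modelling.csv, with a subset of its columns
df = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"],
    "CreditScore": [619, 608],
    "Exited": [1, 0],
})

# Confirm there are no missing values, then drop the two identifier columns
assert df.isnull().sum().sum() == 0
df = df.drop(columns=["RowNumber", "Surname"])
print(df.columns.tolist())  # ['CustomerId', 'CreditScore', 'Exited']
```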




From the description above, you can see that the mean age is 39, the maximum is 92, and the minimum is 18.

For Tenure, the mean is 5 years, the maximum is 10, and the minimum is 0, which means less than one year.

For Balance, the mean is 76,485 dollars, the maximum is 250,898 dollars, and the minimum is 0 dollars.
2,037 customers closed their accounts, while 7,963 are still customers.
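These figures come from the usual pandas summary calls; a toy sketch of the same idea (the real numbers above come from the full dataset):

```python
import pandas as pd

# A few invented rows; in the project these stats come from the full dataset
df = pd.DataFrame({"Age":    [18, 39, 92, 30, 41],
                   "Exited": [1, 0, 0, 0, 1]})

# describe() gives mean/min/max; value_counts() gives closed vs retained counts
print(df["Age"].describe()[["mean", "min", "max"]])
print(df["Exited"].value_counts())
```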

Data Visualization

- Which country were most customers from?




Most customers were from France, and most of the customers who closed their accounts were from France and Germany.

- Were most customers male or female?


Male customers outnumber female customers by 18%, and most of the customers who closed their accounts were female.

- Were most customers active members?


51.51% of the customers were active and 48.49% were not, and most of the customers who closed their accounts were not active.

- How many products do customers have?


Most customers have one or two products. Most of the customers who closed their accounts had one product, and all of the customers with four products closed their accounts.

- What are the ages of the bank customers?


Most of the customers were 30 to 40 years old.

- What is the balance of customers who closed their accounts?


Most customers who closed their accounts had a balance between 100,000 and 150,000 dollars.

Methodology

Data preprocessing

For preprocessing, I had to re-encode the categorical features: for the Gender column I converted Female and Male to 0 and 1, and I used one-hot encoding to convert the Geography column as shown below.
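A minimal sketch of this re-encoding, assuming a pandas DataFrame with the same column names (the rows here are invented):

```python
import pandas as pd

# Toy frame with the two categorical columns from the churn data
df = pd.DataFrame({
    "Gender":    ["Female", "Male", "Male"],
    "Geography": ["France", "Spain", "Germany"],
})

# Map Gender to 0/1
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})

# One-hot encode Geography into one indicator column per country
df = pd.get_dummies(df, columns=["Geography"])
print(df.columns.tolist())
# ['Gender', 'Geography_France', 'Geography_Germany', 'Geography_Spain']
```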


Implementation / Evaluation 

To build the models, I used two classification techniques: naive Bayes and random forest. I also decided to drop the CustomerId column before building the models because I don't think it will help with prediction.

For the random forest model, I got an accuracy of 0.9857 on the training data and 0.8597 on the testing data; for the naive Bayes model, I got 0.7823 on the training data and 0.7937 on the testing data.
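The training procedure can be sketched like this, with synthetic data standing in for the prepared churn features (scores on this toy data will differ from the figures above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary-classification data in place of the real churn features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit both classifiers and report train/test accuracy for each
for model in (GaussianNB(), RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train:", round(model.score(X_train, y_train), 4),
          "test:", round(model.score(X_test, y_test), 4))
```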


Reflection

Based on accuracy, I decided to use the random forest model. I think it is the better choice because it can handle datasets with higher dimensionality and can identify the most significant variables among thousands of input variables.


I also used the feature_importances_ attribute to determine the five features that provide the most predictive power. As you can see below, NumOfProducts is the feature that helps most with prediction, followed by Balance. I think there may be competitors giving better services and offers to customers with high balances and more products.
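A sketch of reading the importances, with synthetic data in place of the real churn features (the ranking on toy data will not match the real one):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names from the churn dataset; the data itself is synthetic here
feature_names = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts",
                 "HasCrCard", "IsActiveMember", "EstimatedSalary"]
X, y = make_classification(n_samples=300, n_features=len(feature_names),
                           random_state=0)

# Fit a forest and pair each importance score with its feature name
forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=feature_names)

# The five features with the most predictive power
print(importances.sort_values(ascending=False).head(5))
```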



Conclusion

In this project, I tried to analyze the data and build a model to predict which bank customers left the bank (closed their accounts) and which are still customers.

First, I explored and prepared the data to see what I had to change before starting the analysis. Then I did some exploratory analysis and visualization. After that, I built two models and chose the one with the highest accuracy.

From that analysis, I found that most of the customers who closed their accounts were from France and Germany. I think the bank should focus its marketing on those countries, see whether there are competitors, and try to provide better services, and it should also try to find out why the customers with three and four products closed their accounts.

Improvements


To raise the accuracy, I tried to improve the model by using GridSearchCV over different parameters such as max_depth and n_estimators; the accuracy I got is 0.8657.
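A sketch of that tuning step on synthetic data (the parameter values below are examples, not the exact grid I searched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data in place of the prepared churn features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Example grid over the two parameters mentioned above
param_grid = {"max_depth": [5, 10, None], "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

# Best parameter combination and its cross-validated accuracy
print(search.best_params_, round(search.best_score_, 4))
```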

It hasn't improved much, and maybe that is because the random forest learns slowly; also, having more data is always good for improving the model's results.

To see a more detailed version of the analysis with the data and code, check the GitHub repository here.

Monday, November 18, 2019



Car Evaluation









Introduction:

This project (write a data science blog post) is part of the Udacity Data Scientist Nanodegree program.

Nowadays you cannot go anywhere without a car ride, and sometimes you cannot use a taxi or Uber to reach your destination, so choosing the best car for your essential needs is extremely important.

In this project, I analyzed publicly available Car Evaluation data downloaded from the UC Irvine Machine Learning Repository. The dataset consists of 1,728 rows and 7 columns: buying, maintenance, doors, persons, luggage boot, safety, and the class column. The class column has 4 values: Unacceptable, Acceptable, Good, and Very good; 70.023% of the data are Unacceptable, 22.222% are Acceptable, 3.993% are Good, and 3.762% are Very good.

So, I have tried to analyze and answer  the following questions:

1 - Are the very good cars expensive?
2 - Is safety high only in the expensive cars?
3 - Does the buying price of a car affect the maintenance price?

Part 1. Business and Data Understanding and Prepare Data:



The data has no missing values, and all the features are categorical, so I had to re-encode them.
After that, I decided to divide the data into 4 groups based on the classes, which helps me answer my questions.
Here is what I got:
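A minimal sketch of the re-encoding and the split by class, on a few invented rows using the UCI column names:

```python
import pandas as pd

# Toy rows mimicking the Car Evaluation attributes (all categorical)
df = pd.DataFrame({
    "buying": ["vhigh", "low", "med"],
    "safety": ["low", "high", "med"],
    "class":  ["unacc", "vgood", "acc"],
})

# Re-encode every column with integer category codes
encoded = df.apply(lambda col: col.astype("category").cat.codes)

# Split the original frame into one group per class label
groups = {label: part for label, part in df.groupby("class")}
print(encoded.dtypes.unique(), list(groups))
```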

What is really interesting is that for all the very good cars the buying price is low or medium, yet their safety is high; the maintenance price is also low or medium, although for some of them it is high.

For the Unacceptable cars, the buying price of most of them is high and the safety is low; it is the only class with low safety compared to the other classes, and there is no difference in the maintenance price.





The good cars are similar to the very good ones, but their safety is medium more often than high.
For the Acceptable cars, the buying price of most of them is medium and the safety is high, and there is no difference in the maintenance price.

Part 2. Data Modeling and Evaluate the Results:




Finally, I built a model to help us evaluate cars by predicting the best car. I used all the features, and what is really interesting, as you can see in the bar chart below, is that the size of the luggage boot is the most important feature for prediction, while safety is the least important.



The model performs well, with an accuracy of 0.9326 on the test set.
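A sketch of the train/test evaluation, with synthetic multi-class data standing in for the encoded car attributes (the accuracy on toy data will differ from the figure above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: six features (like the encoded car attributes),
# three classes in place of the four evaluation labels
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Fit on the training split and score on the held-out test split
clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(round(test_accuracy, 4))
```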

Conclusion:


At first, I thought that a high price meant high safety, but based on the data, most of the cars with a high price are evaluated as Unacceptable and their safety is low, while the very good cars are low in price with high safety.
But I think the data sample is small, so we cannot make a decision based on it.


And from your point of view, what affects the evaluation of cars?

To see a more detailed version of the analysis with the data and code, check the GitHub repository here.