Predicting bank deposit subscriptions using machine learning
Introduction
The ability to predict customer behaviour has become increasingly important for businesses, especially in the financial sector.
For banks, understanding which clients are likely to subscribe to a term deposit can lead to more effective marketing strategies and better resource allocation.
This analysis explores a beginner’s approach to predicting whether a client will subscribe to a term deposit using machine learning techniques.
In this exploration, we utilise a publicly available dataset from a bank’s marketing campaign. We aim to uncover key insights and build a basic predictive model.
Description of the bank deposit dataset
The dataset was created as part of a research project aimed at developing predictive models for direct marketing campaigns conducted by a Portuguese banking institution.
The campaigns were focused on promoting term deposits to potential clients through phone calls.
Dataset Characteristics
Total Instances (Records): 45,211
Total Attributes (Features): 17
- Input Features: 16
- Output Feature (Target Variable): 1
The dataset for this analysis can be downloaded from the GitHub link provided or from the UCI Machine Learning Repository (see References). The Python code for the analysis is also available for download.
Notable Observations
Imbalance
The dataset is imbalanced, with far more 'no' responses than 'yes' (only around 11% of clients subscribed).
This is an important consideration when building predictive models, as it can affect both performance and the choice of evaluation metrics.
Several techniques (e.g. SMOTE) can be used to address this problem, but they are outside the scope of our main objective for now.
Data Quality
Some features contain 'unknown' categories, which may require preprocessing steps such as imputation or exclusion. The most appropriate method normally depends on the analysis approach; one option is sketched below.
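As an illustration, here is a minimal sketch of one way to handle the 'unknown' categories. It assumes the data has been loaded from the UCI bank-full.csv file (semicolon-separated) into a pandas DataFrame named df; the column names follow the UCI dataset description, and the choice of columns to impute is an assumption for illustration only.

```python
import pandas as pd

# Load the bank marketing data (semicolon-separated in the UCI distribution).
df = pd.read_csv("bank-full.csv", sep=";")

# Option 1: keep 'unknown' as its own category (often reasonable for tree-based models).
# Option 2: treat 'unknown' as missing and impute with the most frequent value,
# illustrated here for two of the affected columns.
for col in ["job", "education"]:
    mode_value = df.loc[df[col] != "unknown", col].mode()[0]
    df[col] = df[col].replace("unknown", mode_value)
```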
Predictive Challenge
The dataset poses a realistic and challenging classification problem, making it suitable for practising and benchmarking various machine learning techniques.
Summary statistics
Numerical Variables
The summary statistics for the numerical variables provide a snapshot of the central tendencies and variability in the data:
| Measure | Age | Balance | Day | Duration | Campaign | Pdays | Previous |
|---------|-----|---------|-----|----------|----------|-------|----------|
| Mean | 41 | 1362.27 | 16 | 258 | 3 | 40 | 1 |
| Std | 11 | 3044.77 | 8 | 258 | 3 | 100 | 2 |
| Min | 18 | -8019.00 | 1 | 0 | 1 | -1 | 0 |
| Max | 95 | 102127.00 | 31 | 4918 | 63 | 871 | 275 |
- Age: The average age is around 41 years, with a minimum of 18 and a maximum of 95.
- Balance: The average balance is about 1362 (the dataset records the average yearly balance in euros), but the large standard deviation indicates significant variability. There are also negative balances, with a minimum value of -8019.
- Day: The “day” column represents the last contact day of the month, with values ranging from 1 to 31.
- Duration: Contact duration varies widely, with an average of 258 seconds and a maximum of nearly 5000 seconds.
- Campaign: The number of contacts performed ranges from 1 to 63, with an average of about 2.76.
- Pdays: The number of days since the client was last contacted in a previous campaign, with a large range (from -1 to 871). The value -1 indicates that the client had not been previously contacted.
- Previous: The number of contacts before this campaign, ranging from 0 to 275.
Categorical Variables
- Job: The most common job types are blue-collar, management, and technician.
- Marital Status: Most clients are married, followed by single and divorced.
- Education: Secondary education is most common, followed by tertiary.
- Default: Almost all clients do not have a credit default.
- Housing: A large number of clients have housing loans.
- Loan: Most clients do not have a personal loan.
- Contact: Most contacts were made by cellular phone; a sizeable share of contact types is recorded as unknown, followed by telephone.
- Month: May is the most common month for contacts, followed by August, July, and June.
- Poutcome: Most clients had no previous outcome recorded (unknown).
- Subscription (y): The target variable shows that a minority of clients (around 11%) subscribed to a term deposit.
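The figures above can be reproduced with a few lines of pandas, assuming the same df DataFrame as in the earlier sketch; this is illustrative rather than the exact code used for the analysis.

```python
# Numerical summary (mean, std, min, max) for the numeric columns.
print(df.describe().T[["mean", "std", "min", "max"]].round(2))

# Frequency counts for each categorical column, including the target 'y'.
categorical_cols = df.select_dtypes(include="object").columns
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())

# Class balance of the target variable (proportion of 'yes' vs 'no').
print(df["y"].value_counts(normalize=True).round(3))
```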
Predictive Analytics
In this analysis, a systematic approach is followed to build and evaluate a predictive model for determining whether a client will subscribe to a term deposit.
The steps are outlined below:
Data Preprocessing
- Encoding Categorical Variables: Categorical variables were converted into numeric format using Label Encoding, making the data suitable for machine learning algorithms (see the sketch after this list).
- Feature Scaling: The numerical features were standardised so that all features contribute on a comparable scale to the model's predictions.
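A minimal sketch of these two steps using scikit-learn is shown below. It assumes the df DataFrame from the earlier sketches; the variable names (X, y, numeric_cols) are illustrative rather than taken from the original code.

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label-encode every categorical column, including the target 'y' (no=0, yes=1).
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Separate features and target.
X = encoded.drop(columns=["y"])
y = encoded["y"]

# Standardise the numerical features so they are on a comparable scale.
# (Strictly, the scaler should be fit on the training split only; it is
# applied to the full feature matrix here to keep the sketch simple.)
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])
```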
Dataset Splitting
- Training and Testing Split: The dataset was split into a training set (70%) and a testing set (30%) to train the model and evaluate its performance on unseen data.
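A sketch of the split, assuming the X and y objects defined above; the stratify option and the random_state value are assumptions, since the original analysis does not state them.

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% testing; stratifying preserves the class imbalance in
# both splits, and a fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```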
Model Selection
- Random Forest Classifier: We chose a Random Forest classifier due to its robustness and ability to handle complex datasets with multiple features.
Model Training
- The model was trained using the training data to learn the relationships between the features and the target variable.
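A minimal training sketch; the hyperparameters shown are illustrative defaults, not necessarily those used in the original analysis.

```python
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```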
Model Evaluation
- Predictions: The model made predictions on the test set.
- Confusion Matrix: A confusion matrix was generated to break the predictions down into True Positives, False Positives, True Negatives, and False Negatives.
- Classification Report: Precision, recall, and F1-score were calculated to assess the model’s performance.
- ROC Curve: The Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) were plotted to evaluate the model’s ability to distinguish between the two classes (yes/no).
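These evaluation steps can be sketched as follows, assuming the fitted model and the test split from the previous sketches; the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Hard predictions and class probabilities for the positive class ('yes').
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["No", "Yes"]))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

# Plot the ROC curve against the diagonal (random-classifier) baseline.
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```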
Predictive Analytics Results
The Random Forest Classifier was used to predict whether a client will subscribe to a term deposit based on the available features.
Here are the results:
Confusion Matrix
This matrix indicates that the model is quite effective at predicting clients who will not subscribe, but less so for those who will.
True Negatives (TN): 11,586
These are the instances where the model correctly predicted that the client would not subscribe to a term deposit. This is a correct rejection.
False Positives (FP): 380
These are the instances where the model predicted that the client would subscribe to a term deposit, but the client actually did not.
This is a type of error called a “false alarm.”
False Negatives (FN): 938
These are the instances where the model predicted that the client would not subscribe to a term deposit, but the client actually did.
This is a type of error called a “miss.”
True Positives (TP): 660
These are the instances where the model correctly predicted that the client would subscribe to a term deposit.
This is a correct identification.
Classification Report
Overall accuracy is 90%, with the model performing much better at identifying clients who will not subscribe (No) compared to those who will (Yes).
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| No | 0.93 | 0.97 | 0.95 | 11,966 |
| Yes | 0.63 | 0.41 | 0.50 | 1,598 |
| Accuracy | | | 0.90 | 13,564 |
| Macro avg | 0.78 | 0.69 | 0.72 | 13,564 |
| Weighted avg | 0.89 | 0.90 | 0.89 | 13,564 |
Precision indicates how many of the predicted “Yes” labels were actually correct. Here, about 63% of the clients predicted to subscribe actually did so.
Recall shows how many of the actual “Yes” cases were correctly identified by the model. The model correctly identified about 41% of the clients who subscribed.
The F1-score for the positive class (“Yes”) is 0.50, indicating a moderate balance between precision and recall.
ROC-AUC Score
The ROC-AUC score of 0.924 indicates that the model has a good ability to distinguish between the two classes (Yes and No).
The ROC curve shows a favourable trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 - specificity), confirming that the model is effective at separating clients who will and will not subscribe to a term deposit.
Summary
Key findings from the analysis include:
- The model achieved an overall accuracy of 90%, with a strong ability to correctly identify clients who would not subscribe.
- The model’s ROC-AUC score of 0.924 indicates a high level of discrimination between those who will and will not subscribe.
- However, the model had a lower recall for predicting positive cases (clients who subscribe), with a recall of 41% and a precision of 63%, indicating room for improvement in identifying potential subscribers.
Recommendations
Improve Model Sensitivity
- Consider SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data, or cost-sensitive learning, to reduce the impact of false negatives and better identify potential subscribers (see the sketch after this list).
- Experiment with other algorithms like Gradient Boosting Machines (GBM), XGBoost, or Neural Networks, which may offer better performance for this classification task.
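A minimal sketch of the first suggestion, covering both oversampling and cost-sensitive learning. It assumes the imbalanced-learn package is installed and reuses the training split from the earlier sketches; the hyperparameters are illustrative.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority class ('yes') in the training data only, never in
# the test data, to avoid leaking synthetic samples into the evaluation.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
model_smote = RandomForestClassifier(n_estimators=100, random_state=42)
model_smote.fit(X_resampled, y_resampled)

# Alternative: cost-sensitive learning via class weights, without resampling.
model_weighted = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model_weighted.fit(X_train, y_train)
```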
Feature Engineering
Creating new features, or interactions between existing features, might improve model performance.
For example, consider the client's behaviour over time or the combination of balance with other financial indicators; a hypothetical sketch follows.
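The feature names and definitions below are illustrative examples, not part of the original analysis; they reuse the df DataFrame from the earlier sketches.

```python
# Hypothetical engineered features for illustration only.
df_fe = df.copy()

# Flag clients who were contacted in a previous campaign (pdays == -1 means never).
df_fe["was_previously_contacted"] = (df_fe["pdays"] != -1).astype(int)

# Combine the loan indicators into a single flag and relate balance to
# campaign intensity (campaign has a minimum of 1, so division is safe).
df_fe["has_any_loan"] = ((df_fe["housing"] == "yes") | (df_fe["loan"] == "yes")).astype(int)
df_fe["balance_per_contact"] = df_fe["balance"] / df_fe["campaign"]
```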
Conclusion
The model achieved a solid accuracy of 90%, indicating that it is effective at identifying clients who are unlikely to subscribe.
However, the model’s precision and recall for clients who do subscribe suggest that there is room for improvement, particularly in reducing the number of false negatives.
By refining the model and exploring alternative algorithms or additional features, banks can further enhance their predictive capabilities, leading to more targeted and efficient marketing campaigns.
Ultimately, this data-driven approach can help financial institutions not only improve their conversion rates but also foster stronger customer relationships through personalised marketing efforts.
REFERENCES
Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306