How to clean a messy HR dataset using Python
Introduction
Dealing with messy datasets is a common challenge when working with real-world data as a Data Scientist or Data Analyst.
In this exercise, we’ll walk through the process of cleaning a typical messy Human Resources (HR) dataset using Python programming.
Overview of the dataset
The HR dataset we are working with represents a typical real-world dataset that hasn’t been carefully curated.
You can download the messy HR dataset using this GitHub link.
Here are the specific issues present in the dataset:
1. Inconsistent formatting
- Numerical Columns with Text Entries: Columns like ‘Age’ and ‘Salary’ contain both numeric and text entries. For example, the ‘Age’ column includes values such as “30”, “twenty-five”, and “thirty”, which should all be numeric. Similarly, the ‘Salary’ column includes numbers written as words, like “SIXTY THOUSAND”, alongside numeric values like “60000”.
- Spaces in Text Fields: Several text-based columns, such as ‘Name’, ‘Department’, and ‘Position’, have extra leading or trailing spaces.
2. Incorrect data types
- Dates Stored as Strings: The ‘Joining Date’ column, which should be in a date format, is stored as a string with varying formats. For instance, you will encounter dates like “2021-01-15”, “15/01/2021”, and “January 15, 2021” within the same column, making it difficult to perform time-based analysis or comparisons.
3. Missing values
- Critical Columns with Missing Data: Several essential columns, such as ‘Email’ and ‘Phone Number’, have missing entries. This can lead to inaccurate analyses if not handled properly.
4. Placeholder and invalid data
- Incorrect Placeholders: Some fields contain invalid placeholders that should not be present. For example, the ‘Salary’ column has placeholders like “NAN” instead of proper null values or numeric entries, which can lead to errors during numerical calculations.
- Inconsistent or Incorrect Phone Numbers and Emails: Contact information fields like ‘Phone Number’ and ‘Email’ contain invalid or placeholder data, which need to be identified and cleaned for accurate record-keeping.
Step 1: Loading the messy dataset
First, let’s load the dataset using pandas and numpy, which are powerful libraries in Python for data manipulation.
import pandas as pd
import numpy as np
# Load the messy data
file_path_large_messy = '/path_to/messy_HR_data.csv'
df_large_messy = pd.read_csv(file_path_large_messy)
NB: Please replace the “path_to” with the actual location of the stored messy data on your device.
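Before cleaning anything, it’s worth taking a quick look at what we’ve loaded. The optional check below simply prints the column types and the first few rows so you can see the issues described earlier for yourself.
# Optional: inspect the raw data before cleaning
df_large_messy.info()          # column names, data types and non-null counts
print(df_large_messy.head())   # first five rows, showing the messy values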
Step 2: Stripping extra spaces
Because this HR dataset hasn’t been carefully curated, its text fields contain extra spaces around the values, and these can cause problems during analysis. Let’s remove those unnecessary spaces.
# Strip leading and trailing spaces from string columns
for column in df_large_messy.select_dtypes(include=['object']).columns:
    df_large_messy[column] = df_large_messy[column].str.strip()
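If a column also contains doubled internal spaces (for example a name stored as “John  Smith”), a regular-expression replacement can collapse them as well. This is an optional extra, assuming such values exist in your copy of the data.
# Optional: collapse repeated internal whitespace to a single space
for column in df_large_messy.select_dtypes(include=['object']).columns:
    df_large_messy[column] = df_large_messy[column].str.replace(r'\s+', ' ', regex=True).str.strip()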
Step 3: Correcting the “Age” column
In the ‘Age’ column, you will encounter textual representations of numbers (like ‘thirty’ instead of 30), which means the column is not stored as a proper numeric type. We’ll standardise it by converting all values to numeric and handling any errors.
# Correct the Age column
df_large_messy['Age'] = df_large_messy['Age'].replace('thirty', 30)
df_large_messy['Age'] = pd.to_numeric(df_large_messy['Age'], errors='coerce')
# Fill NaN values in the Age column before converting to int
df_large_messy['Age'] = df_large_messy['Age'].fillna(0).round(0).astype(int) # Remove decimals and convert to int
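The dataset description above also mentions spelled-out values such as “twenty-five”. If your copy of the data contains several of these, a small lookup dictionary is an alternative to the single replace() call above; the sketch below covers only the two words mentioned earlier, so extend it to whatever actually appears in your file.
# Sketch: map spelled-out ages to numbers before coercing to numeric
age_words = {'twenty-five': 25, 'thirty': 30}  # extend with any other spelled-out values in your data
df_large_messy['Age'] = df_large_messy['Age'].replace(age_words)
df_large_messy['Age'] = pd.to_numeric(df_large_messy['Age'], errors='coerce').fillna(0).astype(int)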
Step 4: Cleaning the “Salary” column
Similarly, the ‘Salary’ column contains inconsistent data, such as numbers written as words and placeholders for missing values. Let’s clean these up.
# Correct the Salary column
# Correct the Salary column
df_large_messy['Salary'] = df_large_messy['Salary'].str.replace('SIXTY THOUSAND', '60000')
df_large_messy['Salary'] = df_large_messy['Salary'].str.replace('NAN', 'NaN')  # the surrounding spaces were already stripped in Step 2
df_large_messy['Salary'] = pd.to_numeric(df_large_messy['Salary'], errors='coerce')
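As with ‘Age’, a small mapping dictionary scales better than individual str.replace() calls if several amounts are written as words. The sketch below assumes a second variant, “FIFTY THOUSAND”, purely for illustration; adjust the mapping to the values in your file.
# Sketch: replace several word amounts in one pass (the 'FIFTY THOUSAND' entry is an assumed example)
salary_words = {'SIXTY THOUSAND': '60000', 'FIFTY THOUSAND': '50000'}
df_large_messy['Salary'] = df_large_messy['Salary'].replace(salary_words)
df_large_messy['Salary'] = pd.to_numeric(df_large_messy['Salary'], errors='coerce')  # 'NAN' and other leftovers become NaN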
Step 5: Standardising date formats
Dates in this dataset appear in inconsistent formats. We’ll standardise the ‘Joining Date’ column to ensure it’s in a consistent date format.
# Standardise the Joining Date column
df_large_messy['Joining Date'] = pd.to_datetime(df_large_messy['Joining Date'], errors='coerce')
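One caveat: without an explicit format, pandas infers a single date format and, with errors='coerce', silently turns entries written in other styles into NaT. If you are on pandas 2.0 or later, passing format='mixed' asks pandas to parse each entry individually; the sketch below shows that variant.
# Sketch (pandas >= 2.0): parse each date individually instead of assuming one format
df_large_messy['Joining Date'] = pd.to_datetime(
    df_large_messy['Joining Date'], format='mixed', errors='coerce'
)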
Step 6: Handling missing values
Missing values can distort analysis, so we need to handle them carefully. We will fill in missing numerical data with the mean and categorical data with the most frequent value or a placeholder.
# Fill NaN values with a default value or strategy
# (direct assignment is used instead of inplace=True, which triggers chained-assignment warnings in recent pandas)
df_large_messy['Age'] = df_large_messy['Age'].fillna(df_large_messy['Age'].mean())
df_large_messy['Salary'] = df_large_messy['Salary'].fillna(df_large_messy['Salary'].mean())
# Fill NaN values for categorical columns
df_large_messy['Gender'] = df_large_messy['Gender'].fillna('Unknown')
df_large_messy['Department'] = df_large_messy['Department'].fillna(df_large_messy['Department'].mode()[0])
df_large_messy['Position'] = df_large_messy['Position'].fillna(df_large_messy['Position'].mode()[0])
df_large_messy['Performance Score'] = df_large_messy['Performance Score'].fillna('Unknown')
df_large_messy['Email'] = df_large_messy['Email'].fillna('no_email@example.com')
df_large_messy['Phone Number'] = df_large_messy['Phone Number'].fillna('000-000-0000')
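As a quick sanity check, you can confirm that the fills worked by counting the remaining missing values per column; every count should now be zero.
# Optional check: count remaining missing values per column
print(df_large_messy.isnull().sum())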
Step 7: Saving the cleaned dataset
Finally, after all the cleaning, we save the cleaned dataset to a new file for further analysis.
# Save the cleaned data to a new CSV file
df_large_messy.to_csv('/path_to/cleaned_messy_HR_data.csv', index=False)
NB: Please replace the “path_to” with the actual location where you would like the cleaned dataset to be stored on your device.
Summary of data cleaning steps
Loading the Dataset
- Import necessary libraries (pandas, numpy).
- Load the dataset from a CSV file into a pandas DataFrame.
Trimming Extra Spaces
- Remove leading and trailing spaces from string columns using the str.strip() method.
Correcting Data Types
- Convert text representations in the ‘Age’ and ‘Salary’ columns to numeric using pd.to_numeric(), handling non-numeric values gracefully.
- Replace specific string representations (e.g., replacing “thirty” with 30 in the ‘Age’ column).
Standardising Date Formats
- Convert the ‘Joining Date’ column to a uniform datetime format using pd.to_datetime(), handling errors to avoid data loss.
Handling Categorical Variables
- Standardise categorical entries (e.g., ‘Department’, ‘Position’) by formatting strings for consistency, such as converting them to title case; a short sketch follows this summary.
Dealing with Missing Values
- Identify and fill missing numerical values using the mean or median of the column.
- Fill missing categorical values with the mode of the column or a designated placeholder (e.g., ‘Unknown’ for ‘Gender’).
Final Data Validation
- Perform final checks to ensure that all columns are in the correct format and that there are no remaining null values.
- Use df.info() and df.head() to visually inspect the cleaned dataset; a short sketch follows this summary.
Saving the Cleaned Data
- Save the cleaned dataset back to a CSV file for further analysis or operational use, ensuring no index column is included.
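The ‘Handling Categorical Variables’ and ‘Final Data Validation’ steps listed above were not shown in the walkthrough, so here is a minimal sketch of both. It assumes ‘Position’ is a column you want to title-case; choose the columns that suit your data.
# Standardise categorical text for consistency (sketch; adjust the column list to your data)
# Note: str.title() would turn an acronym such as 'HR' into 'Hr', so apply it only where that is acceptable
for column in ['Position']:
    df_large_messy[column] = df_large_messy[column].str.title()

# Final validation: confirm the column types and visually inspect the result
df_large_messy.info()
print(df_large_messy.head())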
You can download the cleaned HR dataset using this GitHub link.
Below is just a snippet of the cleaned dataset:
Name | Age | Salary | Gender | Department | Position | Joining Date | Performance Score | Email | Phone Number
---|---|---|---|---|---|---|---|---|---
grace | 25 | 50000.0 | Male | HR | Manager | 2018-04-05 | D | email@example.com | 000-000-0000
david | 0 | 65000.0 | Female | Finance | Director | | F | user@domain.com | 123-456-7890
hannah | 35 | 60000.0 | Female | Sales | Director | | C | email@example.com | 098-765-4321
eve | 0 | 50000.0 | Female | IT | Manager | 2018-04-05 | A | name@company.org | 
grace | 0 | 60216.08643457383 | Female | Finance | Manager | | F | name@company.org | 098-765-4321
jack | 0 | 65000.0 | Other | Marketing | Director | | F | user@domain.com | 000-000-0000
charlie | 0 | 50000.0 | Male | Marketing | Clerk | | B | no_email@example.com | 123-456-7890
grace | 40 | 50000.0 | Other | HR | Director | | C | no_email@example.com | 
hannah | 40 | 60000.0 | Female | Marketing | Manager | | C | user@domain.com | 123-456-7890
Conclusion
Cleaning a messy dataset is a critical step in the data analysis process. By using Python and pandas, you can efficiently handle common issues like inconsistent formatting, missing values, and incorrect data types.
This tutorial has walked you through the essential steps to transform a messy HR dataset into a clean, usable format ready for analysis. With these skills, you’ll be better equipped to handle real-world data challenges in your projects.