
How to clean a messy warehouse dataset using Python: A step-by-step tutorial

Introduction

Dealing with messy data is a common challenge in data analysis, and unresolved data quality issues can significantly distort your results.

Cleaning the data is the first and most crucial step toward obtaining reliable insights.

The dataset we will use in this tutorial contains information about a warehouse inventory, but it is plagued with issues such as inconsistent formatting, missing values, and incorrect data types.

You will be guided through each step of the cleaning process, using a Python script to transform this messy data into a clean and usable dataset.


Overview of the warehouse dataset

The dataset we are working with consists of 1,000 rows (one record per inventory item) and 10 columns.

The dataset can be downloaded using this GitHub link.

Each record has multiple attributes, such as the product name, category, quantity, price, warehouse location, supplier, last restocked date, and status.

However, the dataset needs cleaning up due to several issues, including:

  • Inconsistent Text Formatting: The Product Name and Category columns contain inconsistent use of uppercase and lowercase letters, making it difficult to group similar items.
  • Leading and Trailing Spaces: Several string columns have leading and trailing spaces, which can cause issues when performing operations like filtering or grouping data.
  • Incorrect Data Types: The Quantity and Price columns, which should be numeric, contain text entries and are stored as strings. This prevents numerical operations and analyses.
  • Invalid Values: Some entries in the Quantity, Price, and Last Restocked columns are marked as ‘NaN’, and the Quantity column even has a value recorded as ‘two hundred’.
  • Date Formatting Issues: The Last Restocked column contains dates in inconsistent formats, which can lead to errors in date-related analyses.

Step-by-step guide to cleaning the dataset

1. Loading the Dataset

The first step is to load the messy dataset into a Pandas DataFrame. This allows us to examine the data and identify the issues that need to be addressed.
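A minimal sketch of this step, assuming the downloaded file is saved locally as messy_warehouse_data.csv (the file name is an assumption; use whatever name your downloaded copy has):

```python
import pandas as pd

# Load the messy dataset into a DataFrame
# (replace "path_to" with the folder where you saved the file)
df = pd.read_csv("path_to/messy_warehouse_data.csv")

# A first look at the data helps surface the issues listed above
df.info()          # column dtypes and non-null counts
print(df.head())   # first few rows
```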

NB: Please replace the “path_to” with the actual location of the stored messy data on your device.

2. Stripping Leading and Trailing Spaces

Data often contains unnecessary spaces that can lead to mismatches and errors in analysis. We will strip any leading or trailing spaces from all string columns.
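One way to do this with Pandas, assuming df is the DataFrame loaded in the previous step:

```python
# Strip leading and trailing spaces from every string (object) column;
# missing values (NaN) pass through untouched
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()
```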

3. Standardising Text Formats

Inconsistent text formatting can make it difficult to group and analyse data. Here, we standardise the ‘Product Name’ and ‘Category’ columns by converting them to proper case and capitalising the first letter, respectively.
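A sketch of this step; the column labels ‘Product Name’ and ‘Category’ are taken from the overview above, but their exact spelling in your copy of the file may differ:

```python
# Product Name -> proper (title) case, e.g. "steel BOLTS" becomes "Steel Bolts"
df["Product Name"] = df["Product Name"].str.title()

# Category -> capitalise only the first letter, e.g. "ELECTRONICS" becomes "Electronics"
df["Category"] = df["Category"].str.capitalize()
```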

4. Correcting and Converting Data Types

Data types must be correct for proper analysis. We need to replace incorrect entries, convert text-based numbers to numeric types, and ensure that dates are in the correct format.
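A hedged sketch of how this can be done with Pandas, handling the ‘two hundred’ entry noted in the overview and coercing anything else that cannot be parsed:

```python
# Replace the text entry noted above before converting to numbers
df["Quantity"] = df["Quantity"].replace({"two hundred": 200})

# Convert text-based numbers to numeric types; unparseable values become NaN
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Parse the dates; values that cannot be parsed become NaT
# (on pandas 2.x you may need format="mixed" for inconsistent date formats)
df["Last Restocked"] = pd.to_datetime(df["Last Restocked"], errors="coerce")
```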

5. Handling Missing Values

Missing values can distort your analysis, so it’s crucial to handle them appropriately. We will fill numeric columns with the mean, and categorical columns with the most frequent value or a placeholder.
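A sketch of this strategy, assuming the numeric columns are ‘Quantity’ and ‘Price’ and that every remaining object column is categorical:

```python
# Numeric columns: fill missing values with the column mean
for col in ["Quantity", "Price"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical (string) columns: fill with the most frequent value,
# falling back to a placeholder if a column has no mode at all
for col in df.select_dtypes(include="object").columns:
    mode = df[col].mode()
    df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
```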

6. Saving the Cleaned Dataset

Finally, once the data has been cleaned, we save the cleaned dataset to a new CSV file.
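A minimal sketch of this step; the output file name cleaned_warehouse_data.csv is an assumption and can be anything you like:

```python
# Save the cleaned data to a new CSV file
# (replace "path_to" with the folder where the output should go)
df.to_csv("path_to/cleaned_warehouse_data.csv", index=False)
```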

NB: Please replace the “path_to” with the actual location where you would like the cleaned dataset to be stored on your device.

The cleaned dataset can be downloaded using this GitHub link.


Summary of steps taken

  • Loaded the messy dataset into a Pandas DataFrame.
  • Stripped leading and trailing spaces from string columns.
  • Standardised text formats in the ‘Product Name’ and ‘Category’ columns.
  • Corrected and converted data types for the ‘Quantity’, ‘Price’, and ‘Last Restocked’ columns.
  • Handled missing values by filling them with appropriate defaults.
  • Saved the cleaned data into a new CSV file.

Conclusion

Cleaning a dataset is an essential step in data analysis, ensuring that your data is reliable and ready for further analysis.

By following the steps outlined in this tutorial, you can effectively clean messy datasets and transform them into valuable assets for your data projects.

Python, with its powerful Pandas library, provides a robust toolkit for tackling a wide range of data cleaning tasks.

Keep practising with different datasets to hone your skills and become proficient in data cleaning.
