How to clean a messy IMDB movies data using Python
Introduction
In the world of data science and analytics, the quality of your analysis is only as good as the quality of your data.
Unfortunately, real-world datasets are often messy, requiring extensive cleaning before you can extract meaningful insights.
In this tutorial, we’ll walk through a step-by-step process to clean a typical messy IMDB dataset using Python programming.
Overview of the IMDB movies data
This dataset includes 100 movies from the IMDb database and features 11 variables: IMDb movie ID, original title, release year, genre, duration, country, content rating, director’s name, worldwide income, number of votes, and IMDb score.
It is a disorganised dataset with numerous issues that need to be addressed, such as missing values, empty rows and columns, poorly named variables, inconsistent or incorrect date formats, numeric columns containing symbols, units, characters, thousand separators, multiple and incorrect decimal separators, typographical errors, and a multi-category variable improperly coded as a single character variable.
You can download the messy HR dataset using this GitHub link and kaggle.
Key observations
- Column Names: Some column names contain unexpected characters, likely due to encoding issues (e.g., “Original titlÊ” and “Genr먔).
- Missing Values: There are missing values in columns like “Duration” and “Content Rating.” Additionally, there is an “Unnamed: 8” column that is entirely empty.
- Data Types: Some columns (e.g., “Release year,” “Income,” “Votes,” and “Score”) are currently treated as object types, though they should be numeric or categorical.
- Inconsistent Formatting: Dates in the “Release year” column and numeric fields like “Votes” and “Income” appear to have inconsistent formatting.
Suggested steps for cleaning
- Rename Columns: Correct any issues with column names.
- Drop Unnecessary Columns: Remove the “Unnamed: 8” column, as it contains no useful information.
- Handle Missing Values: Address missing values through imputation or removal, depending on the column’s relevance.
- Convert Data Types: Convert columns like “Release year,” “Income,” “Votes,” and “Score” to appropriate data types.
- Standardised Formatting: Clean and standardize date formats and numeric fields.
- Remove or Correct Invalid Data: Identify and fix any invalid entries.
- The “Votes” column appears to have extra spaces in its name and needs fixing.
Step-by-step data cleaning using Python
1. Loading the Dataset
Let’s start by importing pandas.
import pandas as pd
Loading the dataset correctly is the first critical step. The dataset had encoding issues, so let’s use the ISO-8859-1 encoding to avoid errors and load the data accurately.
# Load the dataset with the correct encoding
df = pd.read_csv('/path_to/messy_IMDB_dataset.csv', encoding='ISO-8859-1', sep=None, engine='python')
NB: Please replace the “path_to” with the actual location of the stored messy data on your device.
2. Inspecting and diagnosing issues
An initial inspection helps identify the key issues in the dataset. Using pandas functions like head(), info(), and describe(), you can discover several problems: malformed column names, missing values, and incorrect data types.
# Inspect the dataset
df.head()
df.info()
df.describe(include='all')
3. Renaming columns
Descriptive and correctly spelled column names are essential for readability and maintenance. You can fix the corrupted column names caused by encoding issues.
# Rename columns
df.rename(columns={
'IMBD title ID': 'IMDB_title_ID',
'Original titlÊ': 'Original_title',
'Release year': 'Release_year',
'Genrë¨': 'Genre',
'Duration': 'Duration',
'Country': 'Country',
'Content Rating': 'Content_Rating',
'Director': 'Director',
'Income': 'Income',
'Votes': 'Votes',
'Score': 'Score'
}, inplace=True)
4. Handling missing values
Missing data can lead to inaccurate analysis. Different strategies can be used to handle missing values, such as filling with placeholders and dropping rows with critical missing data.
# Handle missing values
df['Duration'].fillna('Unknown', inplace=True)
df['Content_Rating'].fillna('Not Rated', inplace=True)
# Drop rows with critical missing data
df.dropna(subset=['IMDB_title_ID', 'Original_title', 'Release_year'], inplace=True)
5. Converting data types
Correct data types are necessary for accurate computations. You may convert several columns, such as dates, income, and votes, from text to their appropriate types.
# Convert data types
df['Release_year'] = pd.to_datetime(df['Release_year'], errors='coerce')
df['Income'] = pd.to_numeric(df['Income'].replace('[\$,]', '', regex=True), errors='coerce')
df['Votes'] = pd.to_numeric(df['Votes'].replace('[\.,]', '', regex=True), errors='coerce')
6. Standardising data
Consistency is key in data analysis. You may standardise the formatting of the ‘Score’ column, ensuring all values are numeric and correctly scaled.
# Clean and standardize the 'Score' column
df['Score'] = pd.to_numeric(df['Score'].replace('[^0-9.]', '', regex=True), errors='coerce')
# Normalise any scores greater than 10
df['Score'] = df['Score'].apply(lambda x: x/10 if x > 10 else x)
7. Final quality check
Finally, you may want to re-inspect the dataset to ensure all issues were resolved, and the data was clean and ready for analysis.
# Final inspection
df.info()
df.head()
Summary of steps taken
- Loading the Dataset: Used the correct encoding to load the dataset without errors.
- Inspecting the Data: Conducted an initial inspection to identify key issues.
- Renaming Columns: Corrected malformed column names for better readability.
- Handling Missing Values: Filled missing values with placeholders and dropped rows with essential missing data.
- Converting Data Types: Converted text fields to appropriate data types like datetime and numeric.
- Standardising Data: Cleaned and standardised numeric fields to ensure consistency.
- Final Quality Check: Conducted a final review to confirm that the dataset was clean.
You can download the cleaned HR dataset using this GitHub link.
Below is a snippet of the cleaned dataset:
IMDB_title_ID | Original_title | Release_year | Genre | Duration | Country | Content_Rating | Director | Unnamed: 8 | Income | Votes | Score |
---|---|---|---|---|---|---|---|---|---|---|---|
tt0111161 | The Shawshank Redemption | 1995-02-10 | Drama | 142 | USA | R | Frank Darabont | 28815245.0 | 2278845 | 9.3 | |
tt0068646 | The Godfather | Crime, Drama | 175 | USA | R | Francis Ford Coppola | 246120974.0 | 1572674 | 9.2 | ||
tt0468569 | The Dark Knight | Action, Crime, Drama | 152 | US | PG-13 | Christopher Nolan | 1005455211.0 | 2241615 | 9.0 | ||
tt0071562 | The Godfather: Part II | 1975-09-25 | Crime, Drama | 220 | USA | R | Francis Ford Coppola | 1098714 | 9.0 | ||
tt0110912 | Pulp Fiction | 1994-10-28 | Crime, Drama | USA | R | Quentin Tarantino | 222831817.0 | 1780147 | 8.9 | ||
tt0167260 | The Lord of the Rings: The Return of the King | Action, Adventure, Drama | 201 | New Zealand | PG-13 | Peter Jackson | 1142271098.0 | 1604280 | 8.9 | ||
tt0108052 | Schindler's List | 1994-03-11 | Biography, Drama, History | Nan | USA | R | Steven Spielberg | 322287794.0 | 1183248 | 8.9 | |
tt0050083 | 12 Angry Men | 1957-09-04 | Crime, Drama | 96 | USA | Not Rated | Sidney Lumet | 576.0 | 668473 | 8.9 | |
tt1375666 | Inception | 2010-09-24 | Action, Adventure, Sci-Fi | 148 | USA | PG-13 | Christopher Nolan | 869784991.0 | 2002816 |
Conclusion
Data cleaning is an indispensable step in any data analysis project. In this tutorial, we navigated through the process of cleaning a messy IMDB dataset, addressing issues such as encoding errors, missing values, and inconsistent formats.
By following these steps, you can ensure that your data is in top shape, paving the way for more accurate and insightful analysis.
Now that your dataset is clean, you’re ready to dive into data analysis or visualisation. Happy coding!