How to clean a job postings dataset using Python

Introduction


When working with real-world data, it’s common to encounter datasets that are messy, inconsistent, and full of missing values.

Before diving into any analysis, it’s crucial to clean your data to ensure accuracy and reliability.

In this tutorial, we’ll be working with a dataset that contains job postings for data science positions. The data was scraped from Glassdoor’s website.

The dataset can be downloaded using this GitHub link.

This dataset has several issues as briefly described below.

Understanding the job postings data

The dataset contains 672 entries and 15 columns. The columns are as follows:

  1. index: A numerical index (likely not necessary and can be dropped).
  2. Job Title: The title of the job position.
  3. Salary Estimate: The salary range, which also includes some text (“Glassdoor est.”).
  4. Job Description: A textual description of the job.
  5. Rating: A numerical rating of the company.
  6. Company Name: The name of the company, but it also includes the rating in some cases.
  7. Location: The location of the job.
  8. Headquarters: The location of the company’s headquarters.
  9. Size: The size of the company (number of employees).
  10. Founded: The year the company was founded.
  11. Type of ownership: The type of ownership (e.g., Private, Public).
  12. Industry: The industry to which the company belongs.
  13. Sector: The sector of the economy.
  14. Revenue: The revenue of the company, often including text (e.g., “(USD)”).
  15. Competitors: The names of competitors, but some entries have “-1”, likely indicating missing data.
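
A quick inspection along the following lines can confirm the shape of the data and reveal placeholder values before any cleaning starts. This is only a minimal sketch: the helper name `count_placeholders` and the tiny sample frame are illustrative assumptions, not part of the original dataset or tutorial code.

```python
import pandas as pd

def count_placeholders(df, placeholders=(-1, "-1")):
    """Count, per column, how many cells hold a placeholder such as -1."""
    return df.isin(list(placeholders)).sum()

# Tiny made-up sample that mimics two of the columns described above
sample = pd.DataFrame({
    "Founded": [1993, -1],
    "Competitors": ["-1", "Square, PayPal"],
})
print(sample.shape)                # (rows, columns)
print(count_placeholders(sample))  # placeholder count per column
```

On the real data, calling `df.info()` as well would list each column's type and non-null count.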

Suggested cleaning strategy

  1. Drop Unnecessary Columns: The index column may not be needed.
  2. Separate Company Name and Rating: The Company Name column often includes the company’s rating, which needs to be separated.
  3. Clean Salary Estimate: Remove extra text like “(Glassdoor est.)” and convert the salary to a numerical range.
  4. Handle Missing Values: Check for and appropriately handle missing values, particularly in the Competitors column.
  5. Standardise Formats: Ensure consistent formatting in columns like Size, Revenue, and Location.
  6. Extract Additional Features: Consider extracting features like minimum and maximum salary from the Salary Estimate column.

Step-by-Step data cleaning process

Step 1: Import the Necessary Libraries

First, import the Pandas library, which is essential for data manipulation in Python. Next, load the dataset into a Pandas DataFrame.

NB: Replace “path_to” with the actual location of the messy data file on your device.
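
A minimal sketch of this step, assuming the raw data is stored as a CSV file. The function name `load_job_postings` and the file name `job_postings.csv` are placeholders chosen for illustration; use the name of the file you actually downloaded.

```python
import pandas as pd

def load_job_postings(csv_path):
    """Read the raw job postings CSV into a Pandas DataFrame."""
    return pd.read_csv(csv_path)

# Example call (hypothetical file name; replace path_to with your own folder):
# df = load_job_postings("path_to/job_postings.csv")
# df.head()  # preview the first few rows
```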

Step 2: Drop Unnecessary Columns

In this step, we’ll remove columns that are not needed for our analysis, such as the index and Job Description columns.
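
One way to drop these two columns is sketched below. The helper name `drop_unneeded_columns` and the one-row sample are assumptions made for illustration.

```python
import pandas as pd

def drop_unneeded_columns(df):
    """Return a copy without the 'index' and 'Job Description' columns."""
    # errors="ignore" keeps the call safe if a column is already absent
    return df.drop(columns=["index", "Job Description"], errors="ignore")

# Made-up sample row with the same column names as the dataset
sample = pd.DataFrame({
    "index": [0],
    "Job Title": ["Data Scientist"],
    "Job Description": ["Build models."],
})
cleaned = drop_unneeded_columns(sample)
print(cleaned.columns.tolist())
```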

Step 3: Separate Company Name and Rating

The Company Name column contains both the company name and its rating, separated by a newline character. We’ll separate these into distinct columns.
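
The split could look roughly like the sketch below. It assumes the name comes first and the rating follows the newline, and that some rows may carry no rating at all. The helper name `split_company_and_rating` is hypothetical. Note that this version overwrites any existing Rating column with the extracted values.

```python
import pandas as pd

def split_company_and_rating(df):
    """Split 'Company Name' on the first newline into name and rating."""
    out = df.copy()
    parts = out["Company Name"].str.split("\n", n=1, expand=True)
    out["Company Name"] = parts[0].str.strip()
    if parts.shape[1] > 1:
        # Rows without a rating become NaN instead of raising an error
        out["Rating"] = pd.to_numeric(parts[1], errors="coerce")
    else:
        out["Rating"] = float("nan")
    return out

# Made-up sample: one name with an embedded rating, one without
sample = pd.DataFrame({"Company Name": ["Healthfirst\n3.1", "ManTech"]})
print(split_company_and_rating(sample))
```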

Step 4: Clean the Salary Estimate Column

The Salary Estimate column contains salary ranges mixed with extra text, such as “Glassdoor est.” We need to clean this column and convert the salary values into a numerical format.
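
A possible implementation is sketched below. It assumes each estimate looks like "$137K-$171K (Glassdoor est.)", that is, two dollar amounts in thousands separated by a hyphen. Rows that do not match this assumed pattern end up with missing salaries. The helper name `split_salary_estimate` is hypothetical.

```python
import pandas as pd

def split_salary_estimate(df):
    """Turn 'Salary Estimate' into numeric 'Min Salary' and 'Max Salary'."""
    out = df.copy()
    # Capture the two numbers; any trailing text is simply not matched
    bounds = out["Salary Estimate"].str.extract(r"\$(\d+)K\s*-\s*\$(\d+)K")
    # Values are in thousands, so scale them to full dollar amounts
    out["Min Salary"] = pd.to_numeric(bounds[0], errors="coerce").astype(float) * 1000
    out["Max Salary"] = pd.to_numeric(bounds[1], errors="coerce").astype(float) * 1000
    return out.drop(columns=["Salary Estimate"])

# Made-up sample: one well-formed estimate and one that does not match
sample = pd.DataFrame({
    "Salary Estimate": ["$137K-$171K (Glassdoor est.)", "not available"],
})
print(split_salary_estimate(sample))
```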

Step 5: Handle Missing Values

Some columns contain placeholder values such as -1 to indicate missing data. We’ll replace these with None so that Pandas treats them as missing values rather than as real data.
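
The sketch below shows one way to do this for the Competitors column. It uses `Series.mask`, which writes NaN rather than a literal None; Pandas treats both as missing, so `isna()` reports them the same way. The helper name `mark_missing` and its default arguments are assumptions for illustration.

```python
import pandas as pd

def mark_missing(df, columns=("Competitors",), placeholders=("-1", -1)):
    """Replace placeholder values in the given columns with missing values."""
    out = df.copy()
    for col in columns:
        is_placeholder = out[col].isin(list(placeholders))
        out[col] = out[col].mask(is_placeholder)  # masked cells become NaN
    return out

# Made-up sample: one placeholder and one real entry
sample = pd.DataFrame({"Competitors": ["-1", "Square, PayPal, H&R Block"]})
print(mark_missing(sample)["Competitors"].isna())
```

Other columns that use the same placeholder can be passed through the `columns` argument.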

Step 6: Standardise Formats

Next, we ensure consistent formatting across the dataset, particularly in columns like Founded and Revenue.
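
A rough sketch of this step follows. It assumes that Founded should become a numeric year and that Revenue entries reading "Unknown / Non-Applicable" should be treated as missing. The helper name `standardise_founded_and_revenue` is hypothetical.

```python
import pandas as pd

def standardise_founded_and_revenue(df):
    """Make 'Founded' numeric and blank out unknown 'Revenue' entries."""
    out = df.copy()
    # Anything that cannot be parsed as a number becomes NaN
    out["Founded"] = pd.to_numeric(out["Founded"], errors="coerce")
    unknown = out["Revenue"] == "Unknown / Non-Applicable"
    out["Revenue"] = out["Revenue"].mask(unknown)
    return out

# Made-up sample rows
sample = pd.DataFrame({
    "Founded": ["1993", "n/a"],
    "Revenue": ["Unknown / Non-Applicable", "$1 to $2 billion (USD)"],
})
print(standardise_founded_and_revenue(sample))
```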

Step 7: Saving the cleaned data

The dataset is now clean and ready for analysis, so the last step is to save it to a new CSV file.

NB: Replace “path_to” with the actual location where you would like the cleaned dataset to be stored on your device.
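
Saving could be done along these lines. The function name `save_cleaned` and the output file name are placeholders for illustration.

```python
import pandas as pd

def save_cleaned(df, csv_path):
    """Write the cleaned DataFrame to a CSV file."""
    # index=False avoids writing the row index as an extra column
    df.to_csv(csv_path, index=False)

# Example call (hypothetical file name; replace path_to with your own folder):
# save_cleaned(df, "path_to/cleaned_job_postings.csv")
```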

The cleaned dataset can be downloaded using this GitHub link.

Below is just a snippet of the cleaned dataset:

| Job Title | Rating | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | Competitors | Min Salary | Max Salary |
| Sr Data Scientist | 3.1 | Healthfirst | New York, NY | New York, NY | 1001 to 5000 employees | 1993.0 | Nonprofit Organization | Insurance Carriers | Insurance | | EmblemHealth, UnitedHealth Group, Aetna | 137000.0 | 171000.0 |
| Data Scientist | 4.2 | ManTech | Chantilly, VA | Herndon, VA | 5001 to 10000 employees | 1968.0 | Company - Public | Research & Development | Business Services | $1 to $2 billion (USD) | | 137000.0 | 171000.0 |
| Data Scientist | 3.8 | Analysis Group | Boston, MA | Boston, MA | 1001 to 5000 employees | 1981.0 | Private Practice / Firm | Consulting | Business Services | $100 to $500 million (USD) | | 137000.0 | 171000.0 |
| Data Scientist | 3.5 | INFICON | Newton, MA | Bad Ragaz, Switzerland | 501 to 1000 employees | 2000.0 | Company - Public | Electrical & Electronic Manufacturing | Manufacturing | $100 to $500 million (USD) | MKS Instruments, Pfeiffer Vacuum, Agilent Technologies | 137000.0 | 171000.0 |
| Data Scientist | 2.9 | Affinity Solutions | New York, NY | New York, NY | 51 to 200 employees | 1998.0 | Company - Private | Advertising & Marketing | Business Services | | Commerce Signals, Cardlytics, Yodlee | 137000.0 | 171000.0 |
| Data Scientist | 4.2 | HG Insights | Santa Barbara, CA | Santa Barbara, CA | 51 to 200 employees | 2010.0 | Company - Private | Computer Hardware & Software | Information Technology | | | 137000.0 | 171000.0 |
| Data Scientist / Machine Learning Expert | 3.9 | Novartis | Cambridge, MA | Basel, Switzerland | 10000+ employees | 1996.0 | Company - Public | Biotech & Pharmaceuticals | Biotech & Pharmaceuticals | $10+ billion (USD) | | 137000.0 | 171000.0 |
| Data Scientist | 3.5 | iRobot | Bedford, MA | Bedford, MA | 1001 to 5000 employees | 1990.0 | Company - Public | Consumer Electronics & Appliances Stores | Retail | $1 to $2 billion (USD) | | 137000.0 | 171000.0 |
| Staff Data Scientist - Analytics | 4.4 | Intuit - Data | San Diego, CA | Mountain View, CA | 5001 to 10000 employees | 1983.0 | Company - Public | Computer Hardware & Software | Information Technology | $2 to $5 billion (USD) | Square, PayPal, H&R Block | 137000.0 | 171000.0 |

Summary of the data cleaning process

  • Loading the Dataset: We began by loading the dataset into a Pandas DataFrame, which allowed us to inspect and manipulate the data easily.
  • Dropping Unnecessary Columns: We removed columns that were not required for the analysis, specifically the index and Job Description columns, to streamline the dataset.
  • Separating Embedded Data: The Company Name column contained both the company name and its rating, separated by a newline character. We extracted the rating into a new Rating column and kept only the company name in the Company Name column.
  • Cleaning the Salary Data: The Salary Estimate column had salary ranges mixed with extra text, like “Glassdoor est.” We cleaned this column by removing the text and splitting the salary range into two separate columns: Min Salary and Max Salary.
  • Handling Missing Values: The Competitors column had placeholder values like -1 to indicate missing data. We replaced these with None, which Pandas recognises as missing data.
  • Standardising Data Formats: We ensured that the Founded and Revenue columns were in a consistent format, converting the Founded year to a numerical format and standardising the Revenue column by removing any entries marked as “Unknown / Non-Applicable”.
  • Saving the Cleaned Dataset: Finally, the cleaned dataset was saved to a new CSV file, making it ready for further analysis.

Conclusion

By following these steps, we’ve successfully cleaned our dataset, making it well-structured and ready for analysis.

We removed unnecessary columns, separated embedded data, cleaned the salary estimates, and standardised formats. This process ensures that the data you analyse is accurate and reliable.

Cleaning data is an essential step in any data analysis project. It saves time and prevents errors in your analysis, making your results more trustworthy.

With a clean dataset, you’re now ready to dive into deeper data analysis and draw meaningful insights. Happy coding!
