When we talk about "data cleaning," we are generally referring to the process of reviewing and editing a dataset to correct errors, remove duplicates, and ensure that the formatting and content of the dataset are appropriate for our research question. This process varies quite a bit depending on the type of data, the size of the dataset, the statistical methods being used to address the research question, and the processes the data has already been through before it reaches the researcher. Datasets from a curated repository (such as ICPSR) tend to require less cleaning than survey responses collected directly from an online form in which thousands of respondents entered data.
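To make these tasks concrete, here is a minimal sketch in Python using pandas. The file name ("survey_responses.csv") and column names ("state", "age") are hypothetical placeholders; the steps shown are common examples of the kinds of corrections described above, not a prescribed recipe.

```python
import pandas as pd

# Load the raw survey export (file name is hypothetical).
df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate rows, e.g., from accidental double submissions.
df = df.drop_duplicates()

# Standardize formatting of a free-text column (hypothetical "state" column).
df["state"] = df["state"].str.strip().str.upper()

# Coerce a column that should be numeric; unparseable entries become NaN
# so they can be reviewed rather than silently kept as text.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
```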
Data cleaning is not an obvious or objectively neutral process.
The decisions you make about how to clean data will affect your statistical models and the results of your data analysis. Consequently, it is important to maintain transparency regarding data cleaning decisions.
This guide is intended to help you address some common tasks involved in cleaning and preparing data for statistical analysis. Decisions about when and how to use these methods deserve careful thought before proceeding.
There are three general objectives to keep in mind when you start cleaning data. The three Cs (adapted from the National Cancer Institute) are correctness, completeness, and consistency: values should be accurate, required information should not be missing, and the same information should be recorded the same way throughout the dataset.
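A quick way to take stock of each objective is to run a few diagnostic checks before editing anything. The sketch below assumes the same hypothetical file and columns as the earlier example; the specific checks are illustrations, not an exhaustive audit.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Correctness: do numeric values fall in a plausible range?
print(df["age"].describe())  # hypothetical "age" column

# Consistency: are categorical codes recorded uniformly?
print(df["state"].value_counts())  # hypothetical "state" column
```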
Regardless of what method you use to clean data or of the nature of the project, it is extremely important to preserve a copy of the original, unaltered data. This ensures reproducibility of your results, and allows for correction of any mistakes made during the data cleaning process.
All methods presented in this guide assume that cleaning is being done on a redundant copy of the original data. Remember to make a backup of any raw data you recorded yourself!
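One simple way to honor this rule is to copy the raw file into a separate backup location before any cleaning begins and never write to that location again. The paths below are hypothetical; this is a minimal sketch of the idea using the standard library.

```python
import shutil
from pathlib import Path

raw = Path("data/raw/survey_responses.csv")        # hypothetical path
backup = Path("data/backup/survey_responses.csv")  # hypothetical path

# Copy the raw file before any cleaning; copy2 also preserves timestamps.
backup.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(raw, backup)
```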
Not only is it good practice to maintain the original, unedited data in a file in your project directory; when working in a statistical programming environment or other analytical software, it is also helpful to keep an unedited copy of the data in memory (assuming the dataset is not too large) so that you can quickly pivot if you need to undo or revise any cleaning or other transformation.
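In pandas, this pattern amounts to reading the data once and working on a copy. The file name is again hypothetical; the point is that the raw DataFrame is never modified.

```python
import pandas as pd

df_raw = pd.read_csv("survey_responses.csv")  # hypothetical file

# Work on a deep copy; df_raw stays untouched in memory.
df = df_raw.copy()

# ... cleaning steps applied to df ...

# If a transformation turns out to be wrong, reset without re-reading:
df = df_raw.copy()
```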
Not every data science or data analysis project will be the same, but it is common to repeat some steps as you become more familiar with your dataset(s). This flowchart is not a definitive summary of the process between generating a research question and getting interpretable results, but it does illustrate the cyclical nature of the workflow.
The Claremont Colleges Library's GitHub page hosts repositories for past Python workshops. These workshops are presented in Jupyter Notebook format (.ipynb). To use these materials, either use Google Colab to open the GitHub link (see "Start Learning Python: Colab" box in the Introduction to Python research guide), or download the files to your local hard drive and use Jupyter Notebook or JupyterLab to open them (see the "Install Python" page in the Introduction to Python research guide).