When we talk about "data cleaning," we are generally referring to the process of reviewing and editing a dataset to correct errors, remove duplicates, and ensure that the formatting and content of the dataset are appropriate for our research question. This process varies quite a bit depending on the type of data, the size of the dataset, the statistical methods being used to address the research question, and the processes the data has already been through before it reaches the researcher. Datasets from a curated repository (such as ICPSR) tend to require less cleaning than survey responses collected directly from an online form in which thousands of respondents entered data.
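To make these tasks concrete, here is a minimal sketch in Python using pandas. The file name ("survey_responses.csv") and column names ("state", "age") are hypothetical placeholders; the steps shown are common examples of the kinds of corrections described above, not a prescribed recipe.

```python
import pandas as pd

# Load the raw survey export (file name is hypothetical).
df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate rows, e.g., from accidental double submissions.
df = df.drop_duplicates()

# Standardize formatting of a free-text column (hypothetical "state" column).
df["state"] = df["state"].str.strip().str.upper()

# Coerce a column that should be numeric; unparseable entries become NaN
# so they can be reviewed rather than silently kept as text.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
```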
Data cleaning is not an obvious or objectively neutral process.
The decisions you make about how to clean data will affect your statistical models and the results of your data analysis. Consequently, it is important to maintain transparency regarding data cleaning decisions.
This guide is intended to help you address some common tasks involved in cleaning and preparing data for statistical analysis. Decisions about when and how to use these methods deserve careful thought before proceeding.
There are three general objectives to keep in mind when you start cleaning data. The three Cs (adapted from the National Cancer Institute) are correctness, completeness, and consistency: values should be accurate, required information should not be missing, and the same information should be recorded the same way throughout the dataset.
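A quick way to take stock of each objective is to run a few diagnostic checks before editing anything. The sketch below assumes the same hypothetical file and columns as the earlier example; the specific checks are illustrations, not an exhaustive audit.

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Correctness: do numeric values fall in a plausible range?
print(df["age"].describe())  # hypothetical "age" column

# Consistency: are categorical codes recorded uniformly?
print(df["state"].value_counts())  # hypothetical "state" column
```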
Regardless of what method you use to clean data or of the nature of the project, it is extremely important to preserve a copy of the original, unaltered data. This ensures reproducibility of your results, and allows for correction of any mistakes made during the data cleaning process.
All methods presented in this guide assume that cleaning is being done on a redundant copy of the original data. Remember to make a backup of any raw data you recorded yourself!
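One simple way to honor this rule is to copy the raw file into a separate backup location before any cleaning begins and never write to that location again. The paths below are hypothetical; this is a minimal sketch of the idea using the standard library.

```python
import shutil
from pathlib import Path

raw = Path("data/raw/survey_responses.csv")        # hypothetical path
backup = Path("data/backup/survey_responses.csv")  # hypothetical path

# Copy the raw file before any cleaning; copy2 also preserves timestamps.
backup.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(raw, backup)
```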
Not only is it good practice to maintain the original, unedited data in a file in your project directory; when working in a statistical programming environment or other analytical software, it is also helpful to keep an unedited copy of the data in memory (assuming the dataset is not too large) so that you can quickly pivot if you need to undo or revise any cleaning or other transformation.
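In pandas, this pattern amounts to reading the data once and working on a copy. The file name is again hypothetical; the point is that the raw DataFrame is never modified.

```python
import pandas as pd

df_raw = pd.read_csv("survey_responses.csv")  # hypothetical file

# Work on a deep copy; df_raw stays untouched in memory.
df = df_raw.copy()

# ... cleaning steps applied to df ...

# If a transformation turns out to be wrong, reset without re-reading:
df = df_raw.copy()
```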
Not every data science or data analysis project will be the same, but it is common to repeat some steps as you become more familiar with your dataset(s). This flowchart is not a definitive summary of the process between generating a research question and getting interpretable results, but it does illustrate the cyclical nature of the workflow.
The Claremont Colleges Library's GitHub page hosts repositories for past Python workshops. These workshops are presented in Jupyter Notebook format (.ipynb). To use these materials, either use Google Colab to open the GitHub link (see "Start Learning Python: Colab" box in the Introduction to Python research guide), or download the files to your local hard drive and use Jupyter Notebook or JupyterLab to open them (see the "Install Python" page in the Introduction to Python research guide).