Research Guides: Data Cleaning: Deal with Missing Values

Missing Data

Often, datasets do not contain complete information. This can present problems, particularly when multiple fields have information missing.

There are several ways of dealing with this problem, but none of them are guaranteed to work in every situation.

If at all possible, try to find accurate data that you can use to fill in gaps in your dataset. Try to find an external data source that contains the data that your original dataset is missing.
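As a rough illustration of filling gaps from a second source, the sketch below uses only the Python standard library. The county records, the `external` lookup table, and all field names are hypothetical, and a real project would also need to confirm that the two sources measure the same thing.

```python
# Hypothetical primary dataset; None marks a missing value.
primary = [
    {"county": "A", "population": 12000},
    {"county": "B", "population": None},
    {"county": "C", "population": None},
]

# Hypothetical external source, keyed by county. It has no entry for "C".
external = {"B": 8500}

# Fill a gap only where the primary value is missing; keep existing values as-is.
filled = [
    {**row, "population": external.get(row["county"])}
    if row["population"] is None
    else row
    for row in primary
]

# Record which entries are still missing so they can be reported.
still_missing = [row["county"] for row in filled if row["population"] is None]
print(filled)
print("still missing:", still_missing)
```

Keeping a list of the entries that could not be filled makes it easier to document what remains incomplete.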

If that is not possible, one common way of dealing with missing data is simply to delete every entry that has a missing value. This is often called "listwise deletion" (or "complete-case analysis"). It works well only when missingness is unrelated to any characteristic of the data, a condition statisticians call "missing completely at random". If entries that match a particular criterion are more likely to be missing than others, deleting them will bias the dataset. Either way, because entries are deleted outright, the sample size shrinks, which can reduce the statistical power of any tests based on the data.
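A minimal sketch of listwise deletion in plain Python follows. The survey records and the `listwise_delete` helper are hypothetical names chosen for illustration; in pandas, the `DataFrame.dropna()` method does the same job.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 61000},
]


def listwise_delete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(records)
print(f"kept {len(complete)} of {len(records)} rows")
```

Half of this small sample is lost even though each deleted row is missing only one field, which is the loss of statistical power described above.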

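Another method is "mean substitution", a form of imputation in which each missing value is replaced with the mean of the observed values for that variable. This preserves the sample size and leaves the variable's mean unchanged, but it can also introduce bias: it artificially shrinks the variable's variance and weakens its correlations with other variables.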
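The following sketch shows mean substitution on a single hypothetical column of ages, using only the Python standard library. It also compares the sample variance before and after imputation to show one way the method can distort a dataset.

```python
from statistics import mean, variance

# Hypothetical column of ages; None marks a missing value.
ages = [34, None, 29, 41, None, 50]

# Compute the mean from the observed values only.
observed = [a for a in ages if a is not None]
fill_value = mean(observed)

# Replace each missing value with that mean.
imputed = [fill_value if a is None else a for a in ages]

print("mean before:", mean(observed), "after:", mean(imputed))
print("variance before:", variance(observed), "after:", variance(imputed))
```

The mean is unchanged, but the variance drops because every imputed value sits exactly at the center of the distribution.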

Transparency

No matter what you do to solve the problem of missing data, you should understand why you are choosing that method, and acknowledge potential shortcomings of techniques you use. Deleting data from a dataset or imputing values based on their expected value can have a significant effect on the bias of the dataset, and on that of the resulting model.

Summarizing your process and explaining your decisions will help you communicate your research process accurately, as well as how confident you are in the results.

Missing Data in Python, R, and Stata

Missing data is represented differently and behaves differently across data cleaning platforms.

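In Python, the pandas and NumPy libraries usually represent missing numeric values as "NaN" ("Not a Number"); plain Python also has the "None" object, which pandas treats as missing as well. NaN does not raise an error in most calculations. Instead, it propagates silently: NumPy functions such as mean return NaN if any input is NaN, while pandas summary methods skip NaN values by default. Check which behavior your library uses, and replace or remove missing values deliberately before computing summary statistics.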
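A small standard-library sketch of how NaN behaves in plain Python arithmetic is shown here. The values are made up; the point is that a single NaN turns an ordinary average into NaN without raising an error, so missing values have to be filtered out explicitly.

```python
import math

nan = float("nan")
values = [2.0, 4.0, nan, 6.0]  # hypothetical measurements with one missing value

# Arithmetic involving NaN produces NaN rather than raising an error.
naive_mean = sum(values) / len(values)

# NaN is not equal to anything, including itself, so test with math.isnan().
present = [v for v in values if not math.isnan(v)]
clean_mean = sum(present) / len(present)

print("naive mean:", naive_mean)
print("mean of observed values:", clean_mean)
print("nan == nan:", nan == nan)
```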

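R has several special values that are easy to confuse: "NA" (a logical constant, like TRUE and FALSE, that marks a missing value), "NULL" (an empty object, marking the absence of a value altogether), "NaN" ("Not a Number"), and "Inf"/"-Inf" (positive and negative infinity, which can be the result of calculations gone wrong, such as division by zero). By default, most R statistical functions, such as mean(), return NA when their input contains NA; pass the argument na.rm = TRUE to drop missing values and compute statistics based on the existing data.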

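Stata's symbolic representation of a missing numeric value is a single period ("."). Stata treats missing values as larger than any real number, so a condition such as "if x > 100" will also select observations where x is missing. Most summary commands, such as summarize, leave missing values out, but it is very important to account for them explicitly (for example, by testing for them with the missing() function) whenever you filter, compare, or generate values in Stata.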