While numerical data is frequently stored in integer or floating-point format, there are many instances of it being stored as part of a text string. This can make it more difficult to extract and use in a statistical model. When this is the case, it can be helpful to know how to trim unwanted characters, how to convert between data types, and how to avoid pitfalls when doing so.
Numerical data will frequently contain non-numeric characters, including currency symbols, commas, and percent signs. In the majority of English-speaking countries, numbers in the thousands and above are often written with commas as a delimiter or "thousands separator", as in:
10,000 for "ten thousand."
In non-English-speaking countries, if a thousands separator is used, it is typically a dot or a space, as in:
10 000 or 10.000 for "ten thousand"
In such cases, commas are often used instead of dots as the decimal separator.
If the number in question is stored as an integer or floating-point number (decimal) and follows the expected formatting of commas for thousands and a dot to separate decimal places, many statistical software platforms will automatically remove the commas when loading the data. However, when the data is stored as a text string, or follows the other convention (dots or spaces for thousands, a comma for decimals), the user must often edit the punctuation manually.
Editing these symbols out will typically require working with string functions/methods (see "Clean Your Text Data").
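For example, a minimal sketch in Python (the sample values and the to_number helper are purely illustrative, not taken from any particular dataset) might strip the unwanted characters and then convert the text to a number:

# A rough sketch of cleaning a price stored as text, then converting it
# to a float. The values and helper name are illustrative only.
def to_number(raw: str) -> float:
    """Strip a currency symbol, thousands separators, and stray whitespace,
    then convert the remaining text to a float."""
    cleaned = raw.strip().lstrip("$").replace(",", "")
    return float(cleaned)

print(to_number("$10,000"))     # 10000.0
print(to_number(" 1,234.56 "))  # 1234.56

# European-style input such as "10.000,50" would need the dots and commas
# swapped first, e.g. raw.replace(".", "").replace(",", ".")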
Numerical data typically comes in two varieties, discrete and continuous. Discrete numerical data is a count of something indivisible, such as number of patients who receive a placebo in a medical study, or the number of rooms in a house. Continuous numerical data is a measurement of something in units that can be subdivided, such as height in inches or cost in dollars.
Crucially, continuous numerical data can fit a normal distribution; discrete numerical data cannot.
Numerical data can be categorized differently depending on which platform you are working in. In Python and R, integers are simply called "integers". In Stata, integers outside a certain range (less than -32,767 or greater than 32,740) are called "long" integers. Stata also has an additional "byte" data type for integers between -127 and 100. Stata's additional terms may seem confusing, but they allow data in a confined range to be stored using less memory.
Numbers with a decimal component are referred to in Python simply as "floats" (for "floating-point" numbers), whereas in R, they are referred to as "doubles" (for "double-precision floating-point" numbers). In Stata, floating-point numbers can be stored as "floats" for values with about 7 digits of accuracy, or as "doubles" for values with up to 16 digits of accuracy.
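To make the memory trade-off concrete, here is a small sketch in Python with NumPy (the array values are invented); choosing a smaller integer type for data in a confined range works much like Stata's byte/int/long distinction:

import numpy as np

# Hypothetical ages: every value fits in one byte (-128 to 127), so an
# 8-bit integer type needs a quarter of the memory of a 32-bit one.
ages_32 = np.array([23, 45, 67, 31, 52], dtype=np.int32)
ages_8 = ages_32.astype(np.int8)
print(ages_32.nbytes)  # 20 bytes
print(ages_8.nbytes)   # 5 bytes

# Floating-point precision behaves similarly: float32 keeps roughly
# 7 significant digits, float64 (a "double") roughly 16.
print(np.float32(0.123456789))  # 0.12345679 (rounded after ~7 digits)
print(np.float64(0.123456789))  # 0.123456789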
Time data can be some of the most complicated and troublesome numerical data to clean. Many software platforms (and, consequently, sources of data) have wildly different default formats for denoting time.
Times can be written in 12- or 24-hour formats, and location-based data is usually tied to a particular time zone, which can also be affected by Daylight Saving Time.
ISO-8601 is a commonly used international standard for date formatting, written from largest to smallest interval:
YYYY-MM-DD HH:MM:SS.SSS+00:00
(for Year-Month-Day Hours:Minutes:Seconds.Milliseconds+time zone offset)
This format has the advantage of being easily sortable in a programming environment. Unfortunately, there are many different interpretations of ISO-8601, with different conventions for punctuation, so if you work extensively with time data, you will probably encounter variations that must be reformatted to be compatible with one another.
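To illustrate the sortability point, here is a brief sketch in Python (the timestamps are invented): because the fields run from largest to smallest, plain string sorting puts ISO-formatted timestamps in chronological order, and the standard library can parse this particular variant directly.

from datetime import datetime

# Invented timestamps in one ISO-8601 variant.
stamps = [
    "2025-04-29 16:29:32+00:00",
    "2024-12-31 23:59:59+00:00",
    "2025-01-01 00:00:00+00:00",
]

# Alphabetical order is also chronological order.
print(sorted(stamps))

# datetime.fromisoformat() handles this variant; others (a "T" separator,
# "Z" for UTC, etc.) may need reformatting or a more flexible parser.
parsed = [datetime.fromisoformat(s) for s in stamps]
print(min(parsed))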
Another format that you'll likely run into at some point is Unix time, which measures the number of seconds since midnight UTC (Coordinated Universal Time) on January 1, 1970. It's also not uncommon for Unix time to be measured in milliseconds or nanoseconds for additional granularity. This format is useful for computational purposes, since it's expressed in integers, but is basically unreadable for human beings. At the time of writing this guide, Tuesday, April 29, 2025 16:29:32 PDT, the Unix timestamp is 1745969372.
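As a quick sketch in Python (using the timestamp quoted above), converting between Unix time and a readable datetime looks like this:

from datetime import datetime, timezone

# Seconds since 1970-01-01 00:00:00 UTC (the timestamp quoted above).
ts = 1745969372

# Convert to a readable datetime; always state the time zone explicitly.
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt)                   # 2025-04-29 23:29:32+00:00
print(int(dt.timestamp()))  # 1745969372, converting back again

# Timestamps recorded in milliseconds or nanoseconds must be divided by
# 1_000 or 1_000_000_000 first, or the resulting dates will land far in the future.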
Time series data often necessitates working with time on a cyclical basis, tracking fluctuations and the periodicity of events. Note, for example, that stock market data excludes weekend dates, a detail which can be important to take into account when running regressions on time series data (see the sketch below).
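As a rough sketch (assuming the pandas library; the date range is arbitrary), comparing a calendar-day index with a business-day index makes the weekend gaps visible:

import pandas as pd

# Calendar days versus business days over the same arbitrary week.
all_days = pd.date_range("2025-04-28", "2025-05-04", freq="D")
trading_days = pd.bdate_range("2025-04-28", "2025-05-04")
print(len(all_days))      # 7 calendar days
print(len(trading_days))  # 5 weekdays; Saturday and Sunday are dropped

# Reindexing daily prices onto the calendar-day index exposes the gaps
# as missing values rather than silently skipping them.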
Note: The topic of cleaning time data is a much larger one than this guide can fully cover. At this time, the Library does not have a published research guide on date/time formatting. When such a guide is published, a link to it will be provided here.
Often used in the social sciences and other fields that rely on survey data, rating scales such as Likert scales provide numerical codes that correspond to categories along a continuum. While these are recorded as numbers, their values don't necessarily follow an even scale; on a four-point Likert scale, the "distance" between "agree strongly" and "agree somewhat" is subjective, and may not be equivalent to the distance, as the individual sees it, between "agree somewhat" and "disagree somewhat".
Because of this distinction, it is important to treat these variables with care when incorporating them into statistical models.
It is also important to note that there is a difference between rating scales that have a neutral option (neither agree nor disagree) and those that force a decision between slight agreement and slight disagreement; a neutral option is sometimes functionally equivalent to "did not answer" or "no data".
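One common way to keep the ordering explicit without implying equal spacing between levels is an ordered categorical type. Here is a brief sketch in Python with pandas (the response labels and values are illustrative):

import pandas as pd

# Illustrative four-point Likert responses stored as text.
responses = pd.Series([
    "agree strongly", "disagree somewhat", "agree somewhat", "agree strongly",
])

# An ordered categorical preserves the ranking without asserting that
# adjacent levels are evenly spaced.
levels = ["disagree strongly", "disagree somewhat", "agree somewhat", "agree strongly"]
ratings = pd.Categorical(responses, categories=levels, ordered=True)

print(ratings.codes)                 # [3 1 2 3]: integer codes in the order defined above
print(ratings.min(), ratings.max())  # disagree somewhat / agree strongly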