While numerical data is frequently stored in integer or floating-point format, there are many instances of it being stored as part of a text string. This can make it more difficult to extract and use in a statistical model. When this is the case, it can be helpful to know how to trim unwanted characters, how to convert between data types, and how to avoid pitfalls when doing so.
Numerical data will frequently contain non-numeric characters, including currency symbols, commas, and percent signs. In the majority of English-speaking countries, numbers in the thousands and above are often written with commas as a delimiter or "thousands separator", as in:
10,000 for "ten thousand."
In non-English-speaking countries, if a thousands separator is used, it is typically a dot or a space, as in:
10 000 or 10.000 for "ten thousand"
In such cases, commas are often used instead of dots as the decimal separator.
If the number in question is stored as an integer or floating-point number (decimal) and follows the expected formatting of commas for thousands and a dot to separate decimal places, many statistical software platforms will automatically remove the commas when loading the data. However, when the data is stored as a text string, or follows the other convention (dots or spaces for thousands, a comma for decimals), the user must often edit the punctuation manually.
Editing these symbols out will typically require working with string functions/methods (see "Clean Your Text Data").
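For example, a minimal sketch in Python (the sample values and the to_number helper are purely illustrative, not taken from any particular dataset) might strip the unwanted characters and then convert the text to a number:

# A rough sketch of cleaning a price stored as text, then converting it
# to a float. The values and helper name are illustrative only.
def to_number(raw: str) -> float:
    """Strip a currency symbol, thousands separators, and stray whitespace,
    then convert the remaining text to a float."""
    cleaned = raw.strip().lstrip("$").replace(",", "")
    return float(cleaned)

print(to_number("$10,000"))     # 10000.0
print(to_number(" 1,234.56 "))  # 1234.56

# European-style input such as "10.000,50" would need the dots and commas
# swapped first, e.g. raw.replace(".", "").replace(",", ".")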
Numerical data typically comes in two varieties, discrete and continuous. Discrete numerical data is a count of something indivisible, such as number of patients who receive a placebo in a medical study, or the number of rooms in a house. Continuous numerical data is a measurement of something in units that can be subdivided, such as height in inches or cost in dollars.
Crucially, continuous numerical data can fit a normal distribution; discrete numerical data cannot.
Numerical data can be categorized differently depending on which platform you are working in. In Python and R, integers are simply called "integers". In Stata, integers outside a certain range (less than -32,767 or greater than 32,740) are called "long" integers. Stata also has an additional "byte" data type for integers between -127 and 100. Stata's additional terms may seem confusing, but they allow data in a confined range to be stored using less memory.
Numbers with a decimal component are referred to in Python simply as "floats" (for "floating-point" numbers), whereas in R, they are referred to as "doubles" (for "double-precision floating-point" numbers). In Stata, floating-point numbers can be stored as "floats" for values with about 7 digits of accuracy, or as "doubles" for values with up to 16 digits of accuracy.
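To make the memory trade-off concrete, here is a small sketch in Python with NumPy (the array values are invented); choosing a smaller integer type for data in a confined range works much like Stata's byte/int/long distinction:

import numpy as np

# Hypothetical ages: every value fits in one byte (-128 to 127), so an
# 8-bit integer type needs a quarter of the memory of a 32-bit one.
ages_32 = np.array([23, 45, 67, 31, 52], dtype=np.int32)
ages_8 = ages_32.astype(np.int8)
print(ages_32.nbytes)  # 20 bytes
print(ages_8.nbytes)   # 5 bytes

# Floating-point precision behaves similarly: float32 keeps roughly
# 7 significant digits, float64 (a "double") roughly 16.
print(np.float32(0.123456789))  # 0.12345679 (rounded after ~7 digits)
print(np.float64(0.123456789))  # 0.123456789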
Time data can be some of the most complicated and troublesome numerical data to clean. Many software platforms (and, consequently, sources of data) have wildly different default formats for denoting time.
Times can be written in 12- or 24-hour formats, and location-based data is usually tied to a particular time zone, which can also be affected by Daylight Saving Time.
ISO-8601 is a commonly used international standard for date formatting, written from largest to smallest interval:
YYYY-MM-DD HH:MM:SS.SSS+00:00
(for Year-Month-Day Hours:Minutes:Seconds.Milliseconds+time zone offset)
This format has the advantage of being easily sortable in a programming environment. Unfortunately, there are many different interpretations of ISO-8601, with different conventions for punctuation, so if you work extensively with time data, you will probably encounter variations that must be reformatted to be compatible with one another.
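To illustrate the sortability point, here is a brief sketch in Python (the timestamps are invented): because the fields run from largest to smallest, plain string sorting puts ISO-formatted timestamps in chronological order, and the standard library can parse this particular variant directly.

from datetime import datetime

# Invented timestamps in one ISO-8601 variant.
stamps = [
    "2025-04-29 16:29:32+00:00",
    "2024-12-31 23:59:59+00:00",
    "2025-01-01 00:00:00+00:00",
]

# Alphabetical order is also chronological order.
print(sorted(stamps))

# datetime.fromisoformat() handles this variant; others (a "T" separator,
# "Z" for UTC, etc.) may need reformatting or a more flexible parser.
parsed = [datetime.fromisoformat(s) for s in stamps]
print(min(parsed))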
Another format that you'll likely run into at some point is Unix time, which measures the number of seconds since midnight UTC (Coordinated Universal Time) on January 1, 1970. It's also not uncommon for Unix time to be measured in milliseconds or nanoseconds for additional granularity. This format is useful for computational purposes, since it's expressed in integers, but is basically unreadable for human beings. At the time of writing this guide, Tuesday, April 29, 2025 16:29:32 PDT, the Unix timestamp is 1745969372.
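As a quick sketch in Python (using the timestamp quoted above), converting between Unix time and a readable datetime looks like this:

from datetime import datetime, timezone

# Seconds since 1970-01-01 00:00:00 UTC (the timestamp quoted above).
ts = 1745969372

# Convert to a readable datetime; always state the time zone explicitly.
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt)                   # 2025-04-29 23:29:32+00:00
print(int(dt.timestamp()))  # 1745969372, converting back again

# Timestamps recorded in milliseconds or nanoseconds must be divided by
# 1_000 or 1_000_000_000 first, or the resulting dates will land far in the future.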
Time series data often necessitates working with time on a cyclical basis, tracking fluctuations and the periodicity of events. Note, for example, that stock market data excludes weekend dates, a detail which can be important to take into account when running regressions on time series data (see the sketch below).
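As a rough sketch (assuming the pandas library; the date range is arbitrary), comparing a calendar-day index with a business-day index makes the weekend gaps visible:

import pandas as pd

# Calendar days versus business days over the same arbitrary week.
all_days = pd.date_range("2025-04-28", "2025-05-04", freq="D")
trading_days = pd.bdate_range("2025-04-28", "2025-05-04")
print(len(all_days))      # 7 calendar days
print(len(trading_days))  # 5 weekdays; Saturday and Sunday are dropped

# Reindexing daily prices onto the calendar-day index exposes the gaps
# as missing values rather than silently skipping them.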
Note: The topic of cleaning time data is a much larger one than this guide can fully cover. At this time, the Library does not have a published research guide on date/time formatting. When such a guide is published, a link to it will be provided here.
Often used in the social sciences and other fields that rely on survey data, rating scales such as Likert scales provide numerical codes that correspond to categories along a continuum. While these are recorded as numbers, their values don't necessarily follow an even scale; on a four-point Likert scale, the "distance" between "agree strongly" and "agree somewhat" is subjective, and may not be equivalent to the distance, as the individual sees it, between "agree somewhat" and "disagree somewhat".
Because of this distinction, it is important to treat these variables with care when incorporating them into statistical models.
It is also important to note that there is a difference between rating scales that have a neutral option (neither agree nor disagree) and those that force a decision between slight agreement and slight disagreement; a neutral option is sometimes functionally equivalent to "did not answer" or "no data".
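One common way to keep the ordering explicit without implying equal spacing between levels is an ordered categorical type. Here is a brief sketch in Python with pandas (the response labels and values are illustrative):

import pandas as pd

# Illustrative four-point Likert responses stored as text.
responses = pd.Series([
    "agree strongly", "disagree somewhat", "agree somewhat", "agree strongly",
])

# An ordered categorical preserves the ranking without asserting that
# adjacent levels are evenly spaced.
levels = ["disagree strongly", "disagree somewhat", "agree somewhat", "agree strongly"]
ratings = pd.Categorical(responses, categories=levels, ordered=True)

print(ratings.codes)                 # [3 1 2 3]: integer codes in the order defined above
print(ratings.min(), ratings.max())  # disagree somewhat / agree strongly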