
Data Cleaning

An overview of categories of data cleaning, general techniques, and available platforms for data cleaning.

What are "Strings"?

From raw written text to discrete chunks of descriptive information in a table, text data can take many forms. Text data in a computer system typically takes the form of "strings" of characters, which are usually stored as arrays (sequences of discrete elements). Strings may have different encodings depending on the platform they're created in, how they're stored in memory, and what language they're in.

Common string encodings used in digital files include ASCII and UTF-8, each a different way of mapping characters to bytes. ASCII covers a small set of 128 characters, while UTF-8 can represent the full Unicode character set.
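As a minimal sketch of what "encoding" means in practice, Python distinguishes between a string of characters and the bytes used to store it. (The sample word here is arbitrary; any text with a non-ASCII character would illustrate the point.)

```python
# A string is a sequence of characters; an encoding turns it into bytes.
text = "café"

utf8_bytes = text.encode("utf-8")

print(len(text))        # 4 characters
print(len(utf8_bytes))  # 5 bytes: 'é' takes two bytes in UTF-8

# ASCII cannot represent 'é' at all:
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```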

String Manipulation

There are many tools in various data cleaning platforms that may be used to manipulate strings; each platform will have its own version of most of these tools.

For instance, no matter which platform you select, you will at some point have to replace characters in a string. Most tools for replacing characters can also be used to remove them entirely (by "replacing" them with nothing). Python's version of this tool is the .replace() string method. R has a similar function, gsub(). Stata's version is the replace command, which may be used conditionally with substr().
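Using Python's .replace() as the example, replacement and removal are the same operation; the sample value below is hypothetical, but the pattern applies to any string column.

```python
# Replacing characters with str.replace(); removing is replacing with "".
raw = "1,234,567"

no_commas = raw.replace(",", "")   # remove the commas entirely
dashed = raw.replace(",", "-")     # or substitute a different character

print(no_commas)  # 1234567
print(dashed)     # 1-234-567
```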

The sections below describe some more string manipulation tools and concepts:

Whitespace, Escape Characters, Case-Sensitivity

Text data can contain whitespace characters between other kinds of symbolic and alphanumeric characters. "Whitespace" typically describes spaces, tabs, and newline characters. In text strings in Python and R, tabs and newline characters are often represented by "\t" and "\n" respectively. The backslash ("\") character changes the behavior of whatever character appears after it, and is known as an "escape character". To include a literal backslash in a string, escape it with a second backslash ("\\").

When cleaning data, it is often necessary to remove or replace excess whitespace characters, such as when reading data from a table with column names that take up several lines.
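A small sketch of that scenario in Python: the column name below is a made-up example of a header that spans several lines, and chained string methods flatten it into a single tidy line.

```python
# Cleaning excess whitespace from a multi-line column name.
messy_header = "  Annual\nIncome\t(USD) "

# strip() removes leading/trailing whitespace; replace() handles the
# internal newline and tab characters.
clean = messy_header.strip().replace("\n", " ").replace("\t", " ")

print(clean)  # Annual Income (USD)
```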

Additionally, upper-case and lower-case letters are typically not equivalent in most programming environments. Depending on your research goal and the type of tool you are using, it may be beneficial for you to convert upper-case text to lower-case for the sake of uniformity. If you are coding survey responses and some respondents wrote "usa" while others wrote "USA" for country, you probably don't want those to be separate categories.
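The survey example above can be sketched in a few lines of Python; the response values are hypothetical.

```python
# Normalizing case so "usa" and "USA" collapse into one category.
responses = ["USA", "usa", "Usa", "Canada"]

normalized = [r.lower() for r in responses]

print(normalized)  # ['usa', 'usa', 'usa', 'canada']
# set(normalized) now contains just two distinct categories.
```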

On the other hand, if you are doing natural language processing (NLP), you may want to preserve capitalization in your input data, so you can capture semantic differences and contextual information. For instance, the helping verb "may" is not the same as the month "May". If you are training a language model, it can be beneficial to keep raw data raw.

Regular Expressions

Regular expressions (or regex for short) are sequences of characters that can be used to find segments of text that match specified characteristics. They are used in search engines and in-page/in-document searches.

Regex is not a programming language unto itself, but rather a syntax for pattern matching. The core syntax is largely consistent across implementations in different programming languages and software systems, though different "flavors" (such as POSIX and Perl-compatible regex) differ in some details.
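As a small illustration of how a pattern is assembled, here is a sketch in Python's re module (the sentence is an invented example): \d{4} matches exactly four digits, and \b marks a word boundary on either side.

```python
import re

# Find four-digit years in a sentence.
text = "The survey ran from 1998 to 2003."

years = re.findall(r"\b\d{4}\b", text)

print(years)  # ['1998', '2003']
```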

There are many free resources for learning regex patterns; you will not likely need to memorize them in order to use them, but you will need to understand how they are assembled.

Both regexr and regex101 offer live platforms to test out patterns on sample text; this can be a very effective means of developing new regex patterns for your use cases.

If you really want to test your understanding of regex syntax, you can do so with Regex Crossword Puzzles.

"Unstructured" Text Data/Natural Language Processing

Natural Language Processing (NLP) and Text Analysis are separate disciplines from what has been described here so far. These go beyond using text fields as categorical data and into the process of creating a dataset based on the text itself.

Although raw text data (usually) has an underlying grammatical structure, it is often referred to as "unstructured" data in that it does not adhere to a rigid structure that can be passed unaltered as an input to traditional statistical and predictive models. In recent years, unstructured data has become the domain of deep neural networks, such as transformer-based Large Language Models like ChatGPT, LLaMA, and Gemini.