Often when you use statistical programming environments or other statistical software, getting data into the platform of your choice can present challenges. Sometimes data files are formatted incorrectly, sometimes they are stored unintuitively, and sometimes they are not packaged in a way that makes them easily portable.
When you are working with data, whether it's tabular data in a file, data stored in a relational database, data hosted on a webpage, a nested JSON file, image data, audio data, or so-called "unstructured" text data, it is helpful to know how data is structured, how it is stored, and how the commands and syntax in your data cleaning platform may differ depending on its characteristics.
The most common type of data for data analysis/data science projects is tabular data (data in a table format).
While data may be stored in many different formats, including columnar data, relational databases (typically queried with SQL), nested data (such as a JSON file), or unstructured text, data is easier to work with when it's represented as the rows and columns of a table. Columns (also called "fields", "variables", "features", or "attributes" depending on context) represent the category of each kind of data in the dataset, and rows (also called "records", or in the context of relational databases, "tuples") represent data for uniquely identifiable entities within the set.
Broadly speaking, among possible data formats for storing the same kind of data, tables are the most intuitive format for human beings to understand. As such, most functions in statistical software tend to focus on manipulating tabular data.
Importing tabular data into the platform of your choice usually involves a few steps. Most often, for small projects, tabular data is stored as a .csv (Comma-Separated Values), .xlsx (Microsoft Excel Workbook), or .tsv (Tab-Separated Values) file. Data may also come in the form of Parquet files (a free and open-source format from Apache), .dta files (Stata's proprietary format), .sav files (SPSS's default format), .xpt or .sas7bdat files (SAS transport and default formats, respectively), and others.
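As a minimal sketch, assuming the pandas package in Python as the data-cleaning platform (the file names below are placeholders), each of these formats can be read directly into a DataFrame:

    import pandas as pd

    # Each reader returns a pandas DataFrame; the file names are placeholders.
    df_csv  = pd.read_csv("data.csv")                # comma-separated values
    df_tsv  = pd.read_csv("data.tsv", sep="\t")      # tab-separated values
    df_xlsx = pd.read_excel("data.xlsx")             # Excel workbook (requires openpyxl)
    df_dta  = pd.read_stata("data.dta")              # Stata
    df_sav  = pd.read_spss("data.sav")               # SPSS (requires pyreadstat)
    df_sas  = pd.read_sas("data.sas7bdat")           # SAS
    df_parq = pd.read_parquet("data.parquet")        # Parquet (requires pyarrow or fastparquet)

R users have close equivalents in the readr, readxl, and haven packages, and Stata users can use the import delimited and import excel commands.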
Occasionally, data will be stored in non-tabular formats, such as .json (JavaScript Object Notation), or relational databases (retrieved via SQL or other query languages). For these, you will need to take a different approach. (See "JSON - Extracting Nested Data" and "Relational Databases").
Here are some handy tutorials and documentation to get you started in your chosen data-cleaning platform:
Another common form of data storage is a relational database. These are more complex than tabular data, as the data stored in them is spread among multiple related tables rather than one central table. Most relational databases use Structured Query Language (SQL, often pronounced "sequel") as the means of retrieving and pre-packaging data in a form that can be parsed easily by humans and statistical programming environments.
There are many variations on SQL, as well as alternative query languages that use their own syntax.
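As an illustration, here is a minimal sketch, assuming a local SQLite database (the file path, table, and column names are hypothetical), of running a SQL query and loading the result into a pandas DataFrame:

    import sqlite3
    import pandas as pd

    # Connect to a local SQLite database file (the path is a placeholder).
    conn = sqlite3.connect("library.db")

    # A basic SQL query: select two columns from one table and filter the rows.
    query = "SELECT title, pub_year FROM books WHERE pub_year >= 2000"
    df = pd.read_sql_query(query, conn)

    conn.close()

Other database systems (PostgreSQL, MySQL, and so on) work the same way once a connection object has been created with an appropriate driver.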
Application Programming Interfaces (APIs) are another way to retrieve data from an external source, according to guidelines set out by a website's developers. APIs provide a way of accessing data stored on a website that usually offers more flexibility and specificity than browsing the pages hosted on the site.
Although "API" can mean any kind of application programming interface, the term often refers to Web APIs, which are APIs that deal with interactions between client devices and web servers. The most common form of these is what's called a REST (REpresentational State Transfer) API. Interactions with REST APIs are in the form of HTTP (HyperText Transfer Protocol) requests.
There are several categories of HTTP request; four common ones are GET, PUT, POST, and DELETE. PUT, POST, and DELETE all change data on the server, while GET retrieves data from it.
Responses to "Get" requests are typically given in the form of structured data, often in JSON or XML formats.
Many websites have their own APIs, usually with proprietary specifications that dictate the kinds of requests their servers will accept. Sometimes, websites will have distinct APIs for different services, with their own unique purpose and syntax.
Sometimes, data isn't available as a pre-packaged dataset like a table, database, or web API. In these cases, your best option may be to scrape the data yourself from a source on the web. Web scraping can provide an alternative to the methods listed above, but it's not guaranteed to work, it's not as reliable, and it almost certainly will involve more time spent cleaning and pre-processing the data collected.
There are a few caveats to remember - web scraping isn't inherently illegal, but it may be subject to legal restrictions based on the user agreement for a given website. Also, if one is not careful with the volume of requests, web scraping can cause problems for the website or for the user doing the scraping.
Please be sure to scrape responsibly!
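As a rough sketch of the idea, assuming Python with the requests and BeautifulSoup (bs4) packages and a made-up URL and page structure:

    import time
    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; check the site's terms of use and robots.txt before scraping.
    url = "https://www.example.com/catalog"

    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Assumes the items of interest sit in <h2> tags; adjust to the page's actual markup.
    titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

    time.sleep(1)   # be polite: pause between successive requests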
The link below is for a past Library workshop in which the basics of web scraping are outlined:
What happens when data isn't available in a tidy, tabular format? What if instead, it's stored in a tangled mass of curly and square braces? JavaScript Object Notation (JSON) is a very common format for data returned in API calls, as it's an efficient way of storing sparse data. JSON is meant to be consumed and processed by computers, not by humans.
JSON data consists of key:value pairs, with each data point (value) categorized by an identifier (key) and a colon separating the key from its value. This seems simple enough, but the values in this structure can themselves contain keys, creating multiple layers of data, with the lower layers not directly accessible from the surface layer. This is typically referred to as "nested data", like a nesting doll (matryoshka). Getting data out of a nested format and into a tabular format can be critical for using the data for statistical analysis.
This is an instance where fully-featured programming languages like Python and R far outshine other data cleaning platforms. Python's pandas package contains the .json_normalize() method, which flattens out nested data into a pandas DataFrame, automatically creating column names from the sequence of dictionary keys for each value at an end node in the JSON structure. R has a package called jsonlite that can help extract fields of data from JSON files. These tools are usually sufficient to get nested data into a tabular format, although there are exceptions.
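As a brief sketch (the nested records below are made up for illustration), pandas.json_normalize() turns a list of nested dictionaries into a flat DataFrame, joining nested keys into column names:

    import pandas as pd

    # A made-up nested structure of the kind an API might return.
    records = [
        {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}, "scores": {"math": 95}},
        {"id": 2, "name": {"first": "Alan", "last": "Turing"}, "scores": {"math": 88}},
    ]

    # Nested keys become dot-separated column names: id, name.first, name.last, scores.math
    df = pd.json_normalize(records)
    print(df.columns.tolist())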
If json_normalize or jsonlite aren't enough, Python and R also allow their users to write their own functions that can extract data in nested formats, using a concept called recursion.
Recursive functions can call themselves as subroutines when certain conditions arise. In this case, a recursive function that gets the values out of a set of key:value pairs would run itself again on any values in that structure that it determines to also contain key:value pairs, and so on. This is not an intuitive concept, but it can save enormous amounts of time that would be spent manually extracting data from these types of files.
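Below is one possible sketch of such a function in Python; it flattens an arbitrarily nested dictionary into a single level by joining keys, calling itself whenever a value turns out to contain more key:value pairs:

    def flatten(nested, parent_key="", sep="."):
        """Recursively flatten a nested dictionary into a single-level dictionary."""
        flat = {}
        for key, value in nested.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                # The value holds more key:value pairs, so the function calls itself.
                flat.update(flatten(value, new_key, sep=sep))
            else:
                flat[new_key] = value
        return flat

    example = {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}}
    print(flatten(example))   # {'id': 1, 'name.first': 'Ada', 'name.last': 'Lovelace'}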
Portable Document Format files (PDFs) can contain words and/or images. PDFs are handy as human-readable documents, but they are often not formatted such that they are easily machine-readable, at least not the way that spreadsheets or JSON objects are. Some PDF files include embedded text data, which can be accessed using tools like the Python package pdfminer.six, but the inclusion of this data is not guaranteed: often, the text in a PDF file exists only as images of words, rather than as strings of characters (see Clean Your Text Data - What Are "Strings"?).
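As a quick sketch (the file name is a placeholder), pdfminer.six exposes a high-level function that pulls any embedded text out of a PDF:

    from pdfminer.high_level import extract_text

    # Returns an empty (or nearly empty) string if the PDF contains only images of text.
    text = extract_text("report.pdf")
    print(text[:500])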
For cases where there is no text data embedded in the file, Optical Character Recognition (OCR) might offer an alternative. "OCR" describes a range of techniques used to convert images into text data. There are many software programs available for OCR. Performance among OCR software varies widely, as does support for different languages and character systems, and the ability to recognize handwritten text.
ABBYY FineReader PDF is an excellent tool for OCR, but it requires a paid license. If you are looking for a free, open-source alternative, the pytesseract module allows you to run OCR via a Python script, but it is much less fully-featured.
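A minimal sketch with pytesseract (the image file name is a placeholder, and the Tesseract OCR engine itself must be installed separately):

    from PIL import Image
    import pytesseract

    # Run OCR on a scanned page that has been saved as an image file.
    page = Image.open("scanned_page.png")
    text = pytesseract.image_to_string(page)
    print(text)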
Once you have gotten data into your platform of choice, you may find that you do not need all of it. Frequently, datasets contain more data fields than are necessary for a given data analysis/data science project. Dropping extraneous data columns/fields/variables from the working dataset can make it less cumbersome to work with.
Python (the pandas .drop() method or column selection), R (the select() function in dplyr), and Stata (the drop and keep commands) all have functions for this operation.
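For instance, a sketch in pandas (the column names are made up):

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2, 3],
        "age": [34, 29, 41],
        "notes": ["a", "b", "c"],   # an extraneous field for this example
    })

    # Option 1: drop the unwanted column by name.
    trimmed = df.drop(columns=["notes"])

    # Option 2: keep only the columns you need.
    trimmed = df[["id", "age"]]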
Selecting a subset of variables in a dataset isn't the only way data can be narrowed down; filtering rows of a dataset based on whether they match given criteria can also be important.
Python's pandas package and R's dplyr package offer similar ways of doing this, with some syntactical differences.
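As a brief sketch of row filtering in pandas (the column name and threshold are made up); dplyr's filter() plays the same role in R:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "age": [34, 29, 41]})

    # Keep only the rows that meet a condition (boolean indexing).
    over_30 = df[df["age"] > 30]

    # The same filter expressed with the query() method.
    over_30 = df.query("age > 30")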