Finding the right tool for the job is an important part of any research project. Depending on your research discipline, the scale and specific qualities of your dataset, how much time you can invest in learning a new skill, and whether a service is free or requires payment, different options may be better suited to your particular data-cleaning needs.
For instance, if you are studying Econometrics, Stata might be your first choice, since it's popular among researchers in that field. However, if you do not have access to a Stata license through your institution, or if you have time to practice coding and prefer having more control over your tables, visualizations, and regression analysis, R may be more appealing as a free, open-source alternative. If, on the other hand, you're working with an enormous cloud-hosted dataset that's too big for your computer to keep in RAM, streaming the data for analysis in Python using the Dask package may end up being your best option.
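For a sense of what that last approach looks like in practice, here is a minimal sketch using Dask's dataframe interface (the file pattern and column name below are hypothetical):
import dask.dataframe as dd
# Read a collection of .csv files lazily, without loading everything into RAM
df = dd.read_csv("data/records-*.csv")
# Work is deferred and done in chunks; .compute() triggers the actual calculation
mean_value = df["value"].mean().compute()
print(mean_value)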
On the other end of the scale, if you're working on a small project with categorical data and you don't have time to learn a whole new programming language, OpenRefine may be the solution you need.
Some of the tools described below are comparable; R and Python are very close in terms of their functionality. However, these tools aren't always analogues. Stata can do some of what R and Python can do, but has a more limited scope, which can prove both detrimental and beneficial depending on your needs. OpenRefine is intended for data cleaning, but not necessarily for the in-depth data analysis that usually follows. Spreadsheet applications are designed with accounting in mind, not necessarily data science.
Of the tools listed below, all are suitable on some level for data cleaning, but not all are usable as data analysis tools. You might choose to do data cleaning using one tool and analysis with another, depending on your level of comfort and the needs of your project.
Professors will often require all students in a class to use the same statistical software. If you are already familiar with one platform but must use another for a specific project, this guide is meant to prepare you for the unfamiliar elements of the new platform. If the choice of platform hasn't been made for you, this guide is intended to help inform your decision.
Python is a general-purpose programming language. It wasn't initially designed for data analysis and data science, but thanks to its flexibility and high degree of extensibility, many data science practitioners prefer it. Among the options listed here, Python is generally the strongest choice for implementing deep neural networks, for automation, and for web hosting.
Python code is organized into files with the ".py" extension. The default suite of tools that comes with a Python installation is often referred to as "base" Python. Users can add their own .py files containing code that extends Python's capabilities; on their own, these files are referred to as "modules". Collections of modules are called "packages", and collections of packages are sometimes called "libraries", although the three terms are often used interchangeably. The extensibility this modular system provides is one of Python's most important strengths.
The "pandas" package is the most commonly-used tool in Python for handling tabular data, and Python users have a wide range of tools available for visualization.
Python's learning curve is steeper than that of most of the other data analysis platforms on this page, but it is still very accessible compared with most other programming languages.
Jupyter Notebook and JupyterLab (the latter of which is pictured below) provide clean interfaces for presenting both Python code and explanatory text in the same document.
Pros:
Free, open-source, usable for general-purpose programming, extensible, internally consistent, cross-platform, robust community support.
Cons:
Slower to learn than some other tools; advanced use requires understanding package dependencies and interactions between different versions of modules, and possibly a deeper understanding of computer science principles.
R is a programming language designed for statistical computing, data analysis, and data visualization. Although they are different in many ways, R and Python have a large overlap in their capabilities with respect to these applications, particularly when considering their respective suites of third-party add-ons. Like Python, R lets users add code files to extend its functionality. In R, these are typically referred to as "packages."
R's tidyverse packages (including dplyr, readr, tidyr, tibble, ggplot2, and others) provide a streamlined approach to cleaning and analyzing data. In particular, ggplot2 is one of the most intuitive tools available for quickly generating elegant data visualizations.
RStudio is a fully featured IDE (Integrated Development Environment) that gives users a great degree of control over its settings and over their R coding workflow.
Pros:
Free, extensible, open-source, purpose-built for statistical programming, robust community support (not quite as large as Python's community, but comparable), streamlined workflow, great for creating visualizations of data.
Cons:
Lower internal consistency, which may be confusing for those already familiar with other programming languages; less suitable than Python for general-purpose programming.
Stata (sometimes written as STATA) is a proprietary data analysis software program. Unlike Python and R, which are both free and open-source software (FOSS), Stata is designed, developed, and updated by StataCorp LLC. Stata is far less extensible than either of those languages, offering very little in the way of customizing functions, but it does provide an out-of-the-box solution for many statistical analysis applications, particularly generating tables for linear regression. Stata's suite of tools is somewhat geared toward applications in econometrics. Compared to R and Python, Stata has a number of limitations; one major difference is its inability to access multiple datasets simultaneously without using a workaround.
Pros:
Simple workflow, consistent command structure, purpose-built for statistical programming, de facto industry standard in certain disciplines such as Econometrics, troubleshooting support provided on forums hosted by StataCorp.
Cons:
Closed-source, requires a paid license, less community-driven support than R or Python, and only one dataset can be loaded at a time.
OpenRefine is a standalone software program for cleaning data. Formerly known as Google Refine, OpenRefine is a free, open-source tool specifically intended for cleaning data as part of a data analytics/data science workflow. OpenRefine is not a tool for extensive statistical modeling, but it is designed with ease of use in mind for those who wish to clean data. It is particularly useful for reorganizing categorical data by clustering similar values (for instance, country of residence listed as "USA", "U.S.A.", "usa", and "U.S of A" in survey responses). OpenRefine's user interface maintains a step-by-step, fully reversible record of the actions taken during data cleaning.
Pros:
Easy to use; runs locally (not on a cloud service), so you maintain data privacy; provides an explicit record of the cleaning process.
Cons:
Not as fully featured as the other options; not sufficient on its own for full-fledged data analysis.
Microsoft Excel (pictured below), Google Sheets, Apple Numbers, and LibreOffice Calc are all examples of spreadsheet applications. When it comes to data manipulation, they are powerful tools, but they may require substantially more effort to accomplish the same tasks that are possible in Python, R, or Stata. They can be used to clean data in a similar fashion to OpenRefine, but they are generally more limited out of the box and do not preserve a timeline of changes to your data as OpenRefine does.
Pros:
Ubiquitous; free or inexpensive, and paid licenses are often easy to obtain through institutions.
Cons:
No record of changes made to the data; computationally inefficient compared to other platforms; formulae are written in individual cells rather than in a block of code, which can make editing more difficult and reviewing the workflow as a whole less transparent.
Julia is another high-level, general-purpose programming language. Unlike Python and R, whose roots both lie in the early 1990s, Julia was first released in 2012. Julia was designed with data science in mind and is generally faster than Python or R (although this depends on the particular implementation of the code). However, Julia has a much smaller development community than either Python or R, so it may not be the best language for novice programmers to learn, since they may need to spend more time troubleshooting on their own.
SPSS and SAS are closed-source statistical software suites that require a paid license. As with Stata, SAS (Statistical Analysis System) and SPSS (Statistical Package for the Social Sciences) are purpose-built software programs for statistical analysis, and they lack the inherent extensibility of free, open-source software like R and Python.
Please note: The Claremont Colleges Library does not currently offer support for SAS, but it is used by some departments at the Claremont Colleges.
As with natural languages, programming languages have their own syntax. In a natural language, it is often possible to use incorrect syntax and still be understood; that is not the case in a programming language. Syntax dictates how the elements of the language must be arranged in order for the code to function correctly. Python, R, and Stata each have their own distinct syntax and default behavior.
To illustrate this, let's take a look at how each platform handles importing a dataset stored in a .csv file. We will highlight key differences (as well as some similarities) in syntax between these three platforms.
Python, R, and Stata all handle data in the form of tables called "dataframes" (or "DataFrames" in Python's pandas package). Dataframes are organized like spreadsheets, but these three platforms offer more powerful tools for filtering, manipulating, and transforming data than traditional spreadsheet applications do.
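As a brief illustration of the kind of filtering and transformation this makes possible (a minimal pandas sketch; the file name and column names are hypothetical), compare the following to performing the same steps by hand in a spreadsheet:
import pandas as pd
# Read a table from a .csv file into a DataFrame (hypothetical file and columns)
df = pd.read_csv("survey.csv")
# Keep only the rows where the "score" column exceeds 80
high_scores = df[df["score"] > 80].copy()
# Add a derived column expressing the score as a proportion
high_scores["score_prop"] = high_scores["score"] / 100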
Python's syntax is highly internally consistent, and fairly explicit.
"Base" Python (Python's default settings on launch) does not include a function to import a .csv file by default, so Python users must first import the pandas module in order to gain access to its "read_csv" function. Typically, pandas is abbreviated using "pd" as an alias.
To access the function, Python users type the name of the package, followed by a dot, followed by the name of the function (functions attached to particular objects are also called "methods" and are accessed with the same dot notation). The argument for the function, in this case the file name, is passed as a string in parentheses directly following the function name. To keep the result in memory, the function's output must be stored in a variable using the "=" assignment operator.
import pandas as pd
df = pd.read_csv("filename.csv", header=0)
Then, to access a column in that .csv (which is now a "DataFrame" object once it's been imported in Python), Python users must type the name of the variable containing the DataFrame, followed by the name of the column as a string (in either double or single quotes, provided that they match) surrounded by square brackets.
df["column_name"] (or df['column_name'])
This syntax mirrors the way lists and dictionaries are indexed in Python, using square brackets:
To retrieve a value stored in a dictionary, enter the key:
dictionary['key']
To retrieve an element of a list, enter its index (numbering starts at 0):
list_object[0]
R's syntax is less internally consistent than Python's, but also fairly explicit. R also uses parentheses to clearly define where user input goes. As in Python, single or double quotes must be used around strings.
R's typical assignment operator (outside of a function's arguments) is "<-". Inside a function, keyword arguments can be passed using "=":
df <- read.csv("filename.csv", header=TRUE)
OR
library(readr)
df <- read_csv("filename.csv", col_names=TRUE)
Once the .csv has been read into R as a dataframe (note that the term "dataframe" is not capitalized as it is in Python), a column may be accessed by typing the name of the dataframe followed by "$" followed by the name of the column. Because column names are not surrounded by quotation marks or brackets, they may not contain spaces.
df$column_name
In contrast, to access a single element of a list in R, double square brackets must be placed around the index of the element, although unlike in Python, indexing in R starts at 1, not zero.
list_object[[1]]
For R users who want a more advanced understanding: under the hood, R's syntax is less consistent than Python's because R includes multiple object-oriented class systems, each with its own syntax. Some R objects are built using Reference Classes, while others are instantiated as S3 or S4 objects.
Stata's syntax is mostly internally consistent, but it is also inflexible. Unlike R and Python, Stata does not generally use parentheses to pass arguments to functions (which are called commands in Stata). Rather, in most cases, arguments for functions (in Stata terminology, options for commands) are written after a comma following the command and the object it will affect. Double quotes around the file path are optional unless the path contains spaces. File extensions are also optional.
The dot preceding each command is the prompt that Stata displays in the Results window when interacting "live" with Stata; it is shown here by convention. When writing commands in a .do file (Stata's file format for scripting), the dot should not be typed.
Both of the following commands will work when importing a file named "filename.csv":
. import delimited "filename.csv"
. import delimited filename
It is important to note that Stata's default file type for storing data is ".dta", and .dta files are imported with an entirely different command. For "filename.dta":
. use filename
In Stata, to explore a column (in Stata terminology, a variable) of a dataset, you typically refer to it by name as part of a command or open it in the Data Editor. There is no standalone equivalent to pandas's df["column_name"] or R's df$column_name.
Stata's simplicity and ease of use come at the expense of flexibility: only one dataset may be loaded at a time, and any operation that depends on having access to more than one dataset requires the user to first merge them, possibly after exporting them in .dta format and re-importing them with ". use".