
Text and Data Mining

A brief guide to tools and resources (including datasets) for getting started with computational approaches to textual analysis.

Definitions

Text and data mining (TDM) is an umbrella term for a broad array of methods, tools, and approaches to scholarship that apply computational methods to large bodies of (often unstructured) text.

TDM approaches are increasingly popular in an era of large-scale digitization efforts. The rapid growth of digital publication platforms and social media has created massive corpora of textual data that are often easily accessible to anyone with internet access.

Researchers in diverse fields use TDM to gain insights across long periods of time or vast collections, at a scale that would be nearly impossible to cover with traditional reading and examination. Quantifying elements of the text, and developing analytical tools to count and visualize terms of interest, has opened up large new bodies of scholarship in fields such as the digital humanities.
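As a rough illustration of what "counting terms of interest" can look like in practice, here is a minimal Python sketch. It is only one simple way to do it, not a recommended workflow: the function name term_counts, the tokenizing rule, and the two-sentence corpus are all made up for this example, and a real project would load texts you have the rights to analyze.

```python
import re
from collections import Counter


def term_counts(texts, terms):
    """Count how often each term of interest appears across a list of texts."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for text in texts:
        # Lowercase the text and split it into simple word tokens
        # (runs of letters and apostrophes). This is an assumed, very
        # basic tokenizer; real projects often need something richer.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tok for tok in tokens if tok in wanted)
    return counts


# Hypothetical two-document corpus, invented purely for illustration.
corpus = [
    "The library opened a new reading room.",
    "Reading in the library is quiet; the library closes at nine.",
]

counts = term_counts(corpus, ["library", "reading"])
for term, n in counts.most_common():
    print(term, n)
```

The resulting counts could then be handed to a charting tool to visualize how often each term appears, for example by year or by source.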

This guide is intended to help you get started identifying some basic tools and methods to approach a research question using TDM.

Cautions, critiques, and questions

Since TDM involves manipulations of published text that may be protected by copyright, it's important to make sure you understand the legal restrictions around the data you are working with before getting started. This can often be a complicated question. We've included some good resources here, and are happy to meet with you individually to consult on your particular project (though we cannot offer legal advice).

Note that the terms of use agreements governing the databases the library subscribes to (anything you need to log in to with your Claremont credentials) do NOT allow automated retrieval methods such as scraping. If you are interested in downloading large numbers of articles from one of those databases, reach out to us and we can look for other options with you.

 

Data Science & Digital Scholarship Coordinator

Jeanine Finn
Contact:
Digital Strategies & Scholarship Division
Honnold, 3rd floor
The Claremont Colleges Library
909-607-7958