Text & Data Mining

A brief guide to tools and resources (including datasets) for getting started with computational approaches to textual analysis.

Finding your data

Researchers applying text and data mining methods may have a well-established corpus of textual data they are already working with, or they may want to apply techniques to a variety of different data sources in a more exploratory way.

We've identified some data sources here that are easy to get started with and that offer a lot of potential for research questions across disciplines.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) provides a platform and training resources to support computational analysis of works in the HathiTrust Digital Library (HTDL) for educational purposes.
The Claremont Colleges Library is a member of the HathiTrust collaborative, which means all Claremont-affiliated users have full access to its collections.

HathiTrust includes digitized materials from member libraries across the globe: over 18 million monograph titles, as well as many types of periodicals, manuscripts, and government documents.

It's important to note that some of the materials in the HTDL are still protected by copyright, so there may be some limitations on their use for text and data mining applications. Please note and abide by the restrictions that you are advised of when working with HathiTrust.

There are many training resources to help you get familiar with working with the collection.

Text mining of library databases

The terms of use agreements for most of our databases do not allow for text and data mining applications through their usual interface (the one you log into via the library website).

In some cases, the vendor has provided other ways to access large-scale datasets of their content. Please reach out to us to talk about your options before you get started.

Social media

Newspapers and journals

Other text archives for TDM