Research Guides: Text & Data Mining: Finding Data Sources

Finding your data

Researchers applying text and data mining methods may have a well-established corpus of textual data they are already working with, or they may want to apply techniques to a variety of different data sources in a more exploratory way.

We've identified some data sources here that are easy to get started with, and offer a lot of potential for a variety of disciplinary research questions.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) provides a platform and training resources to support computational analysis of works in the HathiTrust Digital Library (HTDL) for educational purposes.

All Claremont Colleges-affiliated users have full access to HathiTrust's collections because The Claremont Colleges Library is a member of the HathiTrust collaborative.

HathiTrust holds digitized materials from member libraries across the globe, including over 18 million monograph titles as well as many types of periodicals, manuscripts, and government documents.

Some of the materials in the HTDL are still protected by copyright, so there may be limitations on their use for text and data mining. Please note and abide by any restrictions you are advised of when working with HathiTrust.

There are many training resources to help you get familiar with working with the collection.

A Guide to the HathiTrust Research Center (Illinois Library)
Excellent intro to the tools and services at HTRC.
Text Mining with HathiTrust: Empowering Librarians to Support Digital Scholarship Research
Slides and training materials from a 2019 "train the trainer" symposium.
Text Mining Historical Primary Sources in HathiTrust's HTRC Analytics
February 2021 YouTube video from Doing History with C.S.Robinson gives a solid overview of getting started with HathiTrust Analytics.

Text mining The Claremont Colleges Library databases

The terms of use agreements for most of our databases do not allow for text and data mining applications through their usual interface (the one you log into via the library website).

In some cases, the vendor provides other ways to access large-scale datasets of its content. Please reach out to us to talk about your options before you get started.

Newspapers and journals

Chronicling America (LoC)
Digitized U.S. newspapers from 1777-1963. Text mining supported via API access.
New York Times Developer Portal
The New York Times offers access to a large portion of its content via an API. You will need to set up an account and have some scripting skills (such as with Python) to work with this data.
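
To give a sense of the scripting involved, here is a minimal Python sketch for the New York Times Article Search API. The endpoint URL, the api-key parameter, and the response field names (response, docs, headline, pub_date, web_url, abstract) reflect our understanding of that API and should be treated as assumptions to confirm in the Developer Portal documentation. The function names are our own, chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed Article Search endpoint; confirm it in the NYT Developer Portal.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build the request URL for one page of search results."""
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def extract_articles(payload):
    """Reduce a search response to a list of small dictionaries.

    The field names below are assumptions about the response layout;
    missing fields fall back to empty strings.
    """
    docs = (payload.get("response") or {}).get("docs") or []
    return [
        {
            "headline": (doc.get("headline") or {}).get("main", ""),
            "date": doc.get("pub_date", ""),
            "url": doc.get("web_url", ""),
            "text": doc.get("abstract") or doc.get("snippet", ""),
        }
        for doc in docs
    ]
```

With your own key in hand, a call such as extract_articles(fetch_json(build_search_url("water rights", "YOUR_KEY"))) would return one page of results. Check the portal's rate limits and terms of use before looping over many pages.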

Other text archives for TDM

Digital Public Library of America
This collaborative project includes digitized materials from hundreds of contributing libraries.
Text mining: request an API key.
Internet Archive Wayback Machine
Captures of websites going back to the 1990s.
Text mining: via API.
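
As an illustration of API access to the Wayback Machine, the sketch below looks up the archived capture closest to a given date. It assumes the Internet Archive's availability endpoint at archive.org/wayback/available and a JSON response containing archived_snapshots and closest. Please verify both against the Internet Archive's current documentation. The function names are our own.

```python
import json
import urllib.parse
import urllib.request

# Assumed availability endpoint; no API key is expected for this lookup.
AVAILABILITY_URL = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build a lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def closest_snapshot(payload):
    """Return the closest capture's URL and timestamp, or None if absent."""
    closest = (payload.get("archived_snapshots") or {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {
        "url": closest.get("url", ""),
        "timestamp": closest.get("timestamp", ""),
    }
```

For example, closest_snapshot(fetch_json(build_availability_url("example.com", "20050101"))) would return the capture nearest to January 1, 2005, if one exists. The snapshot URL can then be downloaded and its text extracted for analysis.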