Text and data mining (TDM) is an umbrella term for a broad array of methods, tools, and approaches to scholarship that apply computational methods to large bodies of (often unstructured) text.
TDM approaches are increasingly popular in an era of large-scale digitization efforts. The rapid growth of digital publication platforms and social media has created massive corpora of textual data that are often easily accessible to anyone with internet access.
Researchers in diverse fields use TDM to gain insights across long periods of time or vast collections, at a scale that would be nearly impossible to achieve through traditional reading and close examination. Quantifying elements of text, and developing analytical tools to count and visualize terms of interest, has opened up large new bodies of scholarship in fields such as the digital humanities.
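To make the idea of counting terms of interest concrete, here is a minimal sketch in Python that uses only the standard library. The function names (tokenize, term_counts_by_group), the toy corpus, and the chosen terms are all hypothetical and exist only for illustration. A real project would need a properly sourced corpus and more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_group(documents, terms):
    """Count how often each term of interest appears in each group of texts.

    documents: iterable of (group_label, text) pairs, e.g. (year, text)
    terms: iterable of lowercase terms to track
    """
    wanted = set(terms)
    counts = {}
    for label, text in documents:
        group = counts.setdefault(label, Counter())
        group.update(tok for tok in tokenize(text) if tok in wanted)
    return counts


# Hypothetical toy corpus: (year, text) pairs invented for this example.
sample_corpus = [
    (1900, "The railway reached the town. The railway changed trade."),
    (1900, "A telegraph office opened near the railway station."),
    (1950, "Television arrived, and the railway station closed."),
]

counts = term_counts_by_group(
    sample_corpus, ["railway", "telegraph", "television"]
)
for year in sorted(counts):
    print(year, dict(counts[year]))
```

Grouping counts by a label such as publication year is one simple way to compare term usage across time. The same structure would work for grouping by author, source, or genre.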
This guide is intended to help you get started identifying some basic tools and methods to approach a research question using TDM.
Cautions, critiques, and questions
Because TDM involves manipulating published text that may be protected by copyright, it's important to understand the legal restrictions on the data you are working with before getting started. This can often be a complicated question. We've included some good resources here, and are happy to meet with you individually to consult on your particular project (though we cannot offer legal advice).