Research Guides: Text & Data Mining: Understanding Text and Data Mining

Welcome

This guide is intended to help you get started identifying some basic tools and methods to approach a research question using text and data mining (TDM). For further guidance on using TDM, please reach out to us.

Definitions

Text and data mining (TDM) is an umbrella term for a broad array of methods, tools, and approaches to scholarship that apply computational methods to large bodies of (often unstructured) text.

TDM approaches are increasingly popular in an era of large-scale digitization efforts. The rapid growth of digital publishing platforms and social media has created massive corpora of textual data that are often easily accessible to anyone with internet access.

Researchers in diverse fields use TDM to gain insights across long periods of time or vast collections, at a scale that would be nearly impossible to cover through traditional close reading. Quantifying elements of the text, and developing analytical tools to count and visualize terms of interest, has opened up huge new bodies of scholarship in fields such as the digital humanities.
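As a rough illustration of what "counting terms of interest" can look like in practice, here is a minimal Python sketch. It is only one of many possible approaches, not a recommended workflow. The function name term_counts, the two-document corpus, and the search terms are all made up for this example; a real project would work with a much larger collection and would likely need more careful tokenization.

```python
import re
from collections import Counter


def term_counts(text, terms):
    """Count how often each term of interest appears in text (case-insensitive)."""
    # Very simple tokenizer: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {term: counts[term.lower()] for term in terms}


# A tiny invented corpus keyed by year, standing in for a real collection.
corpus = {
    1900: "The railway reached the town. The railway changed trade.",
    1950: "The highway replaced the railway for most trade.",
}

for year, text in sorted(corpus.items()):
    print(year, term_counts(text, ["railway", "highway"]))
```

Comparing the counts from one year to the next is the kind of simple quantification that can then be charted or explored further with more specialized tools.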

Cautions, critiques, and questions

Since TDM involves manipulations of published text that may be protected by copyright, it's important to make sure you understand the legal restrictions around the data you are working with before getting started. This can often be a complicated question. We've included some good resources here, and are happy to meet with you individually to consult on your particular project (though we cannot offer legal advice).

Please note that the terms of use agreements that govern Library databases (anything you log in to with your Claremont credentials) do NOT allow automated retrieval methods such as scraping. If you are interested in downloading large numbers of articles from one of those databases, reach out to us to talk about your options before you get started.

Head, Data and Digital Scholarship Services

Jeanine Finn

she/her


Contact:

Distinctive Collections and Digital Scholarship, The Claremont Colleges Library

909 607-7958

Subjects: Data & Statistics, Digital Scholarship, GIS: Geographic Information System