Digital Text Analysis Workshop

Ben Schmidt: AHA, January 2, 2015

Online notes:

Why Digital Text Analysis?

Selecting and getting to know a corpus.

Contributing through the texts: Women Writers Project

Bill Lane Center for the American West:

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

Co-citation Networks

More data from:

JSTOR Data for Research:

Ngrams: New Archaicism

Ngrams: Presses

Ngrams: 3 common words



Ngrams: 3 2044

Google's partner libraries

Major Sources of Digital Texts

Defining Units of Analysis
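
One way to make the choice of unit concrete: the same sentence can be counted as single words or as overlapping word pairs. The following is a minimal Python sketch, not a recommended pipeline. The function names `tokenize` and `ngrams` and the sample sentence are hypothetical, and the regular expression is a deliberately crude assumption about what counts as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample sentence, for illustration only.
sample = "The state of the union is the state of the nation."
tokens = tokenize(sample)

# The same text, counted with two different units of analysis.
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Changing `n`, or swapping `tokenize` for a splitter that works on sentences or paragraphs, changes the unit without changing the counting step.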

Optical Character Recognition

Ted Underwood, University of Illinois

Word Clouds

History Dissertations, Robert Townsend

Named Entity Recognition:

Geoparsing and geotagging

Matt Wilkens, Notre Dame: "The distribution of US city-level locations, revealing a preponderance of literary–geographic occurrences in what we would now call the Northeast corridor."

Not just tokens

Texts are "multiply addressable" (Michael Witmore)

Use metadata!

Algorithms for insight

State of the Union Comparisons

Naive Bayes
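
As a rough illustration of the idea, a Naive Bayes classifier scores a text under each label by adding up the log probabilities of its words and picks the label with the highest score. The following is a minimal sketch in plain Python with add-one smoothing. The function names, the labels "early" and "late", and the tiny training lists are all invented for illustration; they are not the workshop's data or code.

```python
import math
from collections import Counter


def train(docs_by_label):
    """Count words per label; docs_by_label maps a label to a list of token lists."""
    counts = {label: Counter(tok for doc in docs for tok in doc)
              for label, docs in docs_by_label.items()}
    vocab = set().union(*counts.values())
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    priors = {label: len(docs) / total_docs
              for label, docs in docs_by_label.items()}
    return counts, vocab, priors


def log_score(tokens, label, counts, vocab, priors):
    """Log prior plus summed log word likelihoods, with add-one smoothing."""
    total = sum(counts[label].values())
    score = math.log(priors[label])
    for tok in tokens:
        if tok in vocab:  # words never seen in training are skipped
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
    return score


def classify(tokens, counts, vocab, priors):
    """Return the label with the highest log score."""
    return max(counts,
               key=lambda label: log_score(tokens, label, counts, vocab, priors))


# Invented toy data: two hypothetical groups of already-tokenized texts.
training = {
    "early": [["congress", "treaty", "commerce"], ["treaty", "navy", "commerce"]],
    "late": [["jobs", "economy", "america"], ["america", "jobs", "families"]],
}
counts, vocab, priors = train(training)
print(classify(["treaty", "commerce", "jobs"], counts, vocab, priors))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appears under one label from forcing that label's probability to zero.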

Even better: Logistic regression

Principal Components Analysis: Library of Congress Classifications

Topic Modeling

David Blei, Princeton University


Topic Modeling literary scholarship by decade

Topic Modeling Television shows by screen time

Macroanalysis: Networks of Topics

Matthew Jockers, University of Nebraska

Go-to software packages:

Go-to software for text analysis.

  • With just a web browser:
    • Cut and paste into an online environment: Voyant:
  • Following some arcane instructions:
    • Topic modeling and machine learning: MALLET
    • Network Analysis: Gephi
    • Data visualization: Tableau?
    • Tutorials at
  • Some programming required:
    • Cleaning and processing .txt files: Python
    • Statistical analysis: The "R" Language
    • Data visualization: R or D3 (JavaScript, online)
    • Sharing libraries online: Bookworm

(reminder: this list is at
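
For the "Cleaning and processing .txt files: Python" item in the list above, a minimal sketch of what that step can look like. The function names `clean_text` and `word_counts` and the sample string are hypothetical. The sketch assumes English text in unaccented letters, so it throws away digits and accented characters, and its hyphen rule will also merge genuine compounds that happen to break across lines.

```python
import re
from collections import Counter


def clean_text(raw):
    """Normalize plain text before counting words."""
    # Rejoin words hyphenated across line breaks, as often found in OCR output.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def word_counts(raw):
    """Count the words that remain after cleaning."""
    return Counter(clean_text(raw).split())


# Invented sample string, for illustration only.
raw = "The Repub-\nlic of Letters, 1700: letters & more LETTERS."
print(clean_text(raw))
print(word_counts(raw).most_common(2))
```

To run this on a file rather than a string, the text could be read first with `pathlib.Path("my_file.txt").read_text(encoding="utf-8")`, where `my_file.txt` stands in for whatever file is at hand.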

The Open Questions

  • Do you (or does someone else) need to program?
  • Do you need more texts than you can read?
  • If this is interdisciplinary, what discipline?