Digital Text Analysis Workshop

Ben Schmidt: AHA, January 2, 2015

Online notes: benschmidt.org/AHA.pdf

Why Digital Text Analysis?

Selecting and getting to know a corpus.

corpus.byu.edu

Contributing through the texts: Women Writers Project

Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

Co-citation Networks

More data from:

JSTOR Data for Research: dfr.jstor.org

Ngrams: New Archaicism

Ngrams: Presses

Ngrams: 3 common words

02138

Ngrams: 3 2044

Google's partner libraries

Major Sources of Digital Texts

Defining Units of Analysis

Optical Character Recognition

Ted Underwood, University of Illinois

Word Clouds

History Dissertations, Robert Townsend

Named Entity Recognition: http://nlp.stanford.edu:8080/ner/process

Geoparsing and geotagging

Matt Wilkens, Notre Dame: "The distribution of US city-level locations, revealing a preponderance of literary–geographic occurrences in what we would now call the Northeast corridor."

Not just tokens

Texts are "multiply addressable" (Michael Witmore)

Use metadata!
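One way to make "use metadata" concrete is to compute a word's share of all tokens within each value of a metadata field, such as year. The sketch below is only an illustration: the function name `word_share_by_field`, the record layout, and the three sample records are invented for this example and are not drawn from any corpus or tool named in these notes.

```python
import re
from collections import Counter

def word_share_by_field(records, word, field):
    """Return {field value: share of all tokens that equal `word`}."""
    hits = Counter()    # occurrences of `word` per field value
    totals = Counter()  # total tokens per field value
    for record in records:
        tokens = re.findall(r"[a-z]+", record["text"].lower())
        key = record[field]
        totals[key] += len(tokens)
        hits[key] += tokens.count(word)
    return {key: hits[key] / totals[key] for key in totals if totals[key]}

# Hypothetical records: each text carries a "year" metadata field.
records = [
    {"year": 1850, "text": "The railroad reached the town"},
    {"year": 1850, "text": "A new canal opened"},
    {"year": 1900, "text": "The railroad and the telephone"},
]
shares = word_share_by_field(records, "railroad", "year")
```

Dividing by the total tokens in each group, rather than reporting raw counts, keeps years with more surviving text from dominating the comparison.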

http://sandbox.htrc.illinois.edu/bookworm/

Algorithms for insight

State of the Union Comparisons

Naive Bayes
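As a rough illustration of what a Naive Bayes text classifier does, here is a minimal multinomial version with add-one smoothing, written with only the Python standard library. The class labels and training sentences are toy data invented for this sketch, not State of the Union text, and the function names (`tokenize`, `train`, `classify`) are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_texts):
    """Count words per class and documents per class from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the class with the highest log posterior (add-one smoothing)."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
training = [
    ("economy", "taxes budget deficit jobs"),
    ("economy", "jobs wages budget trade"),
    ("war", "troops enemy navy peace"),
    ("war", "army troops treaty enemy"),
]
word_counts, doc_counts = train(training)
prediction = classify("the budget and jobs", word_counts, doc_counts)
```

The "naive" part is the assumption that each word contributes independently to the score, which is why the per-word log probabilities are simply added.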

Even better: Logistic regression

Principal Components Analysis on Library of Congress Classifications

Topic Modeling

David Blei, Princeton University

MALLET

Topic Modeling literary scholarship by decade

Topic Modeling Television shows by screen time

Macroanalysis: Networks of Topics

Matthew Jockers, University of Nebraska

Go-to software for text analysis:

  • With just a web browser:
    • Cut and paste into an online environment: Voyant: voyant-tools.org
  • Following some arcane instructions:
    • Topic modeling and machine learning: MALLET
    • Network Analysis: Gephi
    • Data visualization: Tableau?
    • Tutorials at ProgrammingHistorian.org
  • Some programming required:
    • Cleaning and processing .txt files: Python
    • Statistical analysis: The "R" Language
    • Data visualization: R or D3 (Javascript, online)
    • Sharing libraries online: Bookworm

(reminder: this list is at benschmidt.org/.pdf)
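For the "cleaning and processing .txt files" item, a first pass in Python might look like the sketch below. It assumes plain UTF-8 text files with OCR-style line breaks; the function names and the sample string are hypothetical, and the hyphen rule is deliberately crude (it will also join genuine hyphenated compounds that happen to break across lines).

```python
import re
from collections import Counter
from pathlib import Path

def clean_text(raw):
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()

def count_words(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def count_folder(folder):
    """Clean and count every .txt file in `folder` (a hypothetical path)."""
    totals = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        totals.update(count_words(clean_text(raw)))
    return totals

# Made-up sample standing in for the contents of one scanned file.
sample = "The  digi-\ntal text was\nscanned.  The text was messy."
cleaned = clean_text(sample)
counts = count_words(cleaned)
```

Calling `count_folder` on a directory of texts returns one combined `Counter`, which can then be passed to a statistics or visualization tool.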

Open Questions

  • Do you (or does someone else) need to program?
  • Do you need more texts than you can read?
  • If this is interdisciplinary, what discipline?