Text Analysis and Medical History

Ben Schmidt: NLM, April 13, 2016

Online notes: benschmidt.org/medhist16

Text analysis and medical history

  1. About that VM...
  2. Thinking strategically about textual research
  3. Finding and creating digital libraries
  4. Some useful transformations to pursue

The Virtual Machine

  1. Computer languages:
    1. Python
    2. R
  2. Specific Programs
    1. OpenCV
    2. Topic Modeling
    3. Word Embedding

Everyone should learn to code!

Fully-informed Programming

Napoleon's March on Moscow: Charles Minard 1861


web browser: localhost:8787

Why Digital Text Analysis?

Selecting and getting to know a corpus.


Contributing through the texts: Women Writers Project

Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

Co-citation Networks

More data from:

Jstor Data for Research: dfr.jstor.org


Ngrams: New Archaism

Ngrams: Presses

  • BiblioBazaar

Ngrams: 3 common words



Ngrams: 3 2044


Google's partner libraries

Words that drop off between 1918-1922 and 1923-1927
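
One way such a comparison could be computed, as a minimal sketch: rank words by the ratio of their relative frequency in the earlier period to their relative frequency in the later one. The function name `drop_off`, the `min_count` cutoff, and the add-one smoothing are illustrative assumptions, not the method behind the slide.

```python
from collections import Counter

def drop_off(early: Counter, late: Counter, min_count: int = 5):
    """Rank words by how sharply their relative frequency falls from early to late."""
    early_total = sum(early.values())
    late_total = sum(late.values())
    scores = {}
    for word, n in early.items():
        # Skip rare words (assumed cutoff): their ratios are mostly noise.
        if n < min_count:
            continue
        early_rate = n / early_total
        # Add-one smoothing so words absent from the later period do not divide by zero.
        late_rate = (late.get(word, 0) + 1) / (late_total + 1)
        scores[word] = early_rate / late_rate
    # Largest ratio first: the words that fell away most steeply.
    return sorted(scores, key=scores.get, reverse=True)
```

The two `Counter` objects would hold word counts for each period, for example one built from 1918-1922 texts and one from 1923-1927 texts.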

General purpose sources of digital texts

Seasonality of measles

19th-century whooping cough seems to be a winter disease

Whooping cough's patterns are all between months, not within them
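
A seasonality check of this kind can be sketched in a few lines: for each calendar month, what share of documents mention the term? The input shape (pairs of a `datetime.date` and the document text) and the name `monthly_share` are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

def monthly_share(docs, term):
    """Return {month number: share of that month's documents mentioning term}."""
    totals, hits = Counter(), Counter()
    needle = term.lower()
    for published, text in docs:
        totals[published.month] += 1
        # Case-insensitive substring match; a real study would want tokenization.
        if needle in text.lower():
            hits[published.month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}
```

Dividing by each month's document total matters: raw counts would mostly reflect how much was printed in each month, not how often the disease was discussed.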

"Croup" in American Newspapers

"Cough syrup" in American Newspapers

Sources for medical texts

Let's figure this out.


Defining Units of Analysis

Optical Character Recognition

Ted Underwood, University of Illinois

Word Clouds

History Dissertations, Robert Townsend

Named Entity Recognition: http://nlp.stanford.edu:8080/ner/process

Geoparsing and geotagging

Matt Wilkens, Notre Dame: "The distribution of US city-level locations, revealing a preponderance of literary–geographic occurrences in what we would now call the Northeast corridor."

Not just tokens

Texts are "multiply addressable" (Michael Witmore)

Use metadata!

HathiTrust Bookworm:


Medical Heritage Library Bookworm

Experimental Index Catalog Bookworm

Geographic instead of temporal search

(Index Catalog just for Europe)

Creating a corpus

Algorithms for insight

State of the Union Comparisons

Naive Bayes

  • Even better: Logistic regression.
  • Or Support Vector Machines
  • Or K-Nearest-Neighbor
  • ... There are textbooks on this stuff.
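
To make the first of these concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The function names `train` and `classify`, the add-one smoothing, and the choice to ignore unseen words are assumptions for illustration; an established library would be the better choice for real work.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: iterable of (label, list_of_tokens) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        # Start from the log prior: how common is this label overall?
        score = math.log(doc_counts[label] / n_docs)
        for w in tokens:
            if w not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (label, word) pairs above zero.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The "naive" part is the assumption that each word contributes independently, which is why the score is a simple sum of per-word log-probabilities.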

Principal Components Analysis--on Library of Congress Classifications

Topic Modeling

David Blei, Princeton University


Topic Modeling literary scholarship by decade

Go-to software packages:

Go-to software for text analysis.

  • With just a web browser:
    • Cut and paste into an online environment: Voyant: voyant-tools.org
  • Following some arcane instructions:
    • Topic modeling and machine learning: MALLET
    • Network Analysis: Gephi
    • Data visualization: Tableau?
    • Tutorials at ProgrammingHistorian.org
  • Some programming required.
    • Cleaning and processing .txt files: Python
    • Statistical analysis: The "R" Language
    • Data visualization: R or D3 (JavaScript, online)
    • Sharing libraries online: Bookworm

(reminder: this list is at benschmidt.org/medhist16)
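
As a taste of the "cleaning and processing .txt files" step, a minimal Python sketch using only the standard library. The function names and the specific cleaning rules (rejoining words hyphenated across line breaks, collapsing whitespace, keeping only ASCII letters) are illustrative assumptions; what counts as noise depends on the corpus.

```python
import re
from collections import Counter

def clean(raw: str) -> str:
    """Undo two common artifacts of scanned or OCR'd text."""
    # Rejoin words hyphenated across line breaks: "diph-\ntheria" -> "diphtheria".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str):
    """Lowercase the text and split it into words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_counts(raw: str) -> Counter:
    """Count word frequencies in a raw text string."""
    return Counter(tokenize(clean(raw)))
```

To run it on a file, read the text first, for instance with `pathlib.Path(name).read_text()`, then pass the string to `word_counts`.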

The Open Questions

Open Questions

  • Do you (or does someone else) need to program?
  • Do you need more texts than you can read?
  • If this is interdisciplinary, what discipline?