Bookworm: Building an expressive grammar of humanities text analysis

Ben Schmidt

Assistant Professor of History, Northeastern University; Core Faculty, NuLab for Texts, Maps, and Networks

Intro

Acknowledgements

Institutions

Northeastern University/Rice University Cultural Observatory

Erez Lieberman Aiden

Neva Cherniavsky, Martin Camacho, Matt Nicklay, Billy Janitsch, JB Michel.

Acknowledgements

Funders

  • Digital Public Library of America
  • Harvard Cultural Observatory
  • National Endowment for the Humanities
  • github.com/Bookworm-project
  • bookworm.culturomics.org

Practical

Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

ICOADS Deck 701, US Maury Collection (1789-c.1865) ICOADS

ICOADS Deck 735, Russian Research Vessel (R/V) Digitization

books.google.com/ngrams

02138

Google Partner Libraries

Bookworm Open Library

1 million books, 91 billion words

Some Bookworm instances at Northeastern and Rice:

  • Science articles: bookworm.culturomics.org/arxiv
  • Movies and television: movies.benschmidt.org
  • Historical Newspapers: bookworm.culturomics.org/ChronAm

Yale University: full run of Vogue Magazine

http://bookworm.library.yale.edu/collections/vogue/

Medical Heritage Library (30,000 medical books)

US State Department

Hathi Trust

http://sandbox.htrc.illinois.edu/bookworm/

Abstract

II. A grammar of text analysis

Bookworm Core Philosophy--infrastructures

  • Text curators can enable new uses by sharing new forms.
  • Large text collections share fundamental problems of presentation.
  • Sharing presentation forms will improve them.
  • Data visualization is a primary output.

A single backend can drive multiple representations.

"Focusing Attention"

Bookworm Core Philosophy--data structures

  • Even humanists who don't know what it is will benefit from an expressive way to describe text analysis at very large scale.
  • Metadata alone can enable most uses in the humanities.
  • What we think of as textual content should be reconceived as more metadata.
  • Comparison across sets is the most important element.

An API vs a grammar

Grammar, Part I: Creating a corpus

{"search_limits": {
   "publish_country": ["United States"],
   "year": {
      "$lte": 1920,
      "$gte": 1890
   },
   "word": ["focus attention"]
},
"compare_limits": {
   "publish_country": ["United States"],
   "year": {
      "$lte": 1920,
      "$gte": 1890
   }
}}
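
One way to read the query above: search_limits selects the texts and tokens being counted, and compare_limits selects the baseline they are measured against. On this slide the baseline is simply the search with the "word" constraint dropped. The Python below is a minimal sketch of that relationship over in-memory records. The helper names derive_compare_limits and matches are hypothetical and are not part of the Bookworm codebase, and only the "$lte" and "$gte" operators shown on the slide are handled.

```python
def derive_compare_limits(search_limits, token_keys=("word",)):
    """Copy search_limits, dropping token constraints (hypothetical helper)."""
    return {k: v for k, v in search_limits.items() if k not in token_keys}


def matches(record, limits):
    """True if a metadata record satisfies every limit in the dictionary."""
    for field, condition in limits.items():
        value = record.get(field)
        if value is None:
            return False
        if isinstance(condition, dict):
            # Range form, e.g. {"$lte": 1920, "$gte": 1890}
            if "$lte" in condition and value > condition["$lte"]:
                return False
            if "$gte" in condition and value < condition["$gte"]:
                return False
        elif value not in condition:
            # List form, e.g. ["United States"]: any listed value matches
            return False
    return True


search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
    "word": ["focus attention"],
}
compare_limits = derive_compare_limits(search_limits)
```

Here compare_limits comes out equal to the block written by hand on the slide, which is why the two halves of the query can be thought of as a numerator and a denominator over the same corpus.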

Grammar, Part II: Defining Text groups and tokenizations.

{"search_limits": {
   "word": ["need to"]
},
"database": "movies",
"aesthetic": {"x": "MovieYear", "y": "WordsPerMillion"},
"plotType": "linechart"}
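
The "aesthetic" block pairs each visual channel with either a metadata field (MovieYear) or a count statistic (WordsPerMillion), while other queries in these slides spell the same two roles out as "groups" and "counttype". The sketch below shows one plausible translation between the two forms. It is an assumption about how such a mapping could work, not code taken from Bookworm; split_aesthetic is a hypothetical name, and COUNT_TYPES lists only the statistics that appear in these slides.

```python
# Count statistics named on these slides; a real installation may offer others.
COUNT_TYPES = {"WordsPerMillion", "WordCount", "TextCount", "TextPercent", "Dunning"}


def split_aesthetic(aesthetic):
    """Separate an aesthetic mapping into grouping fields and count statistics."""
    groups, counttype = [], []
    for _channel, variable in aesthetic.items():
        if variable in COUNT_TYPES:
            counttype.append(variable)   # a number to compute
        else:
            groups.append(variable)      # a metadata field to group by
    return {"groups": groups, "counttype": counttype}


line_chart = split_aesthetic({"x": "MovieYear", "y": "WordsPerMillion"})
```

The point of the split is that the visual channels (x, y, size, label) are presentation, while the grouping fields and statistics are the query proper.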

Grammar, Part II: Defining texts and tokens

Concentrating Attention

{
"plotType": "heatmap",
"search_limits": {
   "word": ["concentrate attention"]
},
"counttype": ["WordsPerMillion"],
"groups": ["decade", "subject"]}
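
To make the heatmap query concrete: each cell is one (decade, subject) pair, and its value is the number of matches per million words in that cell. A minimal sketch of that aggregation follows. The row layout (the "hits" and "total_words" fields) and the toy numbers are invented for illustration and do not come from any Bookworm database.

```python
from collections import defaultdict


def words_per_million(rows, groups):
    """Aggregate match counts by the grouping fields and scale per million words."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[g] for g in groups)
        hits[key] += row["hits"]
        totals[key] += row["total_words"]
    # Skip empty groups to avoid dividing by zero
    return {k: 1_000_000 * hits[k] / totals[k] for k in totals if totals[k]}


# Toy rows, one per text; the values are made up.
rows = [
    {"decade": 1890, "subject": "Psychology", "hits": 3, "total_words": 1_000_000},
    {"decade": 1890, "subject": "Psychology", "hits": 1, "total_words": 1_000_000},
    {"decade": 1900, "subject": "Education", "hits": 5, "total_words": 500_000},
]
cells = words_per_million(rows, ["decade", "subject"])
```

Changing the groups list from ["decade", "subject"] to ["decade"] alone would collapse the heatmap into a single time series without touching the rest of the query.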

Grammar, Part II:

Bicycles

[images/bicycle.png]

Library composition

Statistics returnable from comparing text and token counts across two collections:

  1. Percentage of texts
  2. Uses per million words
  3. Average length of books
  4. TF-IDF
  5. Dunning log-likelihood
  6. ???
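
As an illustration of item 5, the sketch below computes a Dunning-style log-likelihood score for one word from its counts in two collections. It uses the two-cell form common in corpus linguistics, G2 = 2 * sum(observed * ln(observed / expected)), and adds a sign to show which collection overuses the word. The signing convention and the function name dunning_g2 are assumptions made for this example; Bookworm's own calculation may differ.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed two-cell log-likelihood (G2) for one word in two collections."""
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the word were used at the same rate in both collections
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # the limit of x * ln(x) as x goes to 0 is 0
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    # Positive when collection A uses the word at a higher rate than B
    overused_in_a = count_a * total_b >= count_b * total_a
    return g2 if overused_in_a else -g2
```

A word used at the same rate in both collections scores zero; the score grows with both the size of the rate difference and the amount of evidence, which is why it suits comparisons between collections of very different sizes.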

Dunning-characteristic words by gender in reviews of history professors, Ratemyprofessor.com

Dunning-characteristic words by gender in negative reviews of history professors, Ratemyprofessor.com

A grammar promotes extensibility.

Practical: a grammar promotes critical combination of different tools.

How humanists use topic models badly:

  1. Only perfunctory efforts to link back into other forms of metadata.
  2. Straightforward use as dimensionality-reduction without spot checks.
  3. Assumption of stability in a single topic's composition across time, genre, etc.

Differing language in two TV shows

{"database": "movies",
"plotType": "wordcloud",
"search_limits": {
   "TV_show": ["Seinfeld"]
},
"compare_limits": {
   "TV_show": ["Cheers"]
},
"aesthetic": {
   "label": "unigram",
   "size": "Dunning"
}}

Differing language in two TV shows

Topics as tokens

{
"database": "movies",
"plotType": "wordcloud",
"search_limits": {
  "TV_show": ["Seinfeld"]
},
"compare_limits": {
  "TV_show": ["Cheers"]
},
"aesthetic": {
  "label": "unigram",
  "size": "Dunning"
}}

Differing language in two TV shows

Deadwood and the Wire, Topics

Text Encoding Initiative produces micro-texts

Newspapers in Space

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount"
}}

In Space and time

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount",
  "time": "publish_year"}}

Newspaper flu coverage, 1917-1919

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "word": ["flu", "influenza"],
  "publish_year": {"$lte": 1920, "$gte": 1917}
},
"smoothingSpan": 25,
"aesthetic": {
  "time": "publish_day",
  "point": "placeOfPublication_geo",
  "size": "TextPercent"}}
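
Daily percentages for a single town are noisy, which is presumably why this query sets "smoothingSpan". The slides do not say how the smoothing is done, so the sketch below assumes the simplest reading: TextPercent is the share of texts on a given day that contain a search word, and the span is the width of a centered moving-average window. Both function names are hypothetical.

```python
def text_percent(matching_texts, all_texts):
    """Percentage of texts in a group that contain at least one search word."""
    return 100.0 * matching_texts / all_texts if all_texts else 0.0


def moving_average(values, span):
    """Centered moving average; the window shrinks at the edges of the series."""
    half = span // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy daily series for one place of publication; the numbers are made up.
daily = [0, 0, 9, 0, 0]
smoothed = moving_average(daily, 3)
```

With a span of 25 days, a single day of heavy coverage is spread across roughly three and a half weeks, so sustained coverage stands out on the animated map while one-off spikes are damped.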

A Viral Text

{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "chunk": [6]
},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "WordCount",
  "time": "date_year"}}

Viral Texts: Ryan Cordell and David Smith, Northeastern University

Viral Topics

{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "topic": {
    "$gte": 10
  }
},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "WordCount",
  "time": "topic",
  "label": "topic_label"}}

Conclusion

What does this tell us that we haven't seen already?

  • What do you mean by "us," kemosabe?
  • Evidence without source criticism is increasingly problematic
  • Allows sharing and elaboration of evidence
  • Allows new strategies of analysis for metadata
  • Evidence of Absence
  • Comparison of tools for inference.

Code: github.com/bookworm-project

Docs: http://bookworm-project.github.io/Docs

Plugins: github.com/benmschmidt/Bookworm-Mallet

Project Description: bookworm.culturomics.org

Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm

Ben Schmidt: benschmidt.org

These slides: benschmidt.org/slides/Dartmouth