Bookworm: a grammar for text analysis and visualization

Benjamin Schmidt

November 12, 2014

Acknowledgements

Institutions

  • Northeastern University/Rice University Cultural Observatory

People

  • Erez Lieberman Aiden
  • Neva Cherniavsky, Martin Camacho, Matt Nicklay, Billy Janitsch, JB Michel.

Funders

  • Digital Public Library of America
  • Harvard Cultural Observatory
  • National Endowment for the Humanities
  • github.com/Bookworm-project
  • bookworm.culturomics.org
  • benschmidt.org/moosehead

Reading Digital Libraries at scale.

Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspaper

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

ICOADS Deck 701, US Maury Collection (1789-c.1865) ICOADS

ICOADS Deck 735, Russian Research Vessel (R/V) Digitization

books.google.com/ngrams

02138

3 2044

Google's partner libraries

Bookworm Open Library

1 million books, 91 billion words

Some Bookworm instances at Northeastern and Rice:

Mentions of US Presidents in Ngrams and the Chronicling America Bookworm

  • http://benschmidt.org/slides/images/ChronAmPresidents.png

The 1896 Election: Candidates by news coverage

All Presidential elections by news coverage

The 1872 election: surprisingly interesting

Yale University: full run of Vogue Magazine

http://bookworm.library.yale.edu/collections/vogue/

Medical Heritage Library (30,000 medical books)

US State Department

Hathi Trust

http://sandbox.htrc.illinois.edu/bookworm/

Core Philosophy: texts and metadata.

How do digital libraries expose their texts?

  • Search engines
  • Full-text dumps

Bookworm Core Philosophy--infrastructures

  • Text curators can enable new uses by sharing new forms.
  • Large text collections share fundamental problems of presentation.
  • Sharing presentation forms will improve them.
  • Data visualization should be a primary output.

A single backend can drive multiple representations.

"Focusing Attention"

Bookworm Core Philosophy--data structures

  • Even humanists unfamiliar with the idea will benefit from an expressive way to describe text analysis at very large scale.
  • Metadata alone can enable most uses in the humanities.
  • What we think of as textual content should be reconceived as more metadata.
  • Comparison across sets is the most important element.

A generative grammar for texts

Why a grammar?

  • Forces restrictions on goals.
  • Avoids platform dependence.
  • Easy integration with other grammatical platforms.

Grammar, Part I: Creating a corpus

"search_limits": {
    "publish_country": ["United States"],
    "year": {
        "$lte": 1920,
        "$gte": 1890
    }
}
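
One way to read the limits above is as a filter over per-text metadata. The following is a minimal sketch of that reading, not Bookworm's implementation: the records are invented, and the operator semantics ("$lte" as less-than-or-equal, and so on; a list as "any of these values") are assumptions. "$lt" does not appear in these slides and is included only for symmetry.

```python
# Toy metadata records; field names follow the query above, values are invented.
records = [
    {"title": "A", "publish_country": "United States", "year": 1895},
    {"title": "B", "publish_country": "United Kingdom", "year": 1900},
    {"title": "C", "publish_country": "United States", "year": 1925},
    {"title": "D", "publish_country": "United States", "year": 1920},
]

search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
}

# Assumed meaning of the comparison operators.
OPERATORS = {
    "$lte": lambda value, bound: value <= bound,
    "$gte": lambda value, bound: value >= bound,
    "$lt": lambda value, bound: value < bound,
    "$gt": lambda value, bound: value > bound,
}


def matches(record, limits):
    """Return True if the record satisfies every limit."""
    for field, condition in limits.items():
        value = record.get(field)
        if value is None:
            return False
        if isinstance(condition, dict):
            # Range condition: every operator must hold.
            if not all(OPERATORS[op](value, bound) for op, bound in condition.items()):
                return False
        elif value not in condition:
            # List condition: the value must be one of the listed options.
            return False
    return True


corpus = [r["title"] for r in records if matches(r, search_limits)]
print(corpus)  # ['A', 'D']
```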

Grammar, Part II: Defining Text groups and tokenizations.

   {
   "search_limits": {
      "word": ["need to"]
    },
   "database":"movies",
   "aesthetic":{"x":"MovieYear","y":"WordsPerMillion"},
   "plotType":"linechart"
   }
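
Because a query is plain JSON, it can be assembled and serialized in any language. A rough sketch in Python follows; the endpoint URL and the "query" parameter name are hypothetical placeholders, not the documented Bookworm API, and the sketch does not check whether a given plotType accepts these aesthetics.

```python
import json
from urllib.parse import quote

# Hypothetical endpoint and parameter name, for illustration only.
ENDPOINT = "https://example.org/bookworm/api"


def build_query(database, words, aesthetic, plot_type):
    """Assemble a query dictionary in the shape shown above."""
    return {
        "search_limits": {"word": list(words)},
        "database": database,
        "aesthetic": dict(aesthetic),
        "plotType": plot_type,
    }


def to_url(query):
    """Serialize the query as URL-encoded JSON."""
    return f"{ENDPOINT}?query={quote(json.dumps(query))}"


line = build_query(
    "movies",
    ["need to"],
    {"x": "MovieYear", "y": "WordsPerMillion"},
    "linechart",
)
# Same corpus and tokens, different representation: only plotType changes.
bars = dict(line, plotType="barchart")

url = to_url(line)
print(url)
```

Keeping the corpus definition separate from the plot type is what lets one backend serve several representations of the same counts.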

Rate My Professors time chart

{
"database": "RMP",
"plotType": "barchart",
"search_limits": {
    "word": ["exams"]},
"aesthetic": {
    "x": "WordsPerMillion",
    "y": "gender"}
}

Exams and essays


{"database":"RMP",
    "plotType":"pointchart",
    "compare_limits":{"word":["test","tests","exam","exams","quiz","quizzes"],
        "department__id":{"$lte":20}},
    "search_limits":{"word":["essay","essays","paper","papers"],
        "department__id":{"$lte":20}},
    "aesthetic":{"x":"WordsRatio","y":"department","color":"gender"}
}

A new interactive site: Student assessments and gender prejudices

Grammar, Part II: Defining texts and tokens

{
"plotType": "heatmap",
"search_limits": {
   "word": ["concentrate attention"],"year":{"$lte":1922,"$gte":1850}
},
"aesthetic":{"color":"WordsPerMillion","x":"decade","y":"BenSubject"},
"database":"presidio"
}

Time interactions.

Bicycles

Library composition

{"search_limits":{
    "publish_country":["United States"],
    "year": {
      "$lte": 1920,
      "$gte": 1890
    },
    "word":["natural selection"]
},

"compare_limits":{
    "publish_country":["United States"],
    "year": {
      "$lte": 1920,
      "$gte": 1890
    }
},

"counttype":["WordsPerMillion"]}

Statistics returnable from comparing text and token counts across two collections:

  1. Percentage of Texts
  2. Uses per million words
  3. Average length of books.
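
The three statistics can be sketched as ratios of four counts: texts and words matched by search_limits, and texts and words in the compare_limits collection. The definitions below are assumptions about what each name means, and the counts are invented.

```python
def text_percent(search_texts, compare_texts):
    """Share of texts in the comparison set that match the search, in percent."""
    return 100.0 * search_texts / compare_texts


def words_per_million(search_words, compare_words):
    """Occurrences of the search term per million words of the comparison set."""
    return 1_000_000 * search_words / compare_words


def average_length(compare_words, compare_texts):
    """Mean number of words per text in the comparison set."""
    return compare_words / compare_texts


# Invented counts: 50 of 1,000 texts use the phrase, 200 times in 80 million words.
search_texts, search_words = 50, 200
compare_texts, compare_words = 1_000, 80_000_000

print(text_percent(search_texts, compare_texts))      # 5.0
print(words_per_million(search_words, compare_words)) # 2.5
print(average_length(compare_words, compare_texts))   # 80000.0
```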

Revolutionaries reading history

Civil War spikes in US history publications

The "Lincoln catechism"

Statistics returnable from comparing text and token counts across two collections:

  1. Percentage of Texts
  2. Uses per million words
  3. Average length of books.
  4. TF-IDF
  5. Dunning Log-likelihood.
  6. ???
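
For reference, a sketch of Dunning's log-likelihood for one word in two collections, using the common two-term form G2 = 2 * sum(O * ln(O / E)). Signing the score by which collection uses the word at the higher rate is a convention assumed here, not something the slides specify, and the function name is illustrative.

```python
import math


def signed_dunning(count_a, total_a, count_b, total_b):
    """Log-likelihood score for a word seen count_a times in total_a words
    of collection A and count_b times in total_b words of collection B.

    Positive when the word is relatively more frequent in A, negative when
    it is relatively more frequent in B, zero when the rates are equal.
    """
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate

    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2

    if count_a / total_a < count_b / total_b:
        return -g2
    return g2


# Invented counts: the word appears 30 times per 1,000 words in A, 10 in B.
print(signed_dunning(30, 1000, 10, 1000))
```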

Dunning-characteristic words by gender in reviews of history professors, Ratemyprofessor.com

Dunning-characteristic words by gender in negative reviews of history professors, Ratemyprofessor.com

Differing language in two TV shows

{"database": "movies",
"plotType": "wordcloud",
"search_limits": {
   "MovieYear":[2009],"medium":"TV show"
},
"compare_limits": {
   "MovieYear":[2009],"medium":"movie"
},
"aesthetic": {
   "label": "unigram",
   "size": "Dunning"
}}

Grammar, Part III, alternate:

Instead of comparing, returning results.

Extensibility through a grammar

Simplest Extensions:

  1. Lemmatization
  2. Random divisions and bootstrapping.
  3. Conversion to standard vocabulary.
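
As an illustration of the second extension, a percentile bootstrap over texts gives a rough interval for a words-per-million figure. This is a generic sketch, not a Bookworm feature as described in these slides; the per-text counts and the function name are invented.

```python
import random


def bootstrap_wpm(texts, n_resamples=1000, seed=0):
    """Percentile bootstrap interval (roughly 95%) for words per million.

    texts: list of (hits, total_words) pairs, one pair per text.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample whole texts with replacement.
        sample = rng.choices(texts, k=len(texts))
        hits = sum(h for h, _ in sample)
        words = sum(w for _, w in sample)
        estimates.append(1_000_000 * hits / words)
    estimates.sort()
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return lower, upper


# Invented per-text counts: (occurrences of the term, length of the text).
texts = [(3, 50_000), (0, 80_000), (7, 120_000), (1, 40_000), (0, 60_000)]
lower, upper = bootstrap_wpm(texts)
print(lower, upper)
```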

How Humanists use topic models badly:

  1. Only perfunctory efforts to link back into other forms of metadata.
  2. Straightforward use as dimensionality-reduction without spot checks.
  3. Assumption of stability in a single topic's composition across time, genre, etc.

Differing language in two TV shows

Topics as tokens

{
"database": "movies",
"plotType": "wordcloud",
"search_limits": {
  "TV_show": ["Seinfeld"]
},
"compare_limits": {
  "TV_show": ["Cheers"]
},
"aesthetic": {
  "label": "topic_label",
  "size": "Dunning"
}}

Differing language in two TV shows

Comparing in topic vocabulary:

{"database":"movies","plotType":"wordcloud",
    "search_limits":{
        "TV_show":["The Wire"],
        "topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
    "compare_limits":{
        "TV_show":["Deadwood"],"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
    "aesthetic":{"label":"unigram","size":"Dunning"}
}

Deadwood and the Wire, Topics

An interactive tool for exploring changing topic composition

Text Encoding Initiative produces micro-texts

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount"
}}

In Space and time

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount",
  "time": "publish_year"}}

Newspaper flu coverage, 1917-1919

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "word": ["flu", "influenza", "pneumonia"],
  "publish_year": {"$lte": 1920, "$gte": 1918}
},
"smoothingSpan": 25,
"aesthetic": {
  "time": "publish_day",
  "point": "placeOfPublication_geo",
  "size": "TextPercent"}}

A platform for existing research

Viral Topics

{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "topic": {
    "$gte": 10
  }
},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "WordCount",
  "time": "topic",
  "label": "topic_label"}}

{"database": "boston",
"plotType": "map",
"smoothingSpan": 30,
"method": "return_json",
"search_limits": {
  "word": ["Dot Ave", "Mass Ave"],
  "requested_month": {
    "$gt": 100
  }
},
"aesthetic": {
  "point": "location",
  "size": "WordCount",
  "label": "service_name"},
"counttype": ["WordCount"],
"groups": ["location", "service_name"]}

Conclusion

What does this tell us that we haven't seen already?

  • What do you mean by "us," kemosabe?
  • Evidence without source criticism is increasingly problematic
  • Allows sharing and elaboration of evidence
  • Allows new strategies of analysis for metadata
  • Evidence of Absence
  • Comparison of tools for inference.

Code: github.com/bookworm-project

Docs: http://bookworm-project.github.io/Docs

Plugins: github.com/benmschmidt/Bookworm-Mallet

Project Description: bookworm.culturomics.org

Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm

Ben Schmidt: benschmidt.org

These slides: benschmidt.org/slides/2014-11-12_Bookworm