Bookworm: a grammar for text analysis and visualization

Benjamin Schmidt

February 25, 2015

Acknowledgements

Institutions

  • Northeastern University/Rice University Cultural Observatory

People

  • Erez Lieberman Aiden
  • Neva Cherniavsky, Martin Camacho, Matt Nicklay, Billy Janitsch, JB Michel.

Funders

  • Digital Public Library of America
  • Harvard Cultural Observatory
  • National Endowment for the Humanities

Partners

  • Hathi Trust Research Center
  • github.com/Bookworm-project
  • bookworm.culturomics.org
  • benschmidt.org/moosehead

Reading Digital Libraries at scale.

Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspaper

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial Lengths

Data Mining with Criminal Intent/The Old Bailey Online

books.google.com/ngrams

02138

3 2044

Google's partner libraries

Bookworm Open Library

1 million books, 91 billion words

Some Bookworm instances at Northeastern and Rice:

Mentions of US Presidents in Ngrams and the Chronicling America Bookworm

  • http://benschmidt.org/slides/images/ChronAmPresidents.png

The 1896 Election: Candidates by news coverage

All Presidential elections by news coverage

The 1872 election: surprisingly interesting

Yale University: full run of Vogue Magazine

http://bookworm.library.yale.edu/collections/vogue/

Medical Heritage Library (30,000 medical books)

US State Department

Hathi Trust

http://sandbox.htrc.illinois.edu/bookworm/

Core Philosophy: texts and metadata.

How do digital libraries expose their texts?

  • Search engines
  • Full-text dumps

Bookworm Core Philosophy--infrastructures

  • Text curators can enable new uses by sharing new forms.
  • Large text collections share fundamental problems of presentation.
  • Sharing presentation forms will improve them.
  • Data visualization should be a primary output.

A single backend can drive multiple representations.

"Focusing Attention"

Bookworm Core Philosophy--data structures

  • Even humanists unfamiliar with the grammar will benefit from an expressive way to describe text analysis at very large scale.
  • Metadata alone can enable most uses in the humanities.
  • What we think of as textual content should be reconceived as more metadata.
  • Comparison across sets is the most important element.

A generative grammar for texts

Why a grammar?

  • Forces restrictions on goals.
  • Avoids platform dependence.
  • Easy integration with other grammatical platforms.

Grammar, Part I: Creating a corpus

"search_limits": {
    "publish_country": ["United States"],
    "year": {
        "$lte": 1920,
        "$gte": 1890
    }
}
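
A `search_limits` block can be read as a predicate over each text's metadata. The sketch below is a minimal illustration of that reading, not Bookworm's implementation: the `matches` function and the catalog records are hypothetical, and only the two forms shown on this slide (a list of allowed values, and `$lte`/`$gte` ranges) are handled.

```python
def matches(record, limits):
    """Return True if a metadata record satisfies every limit."""
    for field, condition in limits.items():
        value = record.get(field)
        if value is None:
            return False
        if isinstance(condition, list):
            # A list means "the field equals any one of these values".
            if value not in condition:
                return False
        elif isinstance(condition, dict):
            # A dict holds range operators applied to the field.
            if "$lte" in condition and not value <= condition["$lte"]:
                return False
            if "$gte" in condition and not value >= condition["$gte"]:
                return False
        else:
            raise ValueError(f"unsupported condition for {field!r}")
    return True


search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
}

# Hypothetical catalog records, for illustration only.
catalog = [
    {"title": "A", "publish_country": "United States", "year": 1895},
    {"title": "B", "publish_country": "United States", "year": 1925},
    {"title": "C", "publish_country": "France", "year": 1900},
]

corpus = [r["title"] for r in catalog if matches(r, search_limits)]
print(corpus)
```

Only record A satisfies both limits, so the corpus defined by this query contains that one text.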

Grammar, Part II: Defining text groups and tokenizations

{
    "search_limits": {
        "word": ["need to"]
    },
    "database": "movies",
    "aesthetic": {"x": "MovieYear", "y": "WordsPerMillion"},
    "plotType": "linechart"
}
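
Because a query like this one is plain JSON, any client that can serialize a dictionary can produce it. The sketch below builds the same query in Python and packs it into a URL. The endpoint `BASE_URL` and the parameter name `query` are placeholders assumed for illustration, not a documented Bookworm address.

```python
import json
from urllib.parse import urlencode, parse_qs

query = {
    "search_limits": {"word": ["need to"]},
    "database": "movies",
    "aesthetic": {"x": "MovieYear", "y": "WordsPerMillion"},
    "plotType": "linechart",
}

# Hypothetical endpoint and parameter name; substitute your installation's values.
BASE_URL = "http://localhost/cgi-bin/bookworm"
url = BASE_URL + "?" + urlencode({"query": json.dumps(query)})

# Round-trip check: decoding the URL parameter recovers the same query.
decoded = json.loads(parse_qs(url.split("?", 1)[1])["query"][0])
print(decoded == query)
```

The same dictionary could drive a line chart, a bar chart, or a raw data export, since only `plotType` and `aesthetic` describe the presentation.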

{
    "search_limits": {
        "word": ["iPhone"],
        "date_year": {"$gte": 2002, "$lte": 2015}
    },
    "database": "RMP",
    "aesthetic": {"x": "date_year", "y": "WordsPerMillion"},
    "plotType": "linechart"
}

{
"database": "RMP",
"plotType": "barchart",
"search_limits": {
    "word": ["exams"]},
"aesthetic": {
    "x": "WordsPerMillion",
    "y": "gender"}
}

Exams and essays


{"database":"RMP",
    "plotType":"pointchart",
    "compare_limits":{"word":["test","tests","exam","exams","quiz","quizzes"],
        "department__id":{"$lte":20}},
    "search_limits":{"word":["essay","essays","paper","papers"],
        "department__id":{"$lte":20}},
    "aesthetic":{"x":"WordsRatio","y":"department","color":"gender"}
}
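
The `WordsRatio` aesthetic in this query can be sketched as follows, assuming it means the count of `search_limits` words divided by the count of `compare_limits` words within each department and gender group. That definition is an assumption, and the rows are invented for illustration; they are not results from the RMP data.

```python
from collections import Counter

ESSAY_WORDS = {"essay", "essays", "paper", "papers"}             # search_limits
EXAM_WORDS = {"test", "tests", "exam", "exams", "quiz", "quizzes"}  # compare_limits

# Hypothetical (department, gender, word, count) rows, for illustration only.
rows = [
    ("History", "female", "essays", 30),
    ("History", "female", "exams", 20),
    ("History", "male", "papers", 24),
    ("History", "male", "tests", 24),
    ("Physics", "female", "paper", 5),
    ("Physics", "female", "exam", 50),
]

search_counts = Counter()
compare_counts = Counter()
for department, gender, word, count in rows:
    group = (department, gender)
    if word in ESSAY_WORDS:
        search_counts[group] += count
    if word in EXAM_WORDS:
        compare_counts[group] += count

# One ratio per group; groups with no compare-side words are skipped.
words_ratio = {
    group: search_counts[group] / total
    for group, total in compare_counts.items()
    if total > 0
}

for group, ratio in sorted(words_ratio.items()):
    print(group, round(ratio, 2))
```

A ratio above 1 marks a group whose reviews mention essays more than exams; a ratio below 1 marks the reverse.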

Student assessments and gender prejudices

Grammar, Part II: Defining texts and tokens

{
"plotType": "heatmap",
"search_limits": {
   "word": ["concentrate attention"],"year":{"$lte":1922,"$gte":1850}
},
"aesthetic":{"color":"WordsPerMillion","x":"decade","y":"BenSubject"},
"database":"presidio"
}

Time interactions.

Bicycles

Library composition

{
    "search_limits": {
        "publish_country": ["United States"],
        "year": {
            "$lte": 1920,
            "$gte": 1890
        },
        "word": ["natural selection"]
    },
    "compare_limits": {
        "publish_country": ["United States"],
        "year": {
            "$lte": 1920,
            "$gte": 1890
        }
    },
    "counttype": ["WordsPerMillion"]
}

Statistics returnable from comparing text and token counts across two collections:

  1. Percentage of Texts
  2. Uses per million words
  3. Average length of books.
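
All three statistics need only four numbers: the word and text counts for the search collection and for the comparison collection. The sketch below shows one plausible set of definitions. The `Counts` class, the function names, and the sample figures are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    words: int  # number of tokens
    texts: int  # number of texts


def text_percent(search: Counts, compare: Counts) -> float:
    """Share of the comparison texts that also match the search."""
    return 100.0 * search.texts / compare.texts


def words_per_million(search: Counts, compare: Counts) -> float:
    """Matching tokens per million tokens in the comparison collection."""
    return 1_000_000 * search.words / compare.words


def average_text_length(group: Counts) -> float:
    """Mean number of tokens per text in one collection."""
    return group.words / group.texts


# Hypothetical counts: uses of a phrase versus the whole filtered library.
search = Counts(words=450, texts=120)
compare = Counts(words=90_000_000, texts=1_500)

print(text_percent(search, compare))       # percentage of texts
print(words_per_million(search, compare))  # uses per million words
print(average_text_length(compare))        # average length of books
```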

Revolutionaries reading history

Civil War spikes in US history publications

The "Lincoln catechism"

Statistics returnable from comparing text and token counts across two collections:

  1. Percentage of Texts
  2. Uses per million words
  3. Average length of books.
  4. TF-IDF
  5. Dunning Log-likelihood.
  6. ???
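
Dunning log-likelihood, like the simpler statistics, can be computed from a word's count and the total token count in each collection. The sketch below uses one common formulation, the G² statistic over a 2×2 contingency table, with a sign added to show which collection over-uses the word. The function name and the signing convention are choices made for this illustration, not necessarily what Bookworm returns.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed log-likelihood ratio (G^2) for one word in two collections.

    Positive when the word is relatively more frequent in collection A,
    negative when it is relatively more frequent in collection B.
    """
    observed = [
        count_a, total_a - count_a,
        count_b, total_b - count_b,
    ]
    word_total = count_a + count_b
    grand_total = total_a + total_b
    # Expected cell counts if the word were equally frequent in both.
    expected = [
        total_a * word_total / grand_total,
        total_a * (grand_total - word_total) / grand_total,
        total_b * word_total / grand_total,
        total_b * (grand_total - word_total) / grand_total,
    ]
    g2 = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return g2 if count_a / total_a >= count_b / total_b else -g2


# Hypothetical counts: a word used 50 times in 1,000 tokens versus 10 in 1,000.
print(dunning_g2(50, 1000, 10, 1000))
print(dunning_g2(10, 1000, 10, 1000))
```

Equal rates in both collections give a score of zero; the further the score is from zero, the more characteristic the word is of one side.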

Dunning-characteristic words by gender in reviews of history professors, Ratemyprofessor.com

Dunning-characteristic words by gender in negative reviews of history professors, Ratemyprofessor.com

Grammar, Part III, alternate:

Instead of comparing, returning results.

Extensibility through a grammar

Simplest Extensions:

  1. Lemmatization
  2. Random divisions and bootstrapping.
  3. Conversion to standard vocabulary.
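
As an illustration of the second extension, the sketch below bootstraps a words-per-million estimate by resampling whole texts with replacement. It is a generic percentile bootstrap under assumed inputs, not a description of a Bookworm feature: the function name, the `(hits, length)` pair format, and the sample data are all hypothetical.

```python
import random


def bootstrap_wpm(texts, n_resamples=1000, seed=0):
    """95% percentile interval for words per million.

    `texts` is a list of (hits, length) pairs: uses of the word and the
    total token count in each text. Whole texts are resampled.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(texts, k=len(texts))
        hits = sum(h for h, _ in sample)
        length = sum(n for _, n in sample)
        estimates.append(1_000_000 * hits / length)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


# Hypothetical per-text counts, for illustration only.
texts = [(3, 1000), (0, 1200), (5, 900), (1, 1100),
         (2, 1000), (0, 800), (4, 1500), (1, 950)]

point = 1_000_000 * sum(h for h, _ in texts) / sum(n for _, n in texts)
low, high = bootstrap_wpm(texts)
print(round(low), round(point), round(high))
```

Resampling texts rather than individual tokens respects the fact that a word's uses cluster inside particular books.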

How Humanists use topic models badly:

  1. Only perfunctory efforts to link back into other forms of metadata.
  2. Straightforward use as dimensionality-reduction without spot checks.
  3. Assumption of stability in a single topic's composition across time, genre, etc.

Comparing the Vocabulary of two TV shows

{
"database": "movies",
"plotType": "worddiv",
"search_limits": {
  "TV_show": ["Seinfeld"]
},
"compare_limits": {
  "TV_show": ["Cheers"]
},
"aesthetic": {
  "label": "unigram",
  "size": "Dunning"
}}

Replacing "words" with "topics"

{
"database": "movies",
"plotType": "worddiv",
"search_limits": {
  "TV_show": ["Seinfeld"]
},
"compare_limits": {
  "TV_show": ["Cheers"]
},
"aesthetic": {
  "label": "topic_label",
  "size": "Dunning"
}}
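
The only difference between this query and the previous one is the `label` aesthetic. A minimal sketch of why that swap is cheap: if every token carries both its word and a topic assignment, the same counting step serves either unit. The token records and the topic labels below are invented for illustration, and `count_by` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical per-token records: (TV_show, unigram, topic_label).
tokens = [
    ("Seinfeld", "soup", "food eat dinner"),
    ("Seinfeld", "dinner", "food eat dinner"),
    ("Seinfeld", "apartment", "home door room"),
    ("Cheers", "beer", "drink bar beer"),
    ("Cheers", "bar", "drink bar beer"),
    ("Cheers", "dinner", "food eat dinner"),
]

FIELD_INDEX = {"unigram": 1, "topic_label": 2}


def count_by(tokens, show, unit):
    """Count one show's tokens, grouped by the chosen unit."""
    index = FIELD_INDEX[unit]
    return Counter(t[index] for t in tokens if t[0] == show)


print(count_by(tokens, "Seinfeld", "unigram"))
print(count_by(tokens, "Seinfeld", "topic_label"))
```

Any statistic defined on word counts, Dunning included, can then be computed on topic counts without changing the comparison logic.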

Comparing in topic vocabulary:

{"database":"movies","plotType":"worddiv",
    "search_limits":{
        "TV_show":["The Wire"],
        "topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
    "compare_limits":{
        "TV_show":["Deadwood"],"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
    "aesthetic":{"label":"unigram","size":"Dunning"}
}

An interactive tool for exploring changing topic composition

Text Encoding Initiative produces micro-texts

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount"
}}

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
  "point": "placeOfPublication_geo",
  "size": "TextCount",
  "time": "publish_year"}}

Newspaper flu coverage, 1917-1919

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {
  "word": ["flu", "influenza", "pneumonia"],
  "publish_year": {"$lte": 1920, "$gte": 1918}},
"smoothingSpan": 25,
"aesthetic": {
  "time": "publish_day",
  "point": "placeOfPublication_geo",
  "size": "TextPercent"}}
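
Daily counts from a single newspaper are noisy, which is presumably why this query sets `smoothingSpan`. The sketch below shows one possible reading, a centered moving average reaching `span` steps to either side of each day. That interpretation is an assumption, not documented behavior, and the values are invented.

```python
def moving_average(values, span):
    """Centered moving average; the window shrinks at the edges."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - span): i + span + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical daily TextPercent values with a one-day spike.
daily = [0, 0, 9, 0, 0]
print(moving_average(daily, 1))
```

With a span of 1 the single-day spike of 9 is spread over three days at 3 each, so a map animated by day changes more gradually.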

Mapping Places in the State of the Union

Conclusion

Code: github.com/bookworm-project

Docs: http://bookworm-project.github.io/Docs

Plugins: github.com/benmschmidt/Bookworm-Mallet

Project Description: bookworm.culturomics.org

Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm

Ben Schmidt: benschmidt.org

These slides: benschmidt.org/slides/2015-02-25_Bookworm