Bookworm: Exploring and Exposing Digital Texts through Metadata

Benjamin Schmidt
Assistant Professor of History, Northeastern University
Core Faculty, NULab for Texts, Maps, and Networks

www.benschmidt.org

Bookworm: Exploring Texts through Metadata

(http://bookworm.culturomics.org)

c. 1 million books, 80 billion words

Library metadata via Open Library

Digital Public Library of America funding
Team: Harvard Cultural Observatory, Rice Cultural Observatory, Northeastern University
Martin Camacho * Neva Cherniavsky * Erez Lieberman-Aiden * JB Michel * Billy Janitsch

How do you share digital texts?

Images

Source: Vatican Libraries

Philosophy: Exploring Texts through Metadata

Guiding philosophy

  1. Digital libraries are places to watch the interaction of metadata.
  2. Metadata is about the text (at whatever scale).
  3. Words and phrases are (just?) more metadata.

Metadata reveals things about the world

Deck 701, US Maury Collection (1789-c.1865) ICOADS

Metadata reveals things about how we see the world

ICOADS Deck 735, Russian Research Vessel (R/V) Digitisation

Climatological Metadata is Historical Data

Textual Metadata: Newspaper Locations

Bill Lane Center for the American West:
http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial lengths

Data Mining with Criminal Intent/The Old Bailey Online

Grounding Words in Texts

Google Ngrams

http://books.google.com/ngrams

02138

Google's partner libraries shift in 1900, 1922, and later.

Text Level Indexing

Comparing Custom Corpora

Natural Selection in three genres

bookworm.culturomics.org/OL

Bookworm Arxiv

600,000 math and physics articles from the last 20 years

arxiv.culturomics.org

Mentions of US Presidents in Ngrams and the Chronicling America Bookworm

Coverage of presidential candidates in the 1896 election by candidate last name.

Coverage of all presidential elections, 1860-1922

Coverage of presidential candidates in the 1872 election

Railroad mentions in three types of texts

Railroad mileage, annual

Railroad mileage, cumulative

The Bookworm API

  • Specify request using JSON queries

  • Post using HTTP

  • Return data in JSON or TSV

  • An example query

    
      {
        "method": "return_tsv",
        "counttype": ["WordsPerMillion"],
        "search_limits": {
          "country": ["USA", "UK"],
          "word": ["natural selection"]
        },
        "groups": ["year"],
        "database": "OL"
      }
      

    The Response

    
                  year    WordsPerMillion
                  [...]
                  1907    340.20526777
                  1908    341.83114533
                  1909    295.24911692
                  1910    282.24802327
                  1911    284.92406591
                  1912    283.89805752
                  1913    296.87614627
                  1914    332.76147647
                  1915    446.39889626
                  1916    428.87396542
                  1917    527.51044740
                  1918    647.48528263
                  1919    653.05159042
                  1920    507.23177682
                  1921    501.77615474
                  [...]
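
    A minimal sketch of the request cycle described above, in Python. The endpoint URL is a placeholder (the slides do not give one), and the `query` form-field name is likewise an assumption; only the query body and the tab-separated response shape come from the example. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: the slides do not give the API URL.
API_URL = "http://example.org/bookworm-api"


def build_query(words, countries, database="OL"):
    """Build the JSON query shown on the slide as a Python dict."""
    return {
        "method": "return_tsv",
        "counttype": ["WordsPerMillion"],
        "search_limits": {"country": countries, "word": words},
        "groups": ["year"],
        "database": database,
    }


def parse_tsv(text):
    """Parse a tab-separated response into a list of dicts keyed by header."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def fetch(query, url=API_URL):
    """POST the query; the 'query' form-field name is an assumption."""
    body = urllib.parse.urlencode({"query": json.dumps(query)}).encode("utf-8")
    with urllib.request.urlopen(url, data=body) as resp:
        return parse_tsv(resp.read().decode("utf-8"))


# Offline demonstration using two rows from the response above.
sample = "year\tWordsPerMillion\n1907\t340.20526777\n1908\t341.83114533\n"
rows = parse_tsv(sample)
```

    Keeping the parsing separate from the network call means the same `parse_tsv` helper can be used on saved responses.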
          

    Interactions among metadata

    
            {
              "method": "return_tsv",
              "counttype": ["WordsPerMillion"],
              "search_limits": {"word": ["natural selection"]},
              "groups": ["state", "year"],
              "database": "OL"
            }
            
    
              state       year    WordsPerMillion
              [...]
              NJ  1901    0E-8
              NJ  1902    0E-8
              NJ  1903    0.52162392
              NJ  1904    0E-8
              NJ  1905    0E-8
              NJ  1906    0E-8
              NJ  1907    0.52719259
              NJ  1908    0.59582825
              NJ  1909    0.23120944
              NJ  1910    1.08461634
              [...]
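
    Grouping on two fields returns one row per (state, year) pair. As a rough sketch, such rows could be pivoted into a nested table for plotting or comparison; the `pivot` helper is hypothetical, and the rows are a subset copied from the response above.

```python
from collections import defaultdict

# Rows copied from the response above, as (state, year, WordsPerMillion) strings.
rows = [
    ("NJ", "1901", "0E-8"),
    ("NJ", "1903", "0.52162392"),
    ("NJ", "1907", "0.52719259"),
    ("NJ", "1910", "1.08461634"),
]


def pivot(rows):
    """Turn long (state, year, value) rows into {state: {year: value}}."""
    table = defaultdict(dict)
    for state, year, value in rows:
        # float() accepts scientific notation, so "0E-8" becomes 0.0.
        table[state][int(year)] = float(value)
    return dict(table)


table = pivot(rows)
# Years in which the phrase appears at all in New Jersey texts.
nonzero_years = sorted(y for y, v in table["NJ"].items() if v > 0)
```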
              

    Returning Words As Metadata

    
                {
                  "method": "return_tsv",
                  "counttype": ["WordCount"],
                  "search_limits": {
                    "year": [1877],
                    "state": ["RI"]
                  },
                  "groups": ["unigram", "year"],
                  "database": "presidio"
                }
                
    
                  unigram     year    WordCount
                  [...]
                  resolve     1877    8
                  resolved    1877    272
                  resolves    1877    10
                  resolving   1877    2
                  resort      1877    10
                  resorted    1877    2
                  resorts     1877    4
                  resound     1877    1
                  [...c. 23,000 total rows...]
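
    Because the words come back as ordinary rows, they can be handled like any other metadata field. A small sketch, using only the eight rows shown above and a hypothetical `top_words` helper, ranks them by count:

```python
# (unigram, WordCount) pairs copied from the response above.
rows = [
    ("resolve", 8),
    ("resolved", 272),
    ("resolves", 10),
    ("resolving", 2),
    ("resort", 10),
    ("resorted", 2),
    ("resorts", 4),
    ("resound", 1),
]


def top_words(rows, n):
    """Return the n rows with the highest counts, largest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


most_common = top_words(rows, 1)
total = sum(count for _, count in rows)
```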
                  

    http://benschmidt.org/beta/APISandbox

    Thinking small

    Visualizations without underlying wordcount information.

    Rate My Professors

    A documentary source.

    Line charts of student-speak.

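    library(dplyr); library(ggplot2)
    # Query A: word counts in low-helpfulness (rHelpful <= 2) reviews of female Computer Science instructors since 2005.
    queryA = list("database"="RMP","search_limits" = list("gender"=list("female"),"rHelpful" = list("$lte" = list(2)),"department"=list("Computer Science"),"date_year" = list("$gte"=list(2005))),counttype=list("WordCount"),groups=list("unigram"))
    # Query B: the same query for male instructors.
    queryB = queryA
    queryB[['search_limits']][['gender']] = list("male")
    
    # compareTwoLanguages and genderStopwords are not defined on this slide.
    goodwords = compareTwoLanguages(queryA,queryB)
    
    # The cutoff in the second filter() was lost in extraction and is left elided ("...").
    historyPositive = goodwords %>% filter(!unigram %in% genderStopwords) %>% filter(-abs(dunning) < ...) %>% mutate(genderBias = ifelse(dunning > 0,"Female","Male"))
    
    ggplot(historyPositive) + geom_bar(aes(y=dunning,x=reorder(unigram,abs(dunning)),fill=genderBias),stat="identity") + coord_flip() + labs(x="Word",y="Overrepresentation (Dunning Log score)")  + theme(axis.text=element_text(size=12)) + labs(title="Gender-specific words in negative CS reviews")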
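
    The `dunning` column plotted above is a Dunning log-likelihood score. The `compareTwoLanguages` helper is not shown, so the following is only a sketch of the standard two-term form of the statistic, signed so that positive values mean overrepresentation in the first corpus. The function name, the signing convention, and the counts in the usage line are illustrative assumptions, not the author's implementation or data.

```python
import math


def signed_dunning(count_a, count_b, total_a, total_b):
    """Signed two-term log-likelihood (G2) for one word in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total word counts of each corpus.
    Positive when the word is relatively more frequent in corpus A.
    """
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    # Attach the sign of the difference in relative frequency.
    sign = 1 if count_a / total_a >= count_b / total_b else -1
    return sign * g2


# Illustrative counts, not data from the Rate My Professors corpus.
score = signed_dunning(20, 10, 1000, 1000)
```

    Swapping the two corpora flips the sign of the score, which is what lets a single bar chart show words leaning towards either group.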
    

    Extending the platform

  • bookworm.culturomics.org
    Instructions and free hosting for medium-sized collections.
  • bmschmidt.github.io/Presidio
    Manual oriented towards extensions and building Bookworms.
  • benschmidt.org/beta/APISandbox
    API testing ground.
  • github.com/bmschmidt/federalist-bookworm
    All-included demo that creates a full Bookworm from an XML file of the Federalist Papers.