Bookworm: Exploring and Exposing Digital Texts through Metadata

Benjamin Schmidt
Assistant Professor of History, Northeastern University
Core Faculty, Nulab for Texts, Maps, and Networks

www.benschmidt.org

Bookworm: Exploring Texts through Metadata

(http://bookworm.culturomics.org)

c. 1 million books, 80 billion words

Library metadata via Open Library

Digital Public Library of America funding

Team: Harvard Cultural Observatory, Rice Cultural Observatory, Northeastern University
Martin Camacho * Neva Cherniavsky * Erez Lieberman-Aiden * JB Michel * Billy Janitsch

How do you share digital texts?

Images

Source: Vatican Libraries

Philosophy: Exploring Texts through Metadata

Guiding philosophy

Digital libraries are places to watch the interaction of metadata.

Metadata is about the text (whatever scale).

Words and phrases are (just?) more metadata.

Metadata reveals things about the world

Deck 701, US Maury Collection (1789-c.1865) ICOADS

Metadata reveals things about how we see the world

ICOADS Deck 735, Russian Research Vessel (R/V) Digitisation

Climatological Metadata is Historical Data

Textual Metadata: Newspaper Locations

Bill Lane Center for the American West:
http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

Textual Metadata: Correspondence Networks

Mapping the Republic of Letters

Textual Metadata: Trial lengths

Data Mining with Criminal Intent/The Old Bailey Online

Grounding Words in Texts

Google Ngrams

http://books.google.com/ngrams

02138

Google's partner libraries shift in 1900, 1922, and later.

Text Level Indexing

Comparing Custom Corpora

Natural Selection in three genres

bookworm.culturomics.org/OL

Bookworm Arxiv

600,000 math and physics articles from the last 20 years

arxiv.culturomics.org

Mentions of US Presidents in Ngrams and the Chronicling America Bookworm

Coverage of presidential candidates in the 1896 election by candidate last name.

Coverage of all presidential elections, 1860-1922

Coverage of presidential candidates in the 1872 election

Railroad mentions in three types of texts

Railroad mileage, annual

Railroad mileage, cumulative

The Bookworm API

Specify the request as a JSON query

Post it over HTTP

Receive data back as JSON or TSV
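As a rough illustration of these three steps, the sketch below builds a query as a Python dictionary, serializes it to JSON, and wraps it in an HTTP POST request. The endpoint URL is a placeholder rather than the real Bookworm address, the `build_request` helper is invented for this example, and the JSON content type is an assumption about what a server would accept. The request is constructed but never sent.

```python
import json
import urllib.request

# Placeholder endpoint: an assumption for illustration, not the real Bookworm URL.
API_URL = "http://example.org/bookworm/api"


def build_request(query, url=API_URL):
    """Wrap a Bookworm-style query dict in an HTTP POST request (not sent)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumed content type; a real server may expect something else.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Query modeled on the ones shown in these slides.
query = {
    "method": "return_tsv",
    "counttype": ["WordsPerMillion"],
    "search_limits": {"word": ["natural selection"]},
    "groups": ["year"],
    "database": "OL",
}

request = build_request(query)
# Sending it would look like:
#   urllib.request.urlopen(request).read().decode("utf-8")
```

Keeping the query as an ordinary dictionary means a second query can be derived by copying it and changing a single key, which is the same pattern the slides use for comparing corpora.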

An example query


  {
    "method": "return_tsv",
    "counttype": ["WordsPerMillion"],
    "search_limits": {
      "country": ["USA", "UK"],
      "word": ["natural selection"]
    },
    "groups": ["year"],
    "database": "OL"
  }

The Response


              year    WordsPerMillion
              [...]
              1907    340.20526777
              1908    341.83114533
              1909    295.24911692
              1910    282.24802327
              1911    284.92406591
              1912    283.89805752
              1913    296.87614627
              1914    332.76147647
              1915    446.39889626
              1916    428.87396542
              1917    527.51044740
              1918    647.48528263
              1919    653.05159042
              1920    507.23177682
              1921    501.77615474
              [...]
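A response like this can be read with any TSV parser. The following is a minimal sketch, assuming the columns are separated by single tabs and the header row is exactly `year` and `WordsPerMillion`; the `parse_series` helper is hypothetical, and the sample rows are a subset copied from the response above.

```python
import csv
import io

# A few rows copied from the response above, assumed to be tab-separated.
SAMPLE_TSV = (
    "year\tWordsPerMillion\n"
    "1917\t527.51044740\n"
    "1918\t647.48528263\n"
    "1919\t653.05159042\n"
    "1920\t507.23177682\n"
)


def parse_series(tsv_text):
    """Return {year: value} from a two-column year/WordsPerMillion TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {int(row["year"]): float(row["WordsPerMillion"]) for row in reader}


series = parse_series(SAMPLE_TSV)
# Year with the highest rate among the sampled rows.
peak_year = max(series, key=series.get)
print(peak_year, series[peak_year])
```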

Interactions among metadata


        {
          "method": "return_tsv",
          "counttype": ["WordsPerMillion"],
          "search_limits": {"word": ["natural selection"]},
          "groups": ["state", "year"],
          "database": "OL"
        }


          state       year    WordsPerMillion
          [...]
          NJ  1901    0E-8
          NJ  1902    0E-8
          NJ  1903    0.52162392
          NJ  1904    0E-8
          NJ  1905    0E-8
          NJ  1906    0E-8
          NJ  1907    0.52719259
          NJ  1908    0.59582825
          NJ  1909    0.23120944
          NJ  1910    1.08461634
          [...]
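Grouping on two fields returns one row per combination, so the result arrives in long format. The sketch below shows one way to pivot such rows into a nested state-by-year lookup, again assuming tab-separated columns; `pivot_by_state` is a hypothetical helper and the rows are copied from the response above. Note that Python's `float` reads the `0E-8` entries as plain zero.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the response above, assumed to be tab-separated.
SAMPLE_TSV = (
    "state\tyear\tWordsPerMillion\n"
    "NJ\t1906\t0E-8\n"
    "NJ\t1907\t0.52719259\n"
    "NJ\t1908\t0.59582825\n"
    "NJ\t1909\t0.23120944\n"
    "NJ\t1910\t1.08461634\n"
)


def pivot_by_state(tsv_text):
    """Turn long-format rows into {state: {year: words_per_million}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        table[row["state"]][int(row["year"])] = float(row["WordsPerMillion"])
    return dict(table)


table = pivot_by_state(SAMPLE_TSV)
# Years in which the phrase appears at all in New Jersey texts.
nonzero_years = sorted(year for year, value in table["NJ"].items() if value > 0)
```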

Returning Words As Metadata


            {
              "method": "return_tsv",
              "counttype": ["WordCount"],
              "search_limits": {
                "year": [1877],
                "state": ["RI"]
              },
              "groups": ["unigram", "year"],
              "database": "presidio"
            }


              unigram     year    WordCount
              [...]
              resolve     1877    8
              resolved    1877    272
              resolves    1877    10
              resolving   1877    2
              resort      1877    10
              resorted    1877    2
              resorts     1877    4
              resound     1877    1
              [...c. 23,000 total rows...]
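Because the words come back as ordinary rows, they can be counted and filtered like any other metadata field. A small sketch using only the eight rows visible above (the full result has roughly 23,000); the prefix match on "resolv" is a crude stand-in for stemming, chosen only for illustration.

```python
from collections import Counter

# Rows copied from the response above: (unigram, year, WordCount).
ROWS = [
    ("resolve", 1877, 8),
    ("resolved", 1877, 272),
    ("resolves", 1877, 10),
    ("resolving", 1877, 2),
    ("resort", 1877, 10),
    ("resorted", 1877, 2),
    ("resorts", 1877, 4),
    ("resound", 1877, 1),
]

counts = Counter()
for unigram, _year, word_count in ROWS:
    counts[unigram] += word_count

# Most frequent word among the sampled rows.
top_word, top_count = counts.most_common(1)[0]

# Combined count for every form beginning with "resolv".
resolv_total = sum(n for word, n in counts.items() if word.startswith("resolv"))
```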

http://benschmidt.org/beta/APISandbox

Thinking small

Visualizations without underlying wordcount information.

Rate My Professors

A documentary source.

Line charts of student-speak.

Rate My Professors

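library(dplyr)    # filter(), mutate(), %>%
library(ggplot2)

# Words in reviews of female Computer Science professors with rHelpful <= 2, from 2005 onward
queryA = list("database"="RMP","search_limits" = list("gender"=list("female"),"rHelpful" = list("$lte" = list(2)),"department"=list("Computer Science"),"date_year" = list("$gte"=list(2005))),counttype=list("WordCount"),groups=list("unigram"))
# The same query for male professors
queryB = queryA
queryB[['search_limits']][['gender']] = list("male")

# compareTwoLanguages() and genderStopwords are helpers defined outside this excerpt
goodwords = compareTwoLanguages(queryA,queryB)

# dunningCutoff is a hypothetical threshold; pick one that leaves a readable number of bars
dunningCutoff = 10
historyPositive = goodwords %>% filter(!unigram %in% genderStopwords) %>% filter(abs(dunning) > dunningCutoff) %>% mutate(genderBias=ifelse(dunning>0,"Female","Male"))

ggplot(historyPositive) + geom_bar(aes(y=dunning,x=reorder(unigram,abs(dunning)),fill=genderBias),stat="identity") + coord_flip() + labs(x="Word",y="Overrepresentation (Dunning Log score)") + theme(axis.text=element_text(size=12)) + labs(title="Gender-specific words in negative CS reviews")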
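The chart's axis is a Dunning log-likelihood score. The slides do not show how `compareTwoLanguages` computes it, so the following is only a sketch of the two-term form of the log-likelihood (G²) statistic that is common in corpus comparison, signed so that positive values mean the word is over-represented in the first corpus. The function name and arguments are invented for this example, and both corpus totals are assumed to be greater than zero.

```python
import math


def signed_dunning(count_a, count_b, total_a, total_b):
    """Signed two-term log-likelihood (G^2) for one word in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total word counts of each corpus (assumed > 0).
    Positive result: the word is relatively more frequent in corpus A.
    """

    def term(observed, expected):
        # By convention a zero count contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    # Expected counts if the word were equally frequent in both corpora.
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled_rate
    expected_b = total_b * pooled_rate

    g2 = 2.0 * (term(count_a, expected_a) + term(count_b, expected_b))
    sign = 1.0 if count_a / total_a >= count_b / total_b else -1.0
    return sign * g2


# A word used 30 times per 1,000 words in corpus A and 10 times in corpus B.
score = signed_dunning(30, 10, 1000, 1000)
```

A word used at the same rate in both corpora scores zero, and swapping the two corpora flips the sign without changing the magnitude, which is what lets a single bar chart show both directions of bias.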

Extending the platform

bookworm.culturomics.org

Instructions + free hosting for medium-sized collections

bmschmidt.github.io/Presidio

Manual oriented toward extensions and building Bookworms

benschmidt.org/beta/APISandbox

API testing ground

github.com/bmschmidt/federalist-bookworm

Self-contained demo that builds a full Bookworm from an XML file of the Federalist Papers.
