Reading texts with Big Metadata: the Bookworm Platform for Digital Texts


Benjamin MacDonald Schmidt

Fellow, Cultural Observatory @ Harvard

Ph.D. Candidate in History, Princeton University

Working in the Open:


Benjamin MacDonald Schmidt

Assistant Professor of History, Northeastern University

Bookworm: Exploring Texts through Metadata

(http://bookworm.culturomics.org)

c. 1 million books, 80 billion words

Library metadata via Open Library

Digital Public Library of America funding
Team: Harvard Cultural Observatory, Rice Cultural Observatory, Northeastern University
Martin Camacho * Neva Cherniavsky * Erez Lieberman-Aiden * JB Michel * Billy Janitsch

Guiding philosophy

  1. Digital libraries are places to watch the interaction of metadata.
  2. Metadata is about the text (whatever scale).
  3. Words and phrases are (just?) more metadata.

  • Metadata describes the world we care about
  • Huge metadata collections are worth studying on their own
  • Climatological Metadata is Historical Data

    Textual Metadata: Newspaper Locations

    Bill Lane Center for the American West:
    http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

    Textual Metadata: Correspondence Networks

    Mapping the Republic of Letters

    Textual Metadata: Trial lengths

    Data Mining with Criminal Intent/The Old Bailey Online

    Grounding Words in Texts

    Google Ngrams

    http://books.google.com/ngrams

    Comparing Custom Corpora

    Bookworm Arxiv

    600,000 math and physics articles from the last 20 years

    arxiv.culturomics.org

    Bookworm ChronAm

    4 million newspaper pages, 1840-1922

    From chroniclingamerica.loc.gov

    Location, Subject, and Ethnic metadata

    Bookworm: Exploring Texts with Metadata

    The Bookworm API

  • Request data using JSON queries

  • Send queries as HTTP GET requests

  • Return data in JSON and TSV

  • An example query

    
                {
                  "method": "return_tsv",
                  "counttype": ["WordsPerMillion"],
                  "search_limits": {
                    "country": ["USA", "UK"],
                    "word": ["natural selection"]
                  },
                  "groups": ["year"],
                  "database": "OL"
                }
          

    The Response

    
                  year    WordsPerMillion
                  [...]
                  1907    340.20526777
                  1908    341.83114533
                  1909    295.24911692
                  1910    282.24802327
                  1911    284.92406591
                  1912    283.89805752
                  1913    296.87614627
                  1914    332.76147647
                  1915    446.39889626
                  1916    428.87396542
                  1917    527.51044740
                  1918    647.48528263
                  1919    653.05159042
                  1920    507.23177682
                  1921    501.77615474
                  [...]
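
A minimal sketch, in Python, of how a client might package the query above as a GET request and read the TSV that comes back. The endpoint address and the `query` parameter name are assumptions made for illustration, since the slides do not state them; only the query fields and the two-column response layout come from the example. The response is assumed to be tab-separated with a header row.

```python
import csv
import io
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: the slides do not give the real API address.
API_URL = "http://example.org/bookworm/api"


def build_request_url(query, base_url=API_URL, param="query"):
    """Serialize the query dict to JSON and attach it to a GET URL.

    The parameter name "query" is an assumption, not a documented name.
    """
    return base_url + "?" + urlencode({param: json.dumps(query)})


def parse_tsv(text):
    """Parse a tab-separated response into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# The query from the slide, written as a Python dict.
query = {
    "method": "return_tsv",
    "counttype": ["WordsPerMillion"],
    "search_limits": {"country": ["USA", "UK"], "word": ["natural selection"]},
    "groups": ["year"],
    "database": "OL",
}
url = build_request_url(query)

# Two rows copied from the response above, joined with tabs.
sample = "year\tWordsPerMillion\n1907\t340.20526777\n1908\t341.83114533\n"
rows = parse_tsv(sample)
series = {int(r["year"]): float(r["WordsPerMillion"]) for r in rows}
```

Against a live server, the text passed to `parse_tsv` would come from something like `urllib.request.urlopen(url).read().decode("utf-8")`; the sample string stands in for that here so the sketch runs offline.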
          

    Interactions among metadata

    
                {
                  "method": "return_tsv",
                  "counttype": ["WordsPerMillion"],
                  "search_limits": {"word": ["natural selection"]},
                  "groups": ["state", "year"],
                  "database": "OL"
                }
          
    
                state       year    WordsPerMillion
                [...]
                NJ  1901    0E-8
                NJ  1902    0E-8
                NJ  1903    0.52162392
                NJ  1904    0E-8
                NJ  1905    0E-8
                NJ  1906    0E-8
                NJ  1907    0.52719259
                NJ  1908    0.59582825
                NJ  1909    0.23120944
                NJ  1910    1.08461634
                [...]
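
Grouping on two fields returns one row per state and year, so a client usually has to pivot the rows before plotting. The following is a rough sketch in Python, assuming the response is tab-separated with the header shown above; the function name `pivot` is hypothetical. Values such as `0E-8` are ordinary scientific notation for zero and parse as `0.0`.

```python
from collections import defaultdict


def pivot(tsv_text):
    """Turn state/year/value rows into a nested dict {state: {year: value}}."""
    lines = tsv_text.strip().splitlines()
    header = lines[0].split("\t")
    table = defaultdict(dict)
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        # float() accepts "0E-8" and returns 0.0
        table[row["state"]][int(row["year"])] = float(row["WordsPerMillion"])
    return dict(table)


# A few rows copied from the response above, joined with tabs.
sample = "\n".join([
    "state\tyear\tWordsPerMillion",
    "NJ\t1903\t0.52162392",
    "NJ\t1904\t0E-8",
    "NJ\t1910\t1.08461634",
])
by_state = pivot(sample)

# Years in which the phrase appears at all in New Jersey texts.
nonzero_years = sorted(y for y, v in by_state["NJ"].items() if v > 0)
```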
          

    Returning Words As Metadata

    
                {
                  "method": "return_tsv",
                  "counttype": ["WordCount"],
                  "search_limits": {
                    "year": [1877],
                    "state": ["RI"]
                  },
                  "groups": ["unigram", "year"],
                  "database": "presidio"
                }
          
    
                unigram     year    WordCount
                [...]
                resolve     1877    8
                resolved    1877    272
                resolves    1877    10
                resolving   1877    2
                resort      1877    10
                resorted    1877    2
                resorts     1877    4
                resound     1877    1
                [...c. 23,000 total rows...]
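
Because the words come back as an ordinary grouped column, they can be ranked like any other metadata field. This is a small Python sketch under the same assumption of a tab-separated response with the three columns shown above; `top_words` is a hypothetical helper name, and ties are broken alphabetically as an arbitrary choice.

```python
def top_words(tsv_text, n=3):
    """Return the n most frequent (unigram, count) pairs from a TSV response."""
    lines = tsv_text.strip().splitlines()
    counts = []
    for line in lines[1:]:  # skip the header row
        unigram, _year, count = line.split("\t")
        counts.append((unigram, int(count)))
    # Highest count first; alphabetical order breaks ties.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


# Rows copied from the response above, joined with tabs.
sample = "\n".join([
    "unigram\tyear\tWordCount",
    "resolve\t1877\t8",
    "resolved\t1877\t272",
    "resolves\t1877\t10",
    "resolving\t1877\t2",
    "resort\t1877\t10",
])
ranked = top_words(sample)
```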
          

    Using Cities as Words and Places

    Using Individual Newspaper Locations

    Rate of mentions of Topeka (each dot is one newspaper)

    States and Regions are both as important as distance

    Does race change imagined geographies?
