Doing Digital History

Benjamin Schmidt
Assistant Professor of History, Northeastern University
Core Faculty, NULab for Texts, Maps, and Networks

www.benschmidt.org

Defining Digital History

3 Arenas for digital history

  1. Working with digital/digitized sources.
  2. Inventing and applying algorithmic methods.
  3. Publishing and preserving electronically.

Frederick Jackson Turner's map of state boundaries, AHR Volume 1, no. 1 (1895)

Digital Text Analysis in the Humanities

The 'first' Digital Humanists

In His mercy, around 1955, God led men to invent magnetic tapes.

Roberto Busa, "Perspectives on the Digital Humanities," in Schreibman et al., 2004.

Textual Research as Positivism

Mosteller and Wallace: Authorship attribution in the Federalist Papers

'Stylometrics' and 'Cliometrics'

Automated Classifications

Digital as play

"The Hermeneutics of Screwing Around" (Ramsay)

Play as Discovery

Digital reconfigurations for insight

Tools to democratize algorithmic exploration

  • Voyant tools

    Big Data

    Funding

    Media Attention

    Culturomics and Google Ngrams

    Distant Reading

  • Topic Modeling

  • Network Analysis

  • Geospatial visualization/Named Entity Recognition

  • Relationships among novels

    Matthew Jockers and Elijah Meeks

    Textual Metadata: Newspaper Locations

    Bill Lane Center for the American West:
    http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers

    Textual Metadata: Correspondence Networks

    Mapping the Republic of Letters

    Textual Metadata: Trial lengths

    Data Mining with Criminal Intent/The Old Bailey Online

    Humanistic research with huge collections

    What to do with millions of texts?

  • Nothing
  • Science
  • Hire Programmers
  • Focused Reading

    Reading Digital Sources

    1. Technical Competence
    2. Understanding of biases (Source Criticism)
    3. Technique for reading (Hermeneutics)
    4. Argument

    Texts without Authors

    Whatever vision of the digital humanities is proclaimed, it will have little place for the likes of me and for the kind of criticism I practice: a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play.

    Stanley Fish

    Working with Metadata

    Metadata shows hidden constraints: paths of ships.

    Whaling Logbooks

    Source: Baumann Rare Books
    1848 6 1     3723 29038 02 4    10ISABE*_N   1   5                                                           165 20779701 69 5 0 1                  FFFFFF77AAAAAAAAAAAA     99 0 790044118480601  3714N 6937W                                                                           NW     51 NW     57 NW     51                                          201A.STEWART       NEW BEDFORD             WHALING VOYAGE           2620 199
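    A record like the one above can be read programmatically once its fields are understood. The sketch below pulls out only the two position tokens, `3714N` and `6937W`, and converts them to decimal degrees. It assumes (the slides do not say) that each token is whole degrees followed by two digits of minutes and a hemisphere letter; the function names are illustrative, and the excerpt is a shortened copy of the record, not the full line.

```python
import re

# Shortened excerpt of the record above; the full fixed-width line is not reproduced.
excerpt = "790044118480601  3714N 6937W"

# Assumption: degrees, then two digits of minutes, then a hemisphere letter.
POSITION = re.compile(r"\b(\d{1,3})(\d{2})([NSEW])\b")


def to_decimal_degrees(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees (S and W negative)."""
    value = int(degrees) + int(minutes) / 60
    return -value if hemisphere in "SW" else value


def parse_positions(text):
    """Return every position token found in the text as decimal degrees."""
    return [round(to_decimal_degrees(*m.groups()), 3) for m in POSITION.finditer(text)]


latitude, longitude = parse_positions(excerpt)
print(latitude, longitude)
```

    Under that reading the position falls in the western North Atlantic, which is at least plausible for a New Bedford whaling voyage; the column layout of the rest of the record would need the format documentation.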
    

    Logbooks in Abstract

    Harvard University Library

    Logbooks as punchcards

    Wallbrink, H. and F.B. Koek, Data Acquisition And Keypunching Codes For Marine Meteorological Observations At The Royal Netherlands Meteorological Institute, 1854–1968

    German Merchant Marine Voyages

    Ocean Navigation, 1750-1850

    Whaling Voyages

    Undigitized Elements

    New Bedford Whaling Museum

    Whaling Vessel Crews

    New Bedford Whaling Museum

    Physical Descriptions of Whaling Crewmembers

    New Bedford Whaling Museum

    Google Ngrams

    http://books.google.com/ngrams

    02138

    Library Origins of Bookworm Volumes

    Bookworm: Exploring Texts through Metadata

    (http://bookworm.culturomics.org)

    c. 1 million books, 80 billion words

    Library metadata via Open Library

    Digital Public Library of America funding
    Team: Harvard Cultural Observatory, Rice Cultural Observatory, Northeastern University
    Martin Camacho * Neva Cherniavsky * Erez Lieberman-Aiden * JB Michel * Billy Janitsch

    Guiding philosophy

    1. Digital libraries are places to watch the interaction of metadata.
    2. Metadata is about the text (whatever scale).
    3. Words and phrases are (just?) more metadata.

    Grounding Words in Texts

    Comparing Custom Corpora

    Focus Attention

    Shifts in language happen across different temporal dimensions at once

    Cohort effects and temporal effects are evenly split

    Bookworm Arxiv

    600,000 math and physics articles from the last 20 years

    arxiv.culturomics.org

    Bookworm ChronAm

    4 million newspaper pages, 1840-1922

    From chroniclingamerica.loc.gov

    Location, Subject, and Ethnic metadata

    Bookworm: Exploring Texts with Metadata

    Using Cities as Words and Places

    Using Individual Newspaper Locations

    Rate of mentions of Topeka (each dot is one newspaper)

    States and Regions are both as important as distance

    Does race change imagined geographies?

    The Bookworm API

  • Specify request using JSON queries

  • Post using HTTP

  • Return data in JSON or TSV

  • An example query

    
      {
        "method": "return_tsv",
        "counttype": ["WordsPerMillion"],
        "search_limits": {
          "country": ["USA", "UK"],
          "word": ["natural selection"]
        },
        "groups": ["year"],
        "database": "OL"
      }
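    The query above can also be assembled in code rather than typed by hand. The following is a minimal Python sketch, not the Bookworm client itself: the endpoint URL is a placeholder (the slides do not give the API address), the helper names are hypothetical, and sending the query as a raw JSON body is an assumption about how the server expects the POST to be encoded. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: the slides do not state where the API lives.
ENDPOINT = "https://example.org/bookworm/api"


def build_query(words, groups, database, limits=None,
                counttype="WordsPerMillion", method="return_tsv"):
    """Assemble a query dict using the keys shown in the example query."""
    search_limits = dict(limits or {})
    search_limits["word"] = list(words)
    return {
        "method": method,
        "counttype": [counttype],
        "search_limits": search_limits,
        "groups": list(groups),
        "database": database,
    }


def build_request(query, endpoint=ENDPOINT):
    """Wrap the query in an HTTP POST request (assumed encoding: JSON body)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = build_query(["natural selection"], ["year"], "OL",
                    limits={"country": ["USA", "UK"]})
request = build_request(query)
print(json.dumps(query, indent=2))
```

    Keeping the query as an ordinary dictionary means the same helper can be reused with different `groups` or `search_limits` values.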
      

    The Response

    
                  year    WordsPerMillion
                  [...]
                  1907    340.20526777
                  1908    341.83114533
                  1909    295.24911692
                  1910    282.24802327
                  1911    284.92406591
                  1912    283.89805752
                  1913    296.87614627
                  1914    332.76147647
                  1915    446.39889626
                  1916    428.87396542
                  1917    527.51044740
                  1918    647.48528263
                  1919    653.05159042
                  1920    507.23177682
                  1921    501.77615474
                  [...]
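    A TSV response like this one is straightforward to load for further analysis. The sketch below uses only the Python standard library and a handful of the rows shown above (the elided `[...]` rows are left out, not guessed). It assumes the columns are separated by tabs, as the `return_tsv` method name suggests; `parse_tsv` is an illustrative name.

```python
import csv
import io

# A few rows copied from the response above; elided rows are omitted.
response_text = (
    "year\tWordsPerMillion\n"
    "1917\t527.51044740\n"
    "1918\t647.48528263\n"
    "1919\t653.05159042\n"
    "1920\t507.23177682\n"
    "1921\t501.77615474\n"
)


def parse_tsv(text):
    """Parse a year/WordsPerMillion TSV into a list of typed rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {"year": int(row["year"]),
         "WordsPerMillion": float(row["WordsPerMillion"])}
        for row in reader
    ]


rows = parse_tsv(response_text)
# Find the year with the highest rate among the rows included here.
peak = max(rows, key=lambda row: row["WordsPerMillion"])
print(peak["year"], peak["WordsPerMillion"])
```

    Among the rows included here, the highest value is the one for 1919.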
          

    Interactions among metadata

    
            {
              "method": "return_tsv",
              "counttype": ["WordsPerMillion"],
              "search_limits": {"word": ["natural selection"]},
              "groups": ["state", "year"],
              "database": "OL"
            }
            
    
              state       year    WordsPerMillion
              [...]
              NJ  1901    0E-8
              NJ  1902    0E-8
              NJ  1903    0.52162392
              NJ  1904    0E-8
              NJ  1905    0E-8
              NJ  1906    0E-8
              NJ  1907    0.52719259
              NJ  1908    0.59582825
              NJ  1909    0.23120944
              NJ  1910    1.08461634
              [...]
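    With two grouping fields, each row of the response is one state-year cell, so it is often convenient to pivot the rows into a nested table. The sketch below does that for a few of the New Jersey rows shown above, assuming tab-separated columns. Note that the `0E-8` values are zeros written in exponent notation, which Python's `float` reads as `0.0`. The `pivot` helper is an illustrative name.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the response above; elided rows are omitted.
response_text = (
    "state\tyear\tWordsPerMillion\n"
    "NJ\t1903\t0.52162392\n"
    "NJ\t1904\t0E-8\n"
    "NJ\t1909\t0.23120944\n"
    "NJ\t1910\t1.08461634\n"
)


def pivot(text):
    """Build a {state: {year: rate}} table from a state/year TSV."""
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        table[row["state"]][int(row["year"])] = float(row["WordsPerMillion"])
    return table


table = pivot(response_text)
# Years in which the phrase appears at all in this sample.
nonzero_years = sorted(year for year, rate in table["NJ"].items() if rate > 0)
print(nonzero_years)
```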
              

    Returning Words As Metadata

    
                {
                  "method": "return_tsv",
                  "counttype": ["WordCount"],
                  "search_limits": {
                    "year": [1877],
                    "state": ["RI"]
                  },
                  "groups": ["unigram", "year"],
                  "database": "presidio"
                }
                
    
                  unigram     year    WordCount
                  [...]
                  resolve     1877    8
                  resolved    1877    272
                  resolves    1877    10
                  resolving   1877    2
                  resort      1877    10
                  resorted    1877    2
                  resorts     1877    4
                  resound     1877    1
                  [...c. 23,000 total rows...]
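    When words themselves are the grouping field, the response is effectively a word-frequency table for the selected texts. The sketch below loads the eight rows shown above into a `Counter`, assuming tab-separated columns. The prefix match on `resolv` is only a crude illustration of combining related forms, not a real stemmer.

```python
import csv
import io
from collections import Counter

# The eight rows shown above; the remaining rows are omitted.
response_text = (
    "unigram\tyear\tWordCount\n"
    "resolve\t1877\t8\n"
    "resolved\t1877\t272\n"
    "resolves\t1877\t10\n"
    "resolving\t1877\t2\n"
    "resort\t1877\t10\n"
    "resorted\t1877\t2\n"
    "resorts\t1877\t4\n"
    "resound\t1877\t1\n"
)

counts = Counter()
for row in csv.DictReader(io.StringIO(response_text), delimiter="\t"):
    counts[row["unigram"]] += int(row["WordCount"])

# Combine forms sharing the prefix "resolv" (crude grouping, for illustration only).
resolve_forms = sum(n for word, n in counts.items() if word.startswith("resolv"))

print(counts.most_common(1), resolve_forms)
```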
                  