Unintended Consequences: Using digital sources to situate changes in culture



Benjamin MacDonald Schmidt

Fellow, Cultural Observatory @ Harvard

Ph.D. Candidate in History, Princeton University

Bookworm Overview

  1. High level architecture overview.
  2. The bookworms we have, right now.
  3. What the API lets you do, and how.
  4. Alternate graphical interfaces: one way forward.
  5. Towards statistical insights: some ChronAm examples of actual research to pursue.

Big Data Across the Disciplines, from Cultural Studies to Culturomics




Reading texts with Big Metadata: the Bookworm Platform for Digital Texts


Working in the Open:


Benjamin MacDonald Schmidt

Assistant Professor of History, Northeastern University

Measured Attention


Overview



1. What sort of sources are digitized texts?

2. Writerly Mistakes and Historical Memory

3. Focusing Attention: A case study

Overview



  1. The History of Attention

  2. Reading Digital Sources

  3. Focusing Attention: A case study


Github.com

Open Access Week

Ben Schmidt, History Dept.

Outline

  1. Working in Public
  2. Publishing From the Web
  3. Software

Mentions of US Presidents in Ngrams and the Chronicling America Bookworm

Joseph Priestley (1733-1804)

Modern Times (1936)

Clark Kerr, University of California

William Playfair (1759-1823)

The Bedolina Map

The Invention of Time as Space

How Bookworm Stores Data

Master Lookup Table--25 billion rows


            +--------+--------+--------+-------+
            | bookid | word1  | word2  | count |
            +--------+--------+--------+-------+
            |     31 |    589 |  55019 |     1 |
            |     38 |    101 |    708 |     1 |
            |     41 |    671 |   3341 |     1 |
            |     45 |     13 |     86 |     2 |
            |     50 |    108 |   1962 |     1 |
            |     52 |    132 |     34 |     1 |
            |     54 |    674 |     28 |     1 |
            |     54 |      2 |   5062 |     1 |
            |     56 |      7 |   1646 |     1 |
            |     58 |  17406 |   6955 |     1 |
            |     69 |   1979 |     58 |     1 |
            |     70 |    138 | 223460 |     2 |
            |     90 |      5 | 371422 |     1 |
            |     91 |   2841 |    671 |     1 |
            |     93 |   2844 |   2380 |     1 |
            |     96 |    132 |    100 |     1 |
            |    105 |     50 |    415 |     1 |
            |    106 |     18 |    385 |     1 |
            |    107 |   5072 |      4 |     1 |
            |    107 |     60 |    690 |     1 |
            |    108 |    182 |    131 |     1 |
            |    108 |   2482 |    189 |     1 |
            |    108 |      5 |  25972 |     1 |
            |    111 |      2 |   7363 |     1 |
            |    114 | 209340 |    605 |     1 |
            +--------+--------+--------+-------+

        

Metadata--900,000 rows


            +--------+------+---------+-------+-----------------------------------+
            | bookid | year | nwords  | state | title                             |
            +--------+------+---------+-------+-----------------------------------+
            | 245588 |    0 | 2613020 | NULL  | Encyclop...                       |
            | 282996 | 1997 |  165288 | AZ    | Richard Hooker and the constru... |
            | 519641 | 1997 |  219423 | AZ    | Medieval Gospel of Nicodemus...   |
            | 907596 | 1876 |   25379 | NY    | science and art of education...   |
            |  93322 | 1997 |    2466 | NY    | Clueless in Tokyo...              |
            |   7807 | 1997 |   65108 | AZ    | poems of Alcimus Ecdicius Avit... |
            | 492289 | 1996 |  123463 | KY    | McKendree College...              |
            | 987841 | 1993 |   38083 | DC    | Global climate change...          |
            | 679078 | 1993 |   46723 | DC    | North American Free Trade Agre... |
            | 244189 | 1993 |   41330 | DC    | Seafood safety...                 |
            | 609042 | 1993 |   71475 | DC    | North American Free Trade Agre... |
            |  77609 | 1993 |  100893 | DC    | Pending indoor air quality and... |
            | 108978 | 1993 |  103274 | DC    | Federal disaster policy and fu... |
            | 961066 | 1993 |   52310 | DC    | Pending nuclear legislation...    |
            | 190457 | 1993 |  111241 | DC    | Impacts of trade agreements on... |
            | 853339 | 1993 |   26997 | DC    | Developments in the Middle Eas... |
            | 818336 | 1993 |   31074 | DC    | Review of major Census Bureau ... |
            | 368923 | 1993 |   63090 | DC    | Environmental aspects of the N... |
            | 349342 | 1993 |   45708 | DC    | Operations of the Congress...     |
            | 120630 | 1993 |   47451 | DC    | Independent Counsel Reauthoriz... |
            | 286820 | 1993 |   18680 | DC    | Nomination of James Lee Witt...   |
            | 488341 | 1993 |   38158 | DC    | Impact of federal mandated mar... |
            | 298481 | 1993 |  163082 | DC    | Mineral Exploration and Develo... |
            +--------+------+---------+-------+-----------------------------------+
        

How Bookworm Stores Data

Master Lookup Table


            +--------+--------+--------+-------+
            | bookid | word1  | word2  | count |
            +--------+--------+--------+-------+
            |     31 |    589 |  55019 |     1 |
            |     38 |    101 |    708 |     1 |
            |     41 |    671 |   3341 |     1 |
            |     45 |     13 |     86 |     2 |
            |     50 |    108 |   1962 |     1 |
            |     52 |    132 |     34 |     1 |
            |     54 |    674 |     28 |     1 |
            |     54 |      2 |   5062 |     1 |
            |     56 |      7 |   1646 |     1 |
            |     58 |  17406 |   6955 |     1 |
            |     69 |   1979 |     58 |     1 |
            |     70 |    138 | 223460 |     2 |
            |     90 |      5 | 371422 |     1 |
            |     91 |   2841 |    671 |     1 |
            |    107 |   5072 |      4 |     1 |
            |    107 |     60 |    690 |     1 |
            |    108 |    182 |    131 |     1 |
            |    108 |   2482 |    189 |     1 |
            |    108 |      5 |  25972 |     1 |
            |    111 |      2 |   7363 |     1 |
            |    114 | 209340 |    605 |     1 |
            +--------+--------+--------+-------+

        

Master Words--1 million rows


            +--------+-------------+-------------+------------+----------+
            | wordid | casesens    | lowercase   | stem       | IDF      |
            +--------+-------------+-------------+------------+----------+
            |   4211 | wore        | wore        | wore       | 0.866897 |
            |   5088 | HE          | he          | he         |  1.47661 |
            |   8598 | Pieces      | pieces      | piece      |  2.72216 |
            |  19913 | Japan's     | japan's     | NULL       |  2.86523 |
            |  23351 | testament   | testament   | Testament  |  2.34616 |
            |  24504 | legged      | legged      | legs       |  1.99477 |
            |  27639 | como        | como        | como       |  3.73671 |
            |  29339 | shrinkage   | shrinkage   | shrinkage  |  2.71933 |
            |  35089 | Nina        | nina        | Nina       |  3.09251 |
            |  45784 | BERKELEY    | berkeley    | Berkeley   |  3.23562 |
            |  47416 | Heretical   | heretical   | heretics   |  5.48773 |
            |  52509 | cudgel      | cudgel      | cudgel     |  3.79054 |
            |  58293 | divino      | divino      | divino     |   4.6265 |
            |  62064 | ironing     | ironing     | iron       |  3.14454 |
            |  71846 | QUEEN'S     | queen's     | NULL       |  4.78766 |
            |  71884 | Gentoo      | gentoo      | Gentoo     |  6.16556 |
            |  77941 | attenuating | attenuating | attenuated |  4.15506 |
            |  78677 | synthetase  | synthetase  | synthetase |  4.62012 |
            |  81473 | 2020        | 2020        | NULL       |  3.40792 |
            |  81841 | quartets    | quartets    | quartet    |  4.40757 |
            |  85370 | shallop     | shallop     | shallop    |  5.08674 |
            |  87977 | warres      | warres      | warre      |  5.57029 |
            +--------+-------------+-------------+------------+----------+
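
The tables above suggest that words are replaced by integer ids before counting, so the lookup table holds only numbers. A minimal sketch of that encoding follows. It assumes whitespace tokenization and ids assigned in order of first appearance, neither of which the slides confirm, and the names `build_word_ids` and `bigram_rows` are hypothetical.

```python
from collections import Counter

def build_word_ids(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    word_ids = {}
    for text in texts.values():
        for token in text.split():
            word_ids.setdefault(token, len(word_ids) + 1)
    return word_ids

def bigram_rows(texts, word_ids):
    """Return sorted (bookid, word1, word2, count) rows, shaped like the lookup table."""
    rows = []
    for bookid, text in texts.items():
        ids = [word_ids[token] for token in text.split()]
        counts = Counter(zip(ids, ids[1:]))  # adjacent pairs of word ids
        rows.extend((bookid, w1, w2, n) for (w1, w2), n in counts.items())
    return sorted(rows)

# Toy corpus (invented for illustration): bookid -> text
texts = {31: "the whale the whale sank", 38: "the ship sank"}
word_ids = build_word_ids(texts)
rows = bigram_rows(texts, word_ids)
```

Storing ids rather than strings keeps each row a fixed, small size, which matters when the table runs to billions of rows.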

        

SQL can be fast, but ugly.


          SELECT
          year,classification,
          IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion
          FROM
          (
          SELECT
          lc1,year,
          sum(main.count) as WordCount
          FROM
          fastcat

          JOIN
          master_bookcounts as main
          ON (fastcat.bookid=main.bookid)

          JOIN ( wordsheap as words1) ON (main.wordid = words1.wordid)

          WHERE
          (( ( (year<=1922) ) ) AND ( ( (year>=1850) ) ))
            AND (( ( (words1.casesens = (SELECT casesens FROM wordsheap WHERE
            casesens='library')) ) ) OR
            ( ( (words1.casesens = (SELECT casesens FROM wordsheap WHERE casesens='libraries')) ) ))
            GROUP BY
            lc1,year

            ) as numerator
            RIGHT OUTER JOIN
            (
            SELECT
            lc1,year,
            sum(nwords) as WordCount
            FROM
            fastcat


            WHERE
            (( ( (year<=1922) ) ) AND ( ( (year>=1850) ) )) AND TRUE
              GROUP BY
              lc1,year
              ) as denominator
              USING (lc1,year )
              JOIN LCC USING (lc1)
              GROUP BY lc1,year;
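
The core of the query above is a ratio: matching word counts (numerator) over total words (denominator) for each group. A simplified sketch of that pattern, run against an in-memory SQLite database, might look like the following. The table and column names are borrowed from the slide, but the toy rows are invented, the `LCC` join is omitted, and a `LEFT JOIN` from the denominator stands in for the `RIGHT OUTER JOIN`; none of this is the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fastcat (bookid INTEGER PRIMARY KEY, year INTEGER, lc1 TEXT, nwords INTEGER);
CREATE TABLE wordsheap (wordid INTEGER PRIMARY KEY, casesens TEXT);
CREATE TABLE master_bookcounts (bookid INTEGER, wordid INTEGER, count INTEGER);
INSERT INTO fastcat VALUES (1, 1850, 'Z', 1000), (2, 1850, 'Z', 3000), (3, 1851, 'Q', 2000);
INSERT INTO wordsheap VALUES (1, 'library'), (2, 'libraries'), (3, 'whale');
INSERT INTO master_bookcounts VALUES (1, 1, 2), (2, 2, 2), (2, 3, 9), (3, 3, 5);
""")

# Denominator: all words per (lc1, year). Numerator: hits for the search words.
# Groups with no hits keep a row with a rate of zero.
sql = """
SELECT d.lc1, d.year,
       COALESCE(n.hits, 0) * 1000000.0 / d.total AS WordsPerMillion
FROM (SELECT lc1, year, SUM(nwords) AS total
      FROM fastcat
      WHERE year BETWEEN 1850 AND 1922
      GROUP BY lc1, year) AS d
LEFT JOIN (SELECT f.lc1, f.year, SUM(m.count) AS hits
           FROM fastcat AS f
           JOIN master_bookcounts AS m ON f.bookid = m.bookid
           JOIN wordsheap AS w ON m.wordid = w.wordid
           WHERE f.year BETWEEN 1850 AND 1922
             AND w.casesens IN ('library', 'libraries')
           GROUP BY f.lc1, f.year) AS n
  ON n.lc1 = d.lc1 AND n.year = d.year
ORDER BY d.lc1, d.year
"""
rows = conn.execute(sql).fetchall()
```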
      

The Bookworm API

  • Specify request using JSON queries

  • Post using HTTP

  • Return data in JSON or TSV

    An example query

    
      {
      "method": "return_tsv",
      "counttype":["WordsPerMillion"],
      "search_limits": {
      "country": ["USA","UK"],
      "word": ["natural selection"]
      },
      "groups": [
      "year"
      ],
      "database": "OL"
      }
      

    The Response

    
                  year    WordsPerMillion
                  [...]
                  1907    340.20526777
                  1908    341.83114533
                  1909    295.24911692
                  1910    282.24802327
                  1911    284.92406591
                  1912    283.89805752
                  1913    296.87614627
                  1914    332.76147647
                  1915    446.39889626
                  1916    428.87396542
                  1917    527.51044740
                  1918    647.48528263
                  1919    653.05159042
                  1920    507.23177682
                  1921    501.77615474
                  [...]
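
A client for this exchange needs only to serialize the query and parse the rows that come back. The sketch below assumes the response is tab-separated with a header row, as the `return_tsv` method name suggests. The endpoint URL is not given on the slides, so no request is sent, and `parse_tsv` is a hypothetical helper name.

```python
import csv
import io
import json

query = {
    "method": "return_tsv",
    "counttype": ["WordsPerMillion"],
    "search_limits": {"country": ["USA", "UK"], "word": ["natural selection"]},
    "groups": ["year"],
    "database": "OL",
}
payload = json.dumps(query)  # the string a client would send to the API

def parse_tsv(text):
    """Parse a tab-separated response with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {"year": int(row["year"]), "WordsPerMillion": float(row["WordsPerMillion"])}
        for row in reader
    ]

# Two rows copied from the response shown above
sample = "year\tWordsPerMillion\n1914\t332.76147647\n1915\t446.39889626\n"
series = parse_tsv(sample)
```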
          

    A longer API description is available here.

    Interactions among metadata

    
            {
            "method": "return_tsv",
            "counttype":["WordsPerMillion"],
            "search_limits": {"word": [ "natural selection" ]},
            "groups": ["state","year"],
            "database": "OL"
            }
            
    
              state       year    WordsPerMillion
              [...]
              NJ  1901    0E-8
              NJ  1902    0E-8
              NJ  1903    0.52162392
              NJ  1904    0E-8
              NJ  1905    0E-8
              NJ  1906    0E-8
              NJ  1907    0.52719259
              NJ  1908    0.59582825
              NJ  1909    0.23120944
              NJ  1910    1.08461634
              [...]
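
With two grouping fields, each row is keyed by a (state, year) pair, and it is often handier to pivot the result into one series per state. A small sketch, assuming tab-separated rows without the header (`pivot_by_state` is a hypothetical name):

```python
from collections import defaultdict

def pivot_by_state(lines):
    """Turn 'state<TAB>year<TAB>value' lines into {state: {year: value}}."""
    table = defaultdict(dict)
    for line in lines:
        state, year, value = line.split("\t")
        table[state][int(year)] = float(value)  # "0E-8" parses as 0.0
    return dict(table)

# Three rows copied from the response shown above
sample = ["NJ\t1903\t0.52162392", "NJ\t1904\t0E-8", "NJ\t1910\t1.08461634"]
by_state = pivot_by_state(sample)
nonzero_years = sorted(year for year, rate in by_state["NJ"].items() if rate > 0)
```

Values such as `0E-8` are zeros printed in scientific notation, so they should be read as "no occurrences" rather than as missing data.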
              

    Returning Words As Metadata

    
                {
                "method": "return_tsv",
                "counttype":["WordCount"],
                "search_limits": {
                "year":[1877],
                "state":["RI"]
                },
                "groups": ["unigram","year"],
                "database": "presidio"}
                
    
                  unigram     year    WordCount
                  [...]
                  resolve     1877    8
                  resolved    1877    272
                  resolves    1877    10
                  resolving   1877    2
                  resort      1877    10
                  resorted    1877    2
                  resorts     1877    4
                  resound     1877    1
                  [...c. 23,000 total rows...]
                  

    Library Origins of Bookworm Volumes

    Dissertation titles with years

    eg: ("Psychology in American Culture, 1890-1960")

    Bicycle Season

    Using Cities as Words and Places

    Using Individual Newspaper Locations

    Rate of mentions of Topeka (each dot is one newspaper)

    States and Regions are both as important as distance

    Does race change imagined geographies?

    Digital as play

    "The Hermeneutics of Screwing Around" (Ramsay)

    Play as Discovery

    Digital reconfigurations for insight

    Tools to democratize algorithmic exploration

  • Voyant tools

  • Google Ngrams

    http://books.google.com/ngrams

    Distant Reading

  • Topic Modeling

  • Network Analysis

  • Geospatial visualization/Named Entity Recognition

  • Relationships among novels

    Matthew Jockers and Elijah Meeks

    Texts without Authors

    Whatever vision of the digital humanities is proclaimed, it will have little place for the likes of me and for the kind of criticism I practice: a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play.

    Stanley Fish

    What writers don't mean to say:
    or, 'Tis 60 (± 10) Years Since

    Google Ngrams

    books.google.com/ngrams

    The History of Non-Occurrences

    books.google.com/ngrams

    The Historical Novel

    Ordinary Individuals experiencing historical change

    Whaling Logbooks

    Source: Baumann Rare Books
    1848 6 1     3723 29038 02 4    10ISABE*_N   1   5                                                           165 20779701 69 5 0 1                  FFFFFF77AAAAAAAAAAAA     99 0 790044118480601  3714N 6937W                                                                           NW     51 NW     57 NW     51                                          201A.STEWART       NEW BEDFORD             WHALING VOYAGE           2620 199
    

    Logbooks in Abstract

    Harvard University Library

    Logbooks as punchcards

    Wallbrink, H. and F.B. Koek, Data Acquisition And Keypunching Codes For Marine Meteorological Observations At The Royal Netherlands Meteorological Institute, 1854–1968

    Undigitized Elements

    New Bedford Whaling Museum

    Whaling Vessel Crews

    New Bedford Whaling Museum

    Physical Descriptions of Whaling Crewmembers

    New Bedford Whaling Museum

    German Merchant Marine Voyages

    Whaling Voyages

    Ocean Navigation, 1750-1850

    Climatological Metadata is Historical Data

    Shipping Routes

    Textual Metadata: Newspaper Locations

    Bill Lane Center for the American West:
    http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers
    "We begin to feel, Monsieur L'Abbe," answered the vicar, with some asperity, "that a Continental war entered into for the defence of an ally who was unwilling to defend himself, and for the restoration of a royal family, nobility, and priesthood who tamely abandoned their own rights, is a burden too much even for the resources of this country."

    Walter Scott, Waverley

    Anachronisms Reflect What Change is Considered "Historical"

    Mad Men with Computers.

    What to do with millions of texts?

  • Nothing
    What to do with millions of texts?

    Science

    Harvard Cultural Observatory

    What to do with millions of texts?

    Hire Programmers

    What to do with millions of texts?

    Focused Reading

    Big Data as just another source

  • Digital sources contribute knowledge beyond individuals.
  • Metadata lets us look at social structures.
  • Once we know how to read it, we work around its biases like any other source.
    3 Arenas for digital history

    1. Working with digital/digitized sources.
    2. Inventing and applying algorithmic methods.
    3. Publishing and Preserving electronically.

    Tools for working with text

    1. Ngrams/Search engines
    2. Voyant Tools (voyant-tools.org)
    3. MALLET (topic modeling and machine learning)

  • Python, R, or maybe Java (real analysis)

  • Your computer's command line.

    Historical Geography

    1. ArcGIS.
    2. QGIS.
    3. Web-based geographic visualization.

        1. Leaflet
        2. D3.js
        3. MapBox

    Working with Statistics

    1. Excel, databases.
    2. The R Language (use the "Hadleyverse")
    3. Publishing and Preserving electronically.

    --Ted Underwood, University of Illinois

    What to do with a new source

  • Source Criticism

    02138

    Google Ngrams

    http://books.google.com/ngrams

    What to do with a new source

  • Source Criticism

  • Hermeneutics

  • Argument

    What do we need to describe large data sets?

  • Basic Technical Skills

  • Understanding of biases (Source Criticism)

  • Way of reading (Hermeneutics)

  • Argument

    "Big data" needs humanists

    Reading Digital Sources

    1. Technical Competence
    2. Understanding of biases (Source Criticism)
    3. Technique for reading (Hermeneutics)
    4. Argument

    Textual Metadata: Trial lengths

    Data Mining with Criminal Intent/The Old Bailey Online

    Textual Metadata: Correspondence Networks

    Mapping the Republic of Letters

    How should historians approach digitization?

    1. Improve our digital copies.

    2. Treat them as new sources altogether, and see what they're good for.

    Grounding Words in Texts

    Bookworm: Exploring Texts through Metadata

    (http://bookworm.culturomics.org)

    c. 1 million books, 80 billion words

    Library metadata via Open Library

    Digital Public Library of America funding
    Team: Harvard Cultural Observatory, Rice Cultural Observatory, Northeastern University
    Martin Camacho * Neva Cherniavsky * Erez Lieberman-Aiden * JB Michel * Billy Janitsch

    Comparing Custom Corpora

    Bookworm: Exploring Texts with Metadata

    Bookworm Arxiv

    600,000 math and physics articles from the last 20 years

    arxiv.culturomics.org

    Bookworm JStor

    Just the titles of 30,000 dissertations

    New York Times

    14 million newspaper articles

    Gender tagging on authors and subjects.

    SSRN

    c. 600K articles

    Matt Nicklay

    Bookworm ChronAm

    4 million newspaper pages, 1840-1922

    From chroniclingamerica.loc.gov

    Location, Subject, and Ethnic metadata

    Presidents

    1896 Election

    Presidential

    Presidential

    Bookworm: Exploring Texts with Metadata

    queryA = list("database"="RMP","search_limits" = list("gender"=list("female"),"rHelpful" = list("$lte" = list(2)),"department"=list("Computer Science"),"date_year" = list("$gte"=list(2005))),counttype=list("WordCount"),groups=list("unigram"))
    queryB = queryA
    queryB[['search_limits']][['gender']] = list("male")
    
    goodwords = compareTwoLanguages(queryA,queryB)
    
    historyPositive=goodwords %.% filter(!unigram %in% genderStopwords) %.% filter(-abs(dunning)0,"Female","Male"))
    
    ggplot(historyPositive) + geom_bar(aes(y=dunning,x=reorder(unigram,abs(dunning)),fill=genderBias),stat="identity") + coord_flip() + labs(x="Word",y="Overrepresentation (Dunning Log score)")  + theme(axis.text=theme_text(size=12)) + labs(title="Gender-specific words in negative CS reviews")
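
The `dunning` column plotted above is a log-likelihood score for how strongly a word is overrepresented in one corpus relative to another. The slides do not show how `compareTwoLanguages` computes it, so the following is only a sketch of one common two-term form of Dunning's G2 statistic, signed so that positive values mean "more frequent in corpus A". The function name and argument names are hypothetical.

```python
import math

def dunning(count_a, count_b, total_a, total_b):
    """Signed Dunning log-likelihood (G2) for one word in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total words in each corpus (assumed positive).
    Positive when the word is relatively more frequent in corpus A.
    """
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled_rate
    expected_b = total_b * pooled_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing (0 * log 0 taken as 0)
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    return g2 if count_a / total_a >= count_b / total_b else -g2

# Invented counts: a word seen 20 times in 1,000 words of A and 5 times in 1,000 of B
score = dunning(20, 5, 1000, 1000)
```

Unlike a raw ratio of frequencies, this score grows with the amount of evidence, so rare words with lopsided but tiny counts do not dominate the chart.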
    

    Measuring Attention, c. 1890

    Focusing Attention

  • bookworm.culturomics.org: instructions + free hosting for medium-sized collections
  • bmschmidt.github.io/Presidio: manual oriented towards extensions and building Bookworms
  • benschmidt.org/beta/APISandbox: API testing ground
  • github.com/bmschmidt/federalist-bookworm: all-included demo that creates a full Bookworm from an XML file of the Federalist Papers

    Guiding philosophy

    1. Digital libraries are places to watch the interaction of metadata.
    2. Metadata is about the text (whatever scale).
    3. Words and phrases are (just?) more metadata.

  • Metadata describes the world we care about
  • Huge metadata collections are worth studying on their own
    The History of Attention

    Adjectives, too

    Concentrate Attention
    Focus Attention

    Attention as self-evident

    Everyone knows what attention is. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought.

    William James, Principles of Psychology

    Focusing Attention: a psychological metaphor

    Everyone knows what attention is. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration, of consciousness are of its essence.

    William James, Principles of Psychology

    Mind as Camera

    Those who lay stress on the unity of mind regard it as almost evident a priori, that but one concept can occupy the focus of attention at a time... Attention, like the lens of the eye, is now [ie, first] accommodated to act as an instrument of near focus, high magnification, but limited aperture, and again [then] as one of distant focus, small magnifying-power, but wide range

    Can the Mind attend to two things at once? Science: July 18, 1887





    Epilogue: The Signification of the Frontier in American History

    Shifts in language happen across different temporal dimensions at once

    Cohort effects and temporal effects are evenly split

    The 'first' Digital Humanists

    In His mercy, around 1955, God led men to invent magnetic tapes.

    Roberto Busa, "Perspectives on the Digital Humanities," in Schreibman et al., 2004.

    Textual Research as Positivism

    Mosteller and Wallace: Authorship attribution in the Federalist Papers

    'Stylometrics' and 'Cliometrics'

    Automated Classifications

    Digital as play

    "The Hermeneutics of Screwing Around" (Ramsay)

    Play as Discovery

    Digital reconfigurations for insight

    Tools to democratize algorithmic exploration

    Big Data

    Funding

    Media Attention

    Culturomics and Google Ngrams

    Distant Reading

  • Topic Modeling

  • Network Analysis

  • Geospatial visualization/Named Entity Recognition

  • Relationships among novels

    Matthew Jockers and Elijah Meeks

    Texts without Authors

    Whatever vision of the digital humanities is proclaimed, it will have little place for the likes of me and for the kind of criticism I practice: a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play.

    Stanley Fish

    Working with Metadata

    Metadata shows hidden constraints: paths of ships.
