Visualizing & Classifying full-text digital libraries

Benjamin Schmidt

2016-07-29

  1. How can data visualization help us build exploratory interfaces to huge digital libraries?
  2. How can we use digital full-text to facilitate discovery and improve library metadata?

Bookworm

Google Ngrams

  1. Most users are interested in some subset of books, not everything scanned by Google
  2. We really don't know what was scanned by Google

Bookworm Project

Institutions

  • Northeastern University/Rice University Cultural Observatory

People

  • Erez Lieberman Aiden
  • Neva Cherniavsky, Martin Camacho, Matt Nicklay, Billy Janitsch, JB Michel, Muhammad Shamim.

HathiTrust Grant Partners

Stephen Downie, Peter Organisciak, Loretta Auvil, Colleen Fallaw, Robert McDonald

Acknowledgements

Funders

  • Digital Public Library of America
  • Harvard Cultural Observatory
  • National Endowment for the Humanities

Partners

  • HathiTrust Research Center
  • github.com/Bookworm-project
  • benschmidt.org/slides/NYPL

HathiTrust Bookworm at bookworm.htrc.illinois.edu

Some Bookworm instances at Northeastern and Rice:

Yale University: full run of Vogue Magazine

http://bookworm.library.yale.edu/collections/vogue/

Medical Heritage Library (80,000 medical books and a growing number of journals)

http://mhlbookworm.ngrok.io/

Columbia History Lab

http://www.history-lab.org/about-collections

US State Department

Uses of "bicycle", by week, in newspapers

{
    "database": "ChronAm",
    "plotType": "heatmap",
    "search_limits": {
        "publish_year": {"$gte": 1860},
        "word": ["bicycle"]
    },
    "aesthetic": {
        "x": "publish_year",
        "y": "publish_week_year",
        "color": "WordsPerMillion"
    }
}

View
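These visualization specs are plain JSON queries sent to a Bookworm server. As a minimal sketch, here is one way to issue such a query from Python; the host and endpoint path are placeholder assumptions for illustration, not part of these slides:

import json

import requests

query = {
    "database": "ChronAm",
    "plotType": "heatmap",
    "search_limits": {
        "publish_year": {"$gte": 1860},
        "word": ["bicycle"]
    },
    "aesthetic": {
        "x": "publish_year",
        "y": "publish_week_year",
        "color": "WordsPerMillion"
    }
}

# The URL below is a placeholder: consult your Bookworm installation's
# configuration for its actual query endpoint.
response = requests.get(
    "http://bookworm.example.org/cgi-bin/query",
    params={"query": json.dumps(query)},
)
print(response.json())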

Publication places over time of all the public-domain volumes in HathiTrust

{
    "database": "hathipd",
    "plotType": "map",
    "method": "return_json",
    "search_limits": {
        "date_year": {"$gte": 1800, "$lte": 1922}
    },
    "projection": "albers",
    "aesthetic": {
        "time": "date_year",
        "point": "publication_place_geo",
        "size": "TextCount"
    }
}

View

Newspaper flu coverage, 1917-1919, by day

{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {"word":["flu","influenza","pneumonia"],
           "publish_year":{"$lte":1920,"$gte":1918}},
"smoothingSpan":25,
"aesthetic": {
    "time":"publish_day",
  "point": "placeOfPublication_geo",
  "size": "TextPercent"}}

View

Contextual Reading of State of the Union Addresses

benschmidt.org/profGender

Embedding texts

Why embed?

  1. Vectorized documents are a necessary input to most machine-learning algorithms.
  2. For finding similar books.
  3. For correcting faulty metadata.
    1. All the more important now that metadata is often the only means of discovery for books.

There are too many words.

Map of the HathiTrust

What's the best way to reduce dimensionality?

  1. Principal Components Analysis on the Term-Document matrix
  2. Latent Semantic Indexing (sketched after this list)
  3. Topic Modeling
  4. Locality Sensitive Hashing (Semantic Hashing, etc.)
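For concreteness, a minimal sketch of option 2, Latent Semantic Indexing, via truncated SVD in scikit-learn; the toy corpus is invented for illustration, standing in for one string per scanned volume:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy stand-in corpus; in practice, one string per scanned volume.
docs = [
    "de rerum natura atoms and the void",
    "atoms void and epicurean physics",
    "fashion photography in vogue magazine",
    "magazine covers and fashion photography",
]
counts = CountVectorizer().fit_transform(docs)  # sparse term-document matrix
lsi = TruncatedSVD(n_components=2)              # truncated SVD = LSI
vectors = lsi.fit_transform(counts)             # one dense vector per document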

Reasons not to use the best methods.

  1. Computationally intractable: the full term-document matrix has trillions of entries; even a sparse representation has billions.
  2. Difficult to distribute; users at home can't embed documents in your space.
  3. Any embedded space is optimal only for the text collection it was trained on.

The stable random projection (SRP) process

  1. Start with word frequencies.
  2. Log-transform the frequencies.
  3. Choose an output dimensionality.
  4. Randomly project the frequencies into the lower-dimensional space, using SHA-1 hashes to make the projection reproducible (a sketch follows this list).
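A minimal sketch of the process, assuming each word's projection vector is a ±1 vector whose signs come from the word's SHA-1 digest; the real implementation may differ in details such as the default dimensionality and the exact bit-expansion scheme:

import hashlib

import numpy as np

def srp_embed(word_counts, dim=160):
    """Embed a bag of words in a fixed low-dimensional space.

    Each word's projection vector is derived from its SHA-1 digest,
    so anyone can reproduce the space without sharing a model.
    """
    vec = np.zeros(dim)
    for word, count in word_counts.items():
        # Step 2: log-transform the raw frequency.
        weight = np.log1p(count)
        # Step 4: build a reproducible +/-1 vector from SHA-1 bits,
        # hashing "word_0", "word_1", ... until we have dim bits.
        bits = []
        salt = 0
        while len(bits) < dim:
            digest = hashlib.sha1(f"{word}_{salt}".encode("utf-8")).digest()
            for byte in digest:
                for k in range(8):
                    bits.append(1.0 if (byte >> k) & 1 else -1.0)
            salt += 1
        vec += weight * np.asarray(bits[:dim])
    return vec

print(srp_embed({"bicycle": 3, "influenza": 1})[:5])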

Neighbor searches for full-text

Identifying neighbors
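Once every volume is reduced to a short dense vector, finding neighbors is just a cosine-similarity search. A brute-force sketch (a production system might prefer an approximate-nearest-neighbor index); srp_embed refers to the sketch above:

import numpy as np

def nearest_neighbors(query_vec, library, k=10):
    """Indices of the k rows of `library` most cosine-similar to `query_vec`."""
    norms = np.linalg.norm(library, axis=1) * np.linalg.norm(query_vec)
    sims = (library @ query_vec) / np.maximum(norms, 1e-12)
    return np.argsort(-sims)[:k]

# e.g. library = np.vstack([srp_embed(counts) for counts in all_books])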

Classifying subjects

Why classify automatically?

  1. Make discovery easier on unlabeled collections.
  2. Identify mistakes in metadata (see the classifier sketch after this list).
  3. Better understand the biases and prejudices of standard knowledge organization.
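The slides don't name a specific model; one plausible setup, sketched here purely as an assumption, is a multinomial logistic regression from SRP vectors to Library of Congress classes, with disagreements between prediction and catalog flagged for review (the training data below are random toy stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one SRP vector and one cataloged LC
# class per volume (random toy stand-ins here).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 160))            # SRP vectors
y = rng.choice(["P", "Q", "D"], size=200)  # cataloged LC classes

model = LogisticRegression(max_iter=1000).fit(X, y)

# Where the model confidently disagrees with the catalog, either the
# metadata or the model deserves a second look.
guesses = model.predict(X)
suspect = np.flatnonzero(guesses != y)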

The Library of Congress Classification

Organization of cataloging at the Library of Congress, 1909

Success by language

{
    "database": "hathipd",
    "plotType": "barchart",
    "method": "return_json",
    "search_limits": {
        "languages__id": {"$lte": 10},
        "LCC_guess_is_correct": ["True"]
    },
    "compare_limits": {
        "languages__id": {"$lte": 10},
        "LCC_guess_is_correct": ["False", "True"]
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "languages"
    }
}

View

Error by subclass

{
    "database": "hathipd",
    "plotType": "barchart",
    "method": "return_json",
    "search_limits": {
        "LCC_guess_is_correct": ["False"]
    },
    "compare_limits": {
        "LCC_guess_is_correct": ["False", "True"]
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "lc_classes"
    }
}

View

NYPL classed

Online classifier for anything

Classifying dates

Representing time as a ratchet:

To encode 1985 in the range 1980-1990: [0,0,0,0,0,1,1,1,1,1]
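A one-liner reproducing that encoding, assuming the convention that a unit fires for every year at or after the target:

def ratchet_encode(year, start=1980, end=1990):
    # One indicator per year in [start, end); 1 once `year` is reached.
    return [1 if y >= year else 0 for y in range(start, end)]

print(ratchet_encode(1985))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]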

Classification of Lucretius.

The blue line is the classifier's probability for each year; the red vertical line is the actual date. The outer bands are 90% confidence intervals.

Next steps

  • Subject Headings (much denser)
  • Individual research collections (in or out)
  • Geographical subject areas
  • OCR error
  • Full-vocabulary models, not projected.
  • SRP with bigrams and trigrams.