Starting to Text Mine the Digitized Library with HathiTrust Features

Ben Schmidt

2020-02-06

What books are scanned?

Google

hathitrust.org

What books have been scanned?

¯_(ツ)_/¯

16 million is about 6 of these.


Two of these?

Scientific American 1911/NYPL digital collections

Some arbitrary language counts

  • 341 Akkadian
  • 454 Basque
  • 33,000 Danish
  • 1,596 Hausa
  • 7,505 Yiddish

Library Contributors

Call numbers (Library of Congress)

Languages

Scanners

Scatterplot of 15 million books

Google patent

Image Alignment errors: Framingham Heart Study scans

What OCR errors looked like in 2010

What OCR errors look like from Google today

How do you build up a list of Hathi volumes to work with?

Hathi’s data model.

Bibliographic Records

LCCN, OCLCid, etc.

Volumes: individual scans.

e.g., ‘pst.000020027568’
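A volume ID like the one above has two parts: a short prefix naming the contributing library, then that library's own identifier for the scan. A minimal sketch of splitting one apart, assuming the first period is always the separator (the function name `split_htid` is made up for illustration):

```python
def split_htid(htid):
    """Split a volume ID into (library prefix, local identifier).

    Assumption: the prefix never contains a period, so the first period
    marks the boundary. Local identifiers may contain further periods.
    """
    prefix, sep, local_id = htid.partition(".")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not of the form prefix.local_id: {htid!r}")
    return prefix, local_id


print(split_htid("pst.000020027568"))  # ('pst', '000020027568')
```

One bibliographic record can have many volumes under it, so the volume ID, not the record, is the key to use when collecting files.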

Hathifiles
www.hathitrust.org/hathifiles
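The Hathifiles are tab-delimited metadata dumps, one row per volume, so building a worklist is mostly a matter of filtering rows. The sketch below uses only the standard library. The four-column layout, the sample rows, and the `xyz` prefix are invented for illustration; the real files have many more columns, and the sketch assumes the column names are supplied by the caller rather than read from a header row.

```python
import csv
import io

# Hypothetical, shortened column list. Check the Hathifiles documentation
# for the real field order before using this on actual data.
COLUMNS = ["htid", "access", "title", "lang"]

# Made-up rows standing in for a downloaded Hathifile.
SAMPLE = (
    "xyz.000001\tallow\tAn example title\teng\n"
    "xyz.000002\tdeny\tEt eksempel\tdan\n"
    "xyz.000003\tallow\tEndnu et eksempel\tdan\n"
)


def select_volumes(lines, columns, **criteria):
    """Return the volume IDs of rows whose fields equal every criterion."""
    reader = csv.DictReader(
        lines, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row["htid"]
        for row in reader
        if all(row.get(field) == value for field, value in criteria.items())
    ]


print(select_volumes(io.StringIO(SAMPLE), COLUMNS, lang="dan", access="allow"))
# ['xyz.000003']
```

For a real file, pass an open file handle in place of the `io.StringIO` wrapper; the reader streams row by row, so the whole file never has to fit in memory.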

“Feature Count” data.

Authors Guild v. HathiTrust

Feature Counts:

Page-level word counts.
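To make "page-level word counts" concrete: each volume is, in effect, a list of pages, and each page is a bag of words with counts. The toy structure below is a hypothetical stand-in, not the actual file format, but it shows the two most common first moves: rolling pages up into a volume total, and finding which pages mention a word.

```python
from collections import Counter

# Hypothetical stand-in for one volume: one Counter of tokens per page.
pages = [
    Counter({"the": 3, "whale": 1}),
    Counter({"the": 2, "sea": 2, "whale": 2}),
]


def volume_counts(pages):
    """Sum the per-page counts into a single count for the whole volume."""
    total = Counter()
    for page in pages:
        total.update(page)
    return total


def pages_mentioning(pages, word):
    """Return the 1-based page numbers on which `word` appears."""
    return [n for n, page in enumerate(pages, start=1) if page[word] > 0]


print(volume_counts(pages)["whale"])   # 3
print(pages_mentioning(pages, "sea"))  # [2]
```

Word order within a page is gone, which is the point: the counts support analysis without reproducing the text.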

Feature count data: ~16,000,000 books, accessed through rsync in a pairtree format.

This takes about 4TB of storage.
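A pairtree spreads files across nested two-character directories derived from the identifier, so that no single directory holds millions of entries. The sketch below turns a volume ID into a relative path. It makes two assumptions worth checking against the current dataset documentation: it handles only the three single-character substitutions from the Pairtree convention (not the hex-escaping of rarer characters), and the directory layout and `.json.bz2` file name are assumed rather than taken from the source.

```python
# Pairtree single-character substitutions for filesystem-unsafe characters.
_PAIRTREE_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_clean(local_id):
    """Apply the single-character substitutions (hex-escaping not handled)."""
    return local_id.translate(_PAIRTREE_MAP)


def pairtree_segments(clean_id, width=2):
    """Cut the cleaned identifier into directory names of `width` characters."""
    return [clean_id[i:i + width] for i in range(0, len(clean_id), width)]


def feature_file_path(htid):
    """Build a relative path for one volume (assumed layout and extension)."""
    prefix, _, local_id = htid.partition(".")
    clean = pairtree_clean(local_id)
    parts = [prefix, "pairtree_root", *pairtree_segments(clean), clean,
             f"{prefix}.{clean}.json.bz2"]
    return "/".join(parts)


print(feature_file_path("pst.000020027568"))
# pst/pairtree_root/00/00/20/02/75/68/000020027568/pst.000020027568.json.bz2
```

A list of such paths is what gets handed to rsync to fetch specific volumes instead of the whole collection. For scale: about 4 TB across roughly 16 million volumes works out to around 250 KB per volume on average.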

Specific IDs

Python: HTRC Feature Reader (Organisciak et al.). Powerful.

R: “hathidy” (Schmidt).

if (!require("remotes")) install.packages("remotes")
remotes::install_github("HumanitiesDataAnalysis/hathidy")

JavaScript/online: suitable for quick looks.

[https://observablehq.com/@bmschmidt/book-visualizations-sandbox]

[https://observablehq.com/@bmschmidt/some-notes-on-the-statistics-of-defining-characteristic-wo]

What sort of questions can you answer with these word counts, anyway?

Underwood, Distant Horizons: Digital Evidence and Literary Change, University of Chicago Press, 2019