Starting to Text Mine the Digitized Library with HathiTrust Features

Ben Schmidt

2020-02-06

What books are scanned?

Google

hathitrust.org

What books have been scanned?

¯\_(ツ)_/¯

16 million is about 6 of these.

Two of these?

Scientific American 1911/NYPL digital collections

Some arbitrary language counts

Library Contributors

Call numbers (Library of Congress)

Languages

Scanners

Google patent

Image Alignment errors: Framingham Heart Study scans

What OCR errors looked like in 2010

What OCR errors look like from Google today

Hathi’s data model.

Bibliographic Records

LCCN, OCLCid, etc.

Volumes: individual scans.

e.g., ‘pst.000020027568’

Hathifiles
www.hathitrust.org/hathifiles
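
A rough sketch of a first pass over a Hathifile, assuming it is tab-delimited with the volume ID in the first column (check the field list on the Hathifiles page before relying on this). The sample rows are made up for illustration, apart from the ID shown earlier; the function name is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration. Only the first column (volume ID) is used;
# the second column is a placeholder standing in for the remaining fields.
sample = (
    "pst.000020027568\tplaceholder\n"
    "mdp.39015000000001\tplaceholder\n"
    "mdp.39015000000002\tplaceholder\n"
)

def count_by_prefix(handle):
    """Count volumes by the library prefix that precedes the first '.' in the ID."""
    counts = Counter()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        prefix = row[0].split(".", 1)[0]
        counts[prefix] += 1
    return counts

counts = count_by_prefix(io.StringIO(sample))
```

For a real file, pass an open file handle instead of the in-memory sample.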

Authors Guild v. HathiTrust

Feature Counts:

Page-level word counts.
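
To make "page-level word counts" concrete, here is a minimal sketch using only the standard library. The two "pages" are invented text, and the whitespace tokenizer is a stand-in assumption, not the tokenization used for the actual feature files.

```python
from collections import Counter

# Hypothetical two-page "volume": invented text, not from any real scan.
pages = [
    "the whale the sea",
    "the ship and the sea",
]

def page_counts(page_text):
    """Count lowercased, whitespace-separated tokens on a single page."""
    return Counter(page_text.lower().split())

# One Counter per page: this is the page-level view.
per_page = [page_counts(p) for p in pages]

# Volume-level counts are just the sum of the page-level counts.
volume = sum(per_page, Counter())
```

Word order within a page is gone, but counts can still be aggregated by page, section, or whole volume.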

Feature count data: ~16,000,000 books, accessed through rsync in a pairtree format.

This takes about 4TB of storage.
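
A sketch of how a volume ID might map to a pairtree path. The assumptions here: the part after the library prefix is cleaned with the three pairtree character substitutions, split into two-character directories, and the file is named with a `.json.bz2` suffix. The function name is hypothetical, and the full pairtree spec also hex-encodes some characters, which this sketch skips; verify the layout against the feature-file documentation before use.

```python
import posixpath

def htid_to_pairtree_path(htid):
    """Build a relative pairtree-style path for a HathiTrust volume ID (sketch)."""
    # Split "pst.000020027568" into library prefix and local volume ID.
    lib, vol = htid.split(".", 1)
    # Pairtree single-character substitutions (hex-encoding step omitted).
    clean = vol.replace("/", "=").replace(":", "+").replace(".", ",")
    # Two-character directory segments, e.g. "00", "00", "20", ...
    segments = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    # Assumed file naming: <lib>.<clean id>.json.bz2
    filename = f"{lib}.{clean}.json.bz2"
    return posixpath.join(lib, "pairtree_root", *segments, clean, filename)

path = htid_to_pairtree_path("pst.000020027568")
```

Under these assumptions the example ID resolves to `pst/pairtree_root/00/00/20/02/75/68/000020027568/pst.000020027568.json.bz2`.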

Specific IDs

Python: HTRC Feature Reader (Organisciak et al.). Powerful.

conda install htrc-feature-reader
pip install git+https://github.com/massivetexts/htrc-feature-reader@caching

R: “Hathidy.” Schmidt.

if (!require("remotes")) install.packages("remotes")
remotes::install_github("HumanitiesDataAnalysis/hathidy")

JavaScript/Online. Suitable for quick looks.

[https://observablehq.com/@bmschmidt/book-visualizations-sandbox]

[https://observablehq.com/@bmschmidt/some-notes-on-the-statistics-of-defining-characteristic-wo]

Underwood, University of Chicago Press, 2019