Ben Schmidt
2020-02-06
What books have been scanned?
¯\_(ツ)_/¯
16 million is about 6 of these.

Some arbitrary language counts
What OCR errors looked like in 2010


What OCR errors look like from Google today


Hathi’s data model.
Bibliographic Records
LCCN, OCLCid, etc.
Volumes: individual scans.
e.g., ‘pst.000020027568’
www.hathitrust.org/hathifiles
Feature Counts:
Page-level word counts.
Feature count data: ~16,000,000 books, accessed through rsync in a pairtree format.
The full set takes about 4 TB of storage.
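To make the pairtree idea concrete, here is a minimal Python sketch of how a volume ID such as 'pst.000020027568' could map to a relative file path. The function name id_to_pairtree_path, the "pairtree_root" directory name, and the ".json.bz2" suffix are assumptions for illustration, not something stated on these slides; the actual layout on the rsync server may differ, so check the current HTRC documentation before relying on it. Only the three single-character pairtree substitutions are handled here.

```python
def id_to_pairtree_path(htid: str, suffix: str = ".json.bz2") -> str:
    """Map a HathiTrust volume ID to a relative pairtree-style path.

    Hypothetical helper: the directory layout and file suffix are assumed.
    """
    # The part before the first "." names the contributing library.
    library, local_id = htid.split(".", 1)
    # Pairtree filename cleaning (single-character substitutions only):
    # ':' becomes '+', '/' becomes '=', '.' becomes ','.
    clean = local_id.translate(str.maketrans({":": "+", "/": "=", ".": ","}))
    # Split the cleaned ID into two-character directory names.
    shorties = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    filename = f"{library}.{clean}{suffix}"
    return "/".join([library, "pairtree_root", *shorties, clean, filename])


if __name__ == "__main__":
    print(id_to_pairtree_path("pst.000020027568"))
```

A path built this way could be written to a file list and handed to rsync, so that only the volumes of interest are fetched rather than the whole 4 TB collection.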
Specific IDs
Python: HTRC Feature Reader (Organisciak et al.). Powerful.
R: “hathidy” (Schmidt).
if (!require("remotes")) install.packages("remotes")
remotes::install_github("HumanitiesDataAnalysis/hathidy")
JavaScript/online: suitable for quick looks.
[https://observablehq.com/@bmschmidt/book-visualizations-sandbox]
[https://observablehq.com/@bmschmidt/some-notes-on-the-statistics-of-defining-characteristic-wo]