Ben Schmidt
2020-02-06
What books have been scanned?
¯\_(ツ)_/¯
16 million is about 6 of these.
Some arbitrary language counts
What OCR errors looked like in 2010
What OCR errors look like from Google today
Hathi’s data model.
Bibliographic Records
LCCN, OCLCid, etc.
Volumes: individual scans.
e.g., ‘pst.000020027568’
www.hathitrust.org/hathifiles
Feature Counts:
Page-level word counts.
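A minimal sketch of what "page-level word counts" means in practice, assuming each page is represented as a bag of token counts. The `pages` data and the `volume_counts` helper below are invented for illustration and are not the actual file schema; the point is only that volume-level counts are the sum of the per-page counts.

```python
from collections import Counter

# Hypothetical toy data: one Counter of token counts per page.
# This is a simplified stand-in, not the real feature-file layout.
pages = [
    Counter({"the": 3, "whale": 1}),
    Counter({"the": 2, "sea": 1}),
]


def volume_counts(pages):
    """Sum per-page token counts into a single volume-level Counter."""
    total = Counter()
    for page in pages:
        total.update(page)  # Counter.update adds counts rather than replacing them
    return total


totals = volume_counts(pages)
```

Keeping counts at the page level means the aggregation can go either way: sum everything for a whole-book view, or keep pages separate to look at where in a book a word appears.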
Feature count data: ~16,000,000 books, accessed through rsync in a pairtree format.
This takes about 4TB of storage.
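A rough sketch of how a volume ID such as 'pst.000020027568' might map onto a pairtree path for rsync. The function name `pairtree_path`, the `.json.bz2` suffix, and the exact directory layout are assumptions for illustration; the full pairtree specification also hex-encodes some unusual characters, which this sketch skips.

```python
def pairtree_path(htid: str, suffix: str = ".json.bz2") -> str:
    """Build an assumed pairtree-style relative path for a volume ID.

    Hypothetical helper: the layout and suffix are assumptions, and the
    hex-encoding step of the pairtree spec is deliberately omitted.
    """
    # The part before the first "." is the contributing library's prefix.
    lib, local = htid.split(".", 1)
    # Pairtree's single-character substitutions for filesystem-unsafe characters.
    clean = local.replace("/", "=").replace(":", "+").replace(".", ",")
    # Split the cleaned local ID into two-character directory names.
    pairs = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    return "/".join([lib, "pairtree_root", *pairs, clean, f"{lib}.{clean}{suffix}"])


example = pairtree_path("pst.000020027568")
```

Under these assumptions, a list of such paths could be written to a file and handed to rsync's `--files-from` option to fetch only specific volumes instead of the whole collection.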
Specific IDs
Python: HTRC Feature Reader (Organisciak et al.). Powerful.
R: “Hathidy.” Schmidt.
if (!require("remotes")) install.packages("remotes")
remotes::install_github("HumanitiesDataAnalysis/hathidy")
JavaScript/Online. Suitable for quick looks.
[https://observablehq.com/@bmschmidt/book-visualizations-sandbox]
[https://observablehq.com/@bmschmidt/some-notes-on-the-statistics-of-defining-characteristic-wo]