What are Books About?

2019-11-06

Card Catalogs

1897

Jefferson’s Classification

Organization of cataloging at the Library of Congress, 1909

J. Edgar Hoover, 1916

Young Hoover 

FBI overflow files, 1944

Hoover files 

Scanned Books

A MARC record

Library Cataloging

Turning machine-readable books into machine-read books for classification.

Humans have rich, fuzzy understandings with allowance for uncertainty.

Computers force things into lifeless abstractions.

Humans have rich, fuzzy understandings with allowance for uncertainty.

Bureaucracies force things into lifeless abstractions.

Machine learning models (nowadays) have fuzzy understandings with allowance for uncertainty.

Prediction: Short, computer-readable embeddings of collections items will be an increasingly important shared resource for non-consumptive digital scholarship.

HathiTrust Research Center Extracted Features

15 million books, with page-level word counts.

Libraries in R, JavaScript, and Python.

Online Sandbox

5TB of data. Tens of millions of words in dozens of languages.

What’s the best way to reduce dimensionality?

  1. Principal Components Analysis on the Term-Document matrix
  2. Latent Semantic Indexing
  3. Semantic Hashing, etc.

How do we reduce dimensionality?

  1. PCA on the words we can count.
  2. Top-n words.
  3. Topic models.

SRP steps

  1. Begin with wordcounts.
  2. Take SHA-1 hashes for all words (because SHA-1 is available everywhere).

eg: “bank”

bdd240c8fe7174e6ac1cfdd5282de76eb7ad6815
1011110111010010010000001100100011111110011100010111010011100110101
0110000011100111111011101010100101000001011011110011101101110101101
1110101101011010000001010111101101010000011110100111110111000101001
01101111100111110000100111101100010101101101101110101101101010
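The hashing step can be sketched in a few lines of Python. This is an illustration, not the reference implementation: the helper name `word_bits` is made up, and it assumes UTF-8 encoding with no salt.

```python
import hashlib

def word_bits(word: str, n_bits: int = 160) -> str:
    """SHA-1 hash a word and expand the digest into a binary string."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)  # one byte -> eight bits
    return bits[:n_bits]

print(hashlib.sha1("bank".encode("utf-8")).hexdigest())  # 40 hex characters
print(word_bits("bank"))                                 # 160 binary digits
```

One SHA-1 digest yields 160 bits, i.e. enough sign choices for 160 dimensions; more dimensions require re-hashing with a salt appended to the word.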

SRP steps

  3. Cast the binary hash to [1, -1] (Achlioptas), generating a reproducible quasi-random projection matrix that preserves document-wise distances in the reduced space.

Put formally

$D$ = Document, $i$ = SRP dimension, $W$ = Vocab size, $h$ = binary SHA hash, $w$ = list of vocabulary, $c$ = counts of vocabulary:

$$\mathrm{SRP}_i(D) = \sum_{j=1}^{W} \left(2\,h_i(w_j) - 1\right)\, c_j$$

Put informally.

  • Each dimension is the sum of the wordcounts for a random half of the words, minus the sum of the wordcounts for the other half.
  • Documents with similar vocabulary will be close on all of the dimensions.

A final trick

Word counts are log transformed so common words don’t overwhelm dimensions.
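Put together, the whole pipeline fits in a short sketch. This is an illustration under stated assumptions, not the HTRC implementation: the function name `srp_embed`, the 160-dimension default (one SHA-1 digest's worth of bits), and the `log(1 + c)` damping are all choices made here for clarity.

```python
import hashlib
import math

def srp_embed(wordcounts: dict, n_dims: int = 160) -> list:
    """Stable Random Projection sketch.

    Each word's SHA-1 bits assign it +1 or -1 on every dimension;
    log-damped counts are then summed, so each dimension is roughly
    the counts of a random half of the vocabulary minus the other half.
    n_dims must be <= 160 here, since we use a single unsalted digest.
    """
    vec = [0.0] * n_dims
    for word, count in wordcounts.items():
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        bits = "".join(f"{b:08b}" for b in digest)
        weight = math.log(1 + count)  # damp common words
        for i in range(n_dims):
            vec[i] += weight if bits[i] == "1" else -weight
    return vec
```

Because the signs come from a fixed hash rather than a stored matrix, anyone can compute the same embedding for the same wordcounts: two documents that share most of their vocabulary end up with nearby vectors, with no coordination beyond agreeing on SHA-1.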

Zoomable Visualization

Fiction Bibliography

How-to

Subsections

Reproducing Classifications

Classifier suites:

  1. Re-usable batch training code in TensorFlow.

  2. One-hidden-layer neural networks can help transfer metadata between corpora.

  3. Protocol: 90% training, 5% validation, 5% test.

  4. Books only (no serials).

  5. All languages at once.
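The 90/5/5 protocol is easy to make reproducible with a fixed shuffle seed. A stdlib sketch (the function name and seed are illustrative, not from the project) splitting a list of record IDs:

```python
import random

def split_90_5_5(items, seed=0):
    """Shuffle reproducibly, then cut into 90% train / 5% validation / 5% test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed -> same split every run
    n_train = int(len(items) * 0.90)
    n_val = int(len(items) * 0.05)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_90_5_5(range(1000))
print(len(train), len(val), len(test))  # 900 50 50
```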

Classifiers trained on Hathi metadata can predict:

  1. Language.
  2. Authorship among the top 1,000 authors with >95% accuracy. (Too good to be true.)
  3. Presence of multiple subject-heading components (eg: ‘650z: Canada – Quebec – Montreal’) with ~50% precision and ~30% recall.
  4. Year of publication, with a median error of ~4 years.

Library of Congress Classification

  • Shelf locations of books.
  • Widely used by research libraries in United States.
  • ~220 “subclasses” at first level of resolution.

Instances  Class  Name (randomly sampled from full population)
      461  AI     [Periodical] Indexes
     6986  BD     Speculative philosophy
     9311  BJ     Ethics
    40335  DC     [History of] France - Andorra - Monaco
     2738  DJ     [History of the] Netherlands (Holland)
    14928  G      GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class]
    17353  HN     Social history and conditions. Social problems. Social reform
     4703  JV     Colonies and colonization. Emigration and immigration. International migration
       23  KB     Religious law in general. Comparative religious law. Jurisprudence
     5583  LD     [Education:] Individual institutions - United States
     3496  NX     Arts in general
     6222  PF     West Germanic languages
    68144  PG     Slavic languages and literatures. Baltic languages. Albanian language
   157246  PQ     French literature - Italian literature - Spanish literature - Portuguese literature
     6863  RJ     Pediatrics

Chihuahua or Muffin

Misclassifications

  • Actual: HV: Social and Public Welfare/Criminology -> Welfare -> Protection, Assistance, Relief -> Special classes.
  • Algorithm: US Law.

Misclassifications: mdp.39015005002905

 

  • Actual LC Classification According to Hathi: AC 277 (Undefined)
  • Algorithm: DC (French History)
  • Shelf Location in Michigan: HC 277 (Economic History, France)

Misclassifications: uva.x000423222

 

  • Actual LC Classification: BF1613 Magic (White and Black). Occult Sciences -> Shamanism. Hermetics. Necromancy -> General Works, German, post-1800
  • Algorithm: BP: Theosophy, etc. BP595: Works by and about Rudolf Steiner.

Misclassifications

 

  • Actual LC Classification: QH1.A43 (Natural History)
  • Algorithm: QE (Geology)

Actual LC Classification: QB63.B5 1927

Bacteriology 

  • QB 63: Astronomy -> Stargazer’s guides.
  • QR 63: Microbiology -> Laboratory manuals

Misclassifications

  • Actual LC Classification: BF 1611: Magic (White and Black). Shamanism. Hermetics. Necromancy -> General Works.
  • Algorithm: HS. (HS445.A2: Freemasons -> Masonic Law -> By Region or Country -> United States -> By State -> Constitutions.)

Classifier online.

http://benschmidt.org/static_class/

Fissures in classification

  1. Can one speak meaningfully of the naturalness of a classifier?

  2. Is there something to be gained from preserving old hierarchical schemes against the flexible schemes that have replaced them?

  3. Is it OK to work with the digital library before we understand what’s in it?