2019-11-06
J. Edgar Hoover, 1916
FBI overflow files, 1944
A MARC record
<record>
<leader>00820nam a22002291 4500</leader>
<controlfield tag="001">006496938</controlfield>
<controlfield tag="003">MiAaHDL</controlfield>
<controlfield tag="005">20130926000000.0</controlfield>
<controlfield tag="006">m d </controlfield>
<controlfield tag="007">cr bn ---auaua</controlfield>
<controlfield tag="008">880505s1927 ksu 00110 eng </controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a"> 27024000</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">sdr-nrlfGLAD17073443-B</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">(OCoLC)6046903</subfield>
</datafield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">DLC</subfield>
<subfield code="c">OKN</subfield>
<subfield code="d">CUY</subfield>
<subfield code="d">ZEPHIR</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2=" ">
<subfield code="a">QB63</subfield>
<subfield code="b">.B5 1927</subfield>
</datafield>
<datafield tag="090" ind1=" " ind2=" ">
<subfield code="a"> QR63</subfield>
<subfield code="b">.B5</subfield>
</datafield>
(etc...)
Turning machine-readable books into machine-read books for classification.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Computers force things into lifeless abstractions.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Bureaucracies force things into lifeless abstractions.
Machine learning models (nowadays) have fuzzy understandings with allowance for uncertainty.
Prediction: Short, computer-readable embeddings of collections items will be an increasingly important shared resource for non-consumptive digital scholarship.
HathiTrust Research Center Extracted Features
15 million books, page-level token counts.
Libraries in R, JavaScript, and Python.
5TB of data. Tens of millions of words in dozens of languages.
What’s the best way to reduce dimensionality?
How do we reduce dimensionality?
SRP steps
(Because SHA-1 is available everywhere).
e.g., “bank”
bdd240c8fe7174e6ac1cfdd5282de76eb7ad6815
1011110111010010010000001100100011111110011100010111010011100110101
0110000011100111111011101010100101000001011011110011101101110101101
1110101101011010000001010111101101010000011110100111110111000101001
01101111100111110000100111101100010101101101101110101101101010
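The expansion from hex digest to bit string can be reproduced in a few lines of Python (a sketch; the talk's own tooling is not shown):

```python
import hashlib

def word_bits(word):
    """Return the 160 bits of a word's SHA-1 hash as a '0'/'1' string."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return "".join(format(byte, "08b") for byte in digest)

bits = word_bits("bank")  # deterministic: same word, same bits, on any machine
```

Determinism is the point: because SHA-1 is standardized, everyone who hashes the same vocabulary derives the same projection, with no matrix to distribute.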
SRP steps
Cast the binary hash to [1, -1] (Achlioptas), generating a reproducible quasi-random projection matrix that preserves document-wise distances in the reduced space.
Put formally:
D = document, i = SRP dimension, W = vocabulary size, h = binary SHA-1 hash, w = list of vocabulary, c = counts of vocabulary.
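From those definitions, the per-dimension formula can be written along these lines (a reconstruction consistent with the cast-to-[1, -1] and log-transform steps, not the slide's own typesetting):

```latex
\mathrm{SRP}_i(D) = \sum_{j=1}^{W} \bigl(2\,h_i(w_j) - 1\bigr)\,\log c_j
```

Here $h_i(w_j) \in \{0, 1\}$ is the $i$-th bit of the SHA-1 hash of vocabulary word $w_j$, so $2h_i(w_j) - 1$ is the $\pm 1$ cast, and $c_j$ is that word's count in document $D$.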
Put informally.
A final trick
Word counts are log-transformed so that common words don’t overwhelm individual dimensions.
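Putting the steps together, a minimal self-contained sketch (assumptions: natural log, and 160 dimensions from a single SHA-1 hash per word; released SRP implementations may differ in log base, dimensionality, and how they extend past 160 dimensions):

```python
import hashlib
import math

def srp_embed(counts, dims=160):
    """Project a bag of words {word: count} into a dims-length SRP vector.

    Each word adds +log(count) or -log(count) to dimension i, with the
    sign taken from bit i of the word's SHA-1 hash. Note log(1) == 0,
    so hapax words contribute nothing under this plain-log assumption.
    """
    assert dims <= 160, "a single SHA-1 hash only yields 160 bits"
    vec = [0.0] * dims
    for word, count in counts.items():
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        weight = math.log(count)
        for i in range(dims):
            bit = (digest[i // 8] >> (7 - i % 8)) & 1  # i-th bit, MSB first
            vec[i] += weight if bit else -weight
    return vec
```

Since every term enters additively, a document's embedding is just the sum of its words' signed log-weights, and anyone hashing the same vocabulary reproduces the identical projection.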
Classifier suites:
Re-usable batch training code in TensorFlow.
One-hidden-layer neural networks can help transfer metadata between corpora.
Protocol: 90% training, 5% validation, 5% test.
Books only (no serials).
All languages at once.
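The 90/5/5 protocol can be sketched as a reproducible split (a hypothetical helper, not the talk's TensorFlow batch-training code):

```python
import random

def split_90_5_5(volume_ids, seed=0):
    """Shuffle ids reproducibly, then cut 90% train / 5% validation / 5% test."""
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)  # fixed seed => same split every run
    n_train = int(len(ids) * 0.90)
    n_val = int(len(ids) * 0.05)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```

Keying the split to a seed keeps the same volumes in the same partition across retraining runs, which matters when comparing classifier suites.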
Classifiers trained on Hathi metadata can predict:
Library of Congress Classification
Instances | Class name (randomly sampled from full population)
---|---
461 | AI [Periodical] Indexes |
6986 | BD Speculative philosophy |
9311 | BJ Ethics |
40335 | DC [History of] France - Andorra - Monaco |
2738 | DJ [History of the] Netherlands (Holland) |
14928 | G GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class] |
17353 | HN Social history and conditions. Social problems. Social reform |
4703 | JV Colonies and colonization. Emigration and immigration. International migration |
23 | KB Religious law in general. Comparative religious law. Jurisprudence |
5583 | LD [Education:] Individual institutions - United States |
3496 | NX Arts in general |
6222 | PF West Germanic languages |
68144 | PG Slavic languages and literatures. Baltic languages. Albanian language |
157246 | PQ French literature - Italian literature - Spanish literature - Portuguese literature |
6863 | RJ Pediatrics |
Misclassifications
Misclassifications: mdp.39015005002905
Misclassifications: uva.x000423222
Misclassifications
Actual LC Classification: QB63.B5 1927
Classifier online.
Fissures in classification
Can one speak meaningfully of the naturalness of a classifier?
Is there something to be gained from preserving old hierarchical schemes against the flexible schemes that have replaced them?
Is it OK to work with the digital library before we understand what’s in it?