2019-11-06
J. Edgar Hoover, 1916
FBI overflow files, 1944
A MARC record
<record>
<leader>00820nam a22002291 4500</leader>
<controlfield tag="001">006496938</controlfield>
<controlfield tag="003">MiAaHDL</controlfield>
<controlfield tag="005">20130926000000.0</controlfield>
<controlfield tag="006">m d </controlfield>
<controlfield tag="007">cr bn ---auaua</controlfield>
<controlfield tag="008">880505s1927 ksu 00110 eng </controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a"> 27024000</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">sdr-nrlfGLAD17073443-B</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">(OCoLC)6046903</subfield>
</datafield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">DLC</subfield>
<subfield code="c">OKN</subfield>
<subfield code="d">CUY</subfield>
<subfield code="d">ZEPHIR</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2=" ">
<subfield code="a">QB63</subfield>
<subfield code="b">.B5 1927</subfield>
</datafield>
<datafield tag="090" ind1=" " ind2=" ">
<subfield code="a"> QR63</subfield>
<subfield code="b">.B5</subfield>
</datafield>
(etc...)
Turning machine-readable books into machine-read books for classification.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Computers force things into lifeless abstractions.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Bureaucracies force things into lifeless abstractions.
Machine learning models (nowadays) have fuzzy understandings with allowance for uncertainty.
Prediction: Short, computer-readable embeddings of collections items will be an increasingly important shared resource for non-consumptive digital scholarship.
HathiTrust Research Center Extracted Features
15 million books, page-level token counts.
Libraries in R, JavaScript, and Python.
5TB of data. Tens of millions of words in dozens of languages.
What’s the best way to reduce dimensionality?
How do we reduce dimensionality?
SRP steps
(Because SHA-1 is available everywhere).
e.g., “bank”
bdd240c8fe7174e6ac1cfdd5282de76eb7ad6815
1011110111010010010000001100100011111110011100010111010011100110101
0110000011100111111011101010100101000001011011110011101101110101101
1110101101011010000001010111101101010000011110100111110111000101001
01101111100111110000100111101100010101101101101110101101101010
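The expansion from hex digest to bit string can be reproduced in a few lines of Python (a sketch; the talk's own tooling is not shown):

```python
import hashlib

def word_bits(word):
    """Return the 160 bits of a word's SHA-1 hash as a '0'/'1' string."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return "".join(format(byte, "08b") for byte in digest)

bits = word_bits("bank")  # deterministic: same word, same bits, on any machine
```

Determinism is the point: because SHA-1 is standardized, everyone who hashes the same vocabulary derives the same projection, with no matrix to distribute.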
SRP steps
Cast the binary hash to [1, -1] (Achlioptas), generating a reproducible quasi-random projection matrix that preserves document-wise distances in the reduced space.
Put formally:
D = document, i = SRP dimension, W = vocabulary size, h = binary SHA-1 hash, w = list of vocabulary, c = counts of vocabulary.
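From those definitions, the per-dimension formula can be written along these lines (a reconstruction consistent with the cast-to-[1, -1] and log-transform steps, not the slide's own typesetting):

```latex
\mathrm{SRP}_i(D) = \sum_{j=1}^{W} \bigl(2\,h_i(w_j) - 1\bigr)\,\log c_j
```

Here $h_i(w_j) \in \{0, 1\}$ is the $i$-th bit of the SHA-1 hash of vocabulary word $w_j$, so $2h_i(w_j) - 1$ is the $\pm 1$ cast, and $c_j$ is that word's count in document $D$.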
Put informally.
A final trick
Word counts are log-transformed so that common words don’t overwhelm individual dimensions.
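Putting the steps together, a minimal self-contained sketch (assumptions: natural log, and 160 dimensions from a single SHA-1 hash per word; released SRP implementations may differ in log base, dimensionality, and how they extend past 160 dimensions):

```python
import hashlib
import math

def srp_embed(counts, dims=160):
    """Project a bag of words {word: count} into a dims-length SRP vector.

    Each word adds +log(count) or -log(count) to dimension i, with the
    sign taken from bit i of the word's SHA-1 hash. Note log(1) == 0,
    so hapax words contribute nothing under this plain-log assumption.
    """
    assert dims <= 160, "a single SHA-1 hash only yields 160 bits"
    vec = [0.0] * dims
    for word, count in counts.items():
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        weight = math.log(count)
        for i in range(dims):
            bit = (digest[i // 8] >> (7 - i % 8)) & 1  # i-th bit, MSB first
            vec[i] += weight if bit else -weight
    return vec
```

Since every term enters additively, a document's embedding is just the sum of its words' signed log-weights, and anyone hashing the same vocabulary reproduces the identical projection.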
Classifier suites:
Re-usable batch training code in TensorFlow.
One-hidden-layer neural networks can help transfer metadata between corpora.
Protocol: 90% training, 5% validation, 5% test.
Books only (no serials).
All languages at once.
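The 90/5/5 protocol can be sketched as a reproducible split (a hypothetical helper, not the talk's TensorFlow batch-training code):

```python
import random

def split_90_5_5(volume_ids, seed=0):
    """Shuffle ids reproducibly, then cut 90% train / 5% validation / 5% test."""
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)  # fixed seed => same split every run
    n_train = int(len(ids) * 0.90)
    n_val = int(len(ids) * 0.05)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```

Keying the split to a seed keeps the same volumes in the same partition across retraining runs, which matters when comparing classifier suites.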
Classifiers trained on Hathi metadata can predict:
Library of Congress Classification
Instances | Class name (randomly sampled from full population)
---|---
461 | AI [Periodical] Indexes |
6986 | BD Speculative philosophy |
9311 | BJ Ethics |
40335 | DC [History of] France - Andorra - Monaco |
2738 | DJ [History of the] Netherlands (Holland) |
14928 | G GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class] |
17353 | HN Social history and conditions. Social problems. Social reform |
4703 | JV Colonies and colonization. Emigration and immigration. International migration |
23 | KB Religious law in general. Comparative religious law. Jurisprudence |
5583 | LD [Education:] Individual institutions - United States |
3496 | NX Arts in general |
6222 | PF West Germanic languages |
68144 | PG Slavic languages and literatures. Baltic languages. Albanian language |
157246 | PQ French literature - Italian literature - Spanish literature - Portuguese literature |
6863 | RJ Pediatrics |
Misclassifications
Misclassifications: mdp.39015005002905
Misclassifications: uva.x000423222
Misclassifications
Actual LC Classification: QB63.B5 1927
Classifier online.
Fissures in classification
Can one speak meaningfully of the naturalness of a classifier?
Is there something to be gained from preserving old hierarchical schemes against the flexible schemes that have replaced them?
Is it OK to work with the digital library before we understand what’s in it?