What do 21st-century algorithms make out of 19th-century collections?
2018-01-28
Outline
Who’s here?
Climate metadata, 1789-c.1860
icoads.noaa.gov
1848 6 1 3723 29038 02 4 10ISABE*_N 1 5
165 20779701 69 5 0 1 FFFFFF77AAAAAAAAAAAA 99 0 790044118480601
3714N 6937W NW 51 NW 57 NW
51 201A.STEWART NEW BEDFORD WHALING VOYAGE 2620 199
Digitization is a multi-step process.
Matthew Maury
Confederate Navy engraving, 1862, from http://www.history.navy.mil/library/online/maury_mat_bene.htm
Abstract Logbooks
The expansion of whaling
Logbook Digitization in the 1920s
Wallbrink, H. and F.B. Koek, Data Acquisition And Keypunching Codes For Marine Meteorological Observations At The Royal Netherlands Meteorological Institute, 1854–1968
Digitized logbooks, c. 1930
Wallbrink and Koek
Alphabet Cities
Murdoch’s album of science, via Scott Weingart
Google Ngrams
J. Edgar Hoover, 1916
FBI overflow files, 1944
A MARC record
<record>
<leader>00820nam a22002291 4500</leader>
<controlfield tag="001">006496938</controlfield>
<controlfield tag="003">MiAaHDL</controlfield>
<controlfield tag="005">20130926000000.0</controlfield>
<controlfield tag="006">m d </controlfield>
<controlfield tag="007">cr bn ---auaua</controlfield>
<controlfield tag="008">880505s1927 ksu 00110 eng </controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a"> 27024000</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">sdr-nrlfGLAD17073443-B</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">(OCoLC)6046903</subfield>
</datafield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">DLC</subfield>
<subfield code="c">OKN</subfield>
<subfield code="d">CUY</subfield>
<subfield code="d">ZEPHIR</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2=" ">
<subfield code="a">QB63</subfield>
<subfield code="b">.B5 1927</subfield>
</datafield>
<datafield tag="090" ind1=" " ind2=" ">
<subfield code="a"> QR63</subfield>
<subfield code="b">.B5</subfield>
</datafield>
(etc...)
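The call-number fields in a record like this can be read programmatically. The following is a minimal sketch, assuming a trimmed and properly closed copy of the record above; the helper name `call_number` is hypothetical, and only Python's standard `xml.etree.ElementTree` module is used. In MARC, field 050 holds the Library of Congress call number and field 090 a locally assigned one.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the record shown above, closed so that it parses.
MARC_XML = """<record>
  <datafield tag="050" ind1="0" ind2=" ">
    <subfield code="a">QB63</subfield>
    <subfield code="b">.B5 1927</subfield>
  </datafield>
  <datafield tag="090" ind1=" " ind2=" ">
    <subfield code="a"> QR63</subfield>
    <subfield code="b">.B5</subfield>
  </datafield>
</record>"""


def call_number(record, tag):
    """Join the subfields of the first datafield with this tag, or return None."""
    field = record.find(f"datafield[@tag='{tag}']")
    if field is None:
        return None
    # Subfield text can carry stray padding (note the leading space in 090$a).
    return "".join((sub.text or "").strip() for sub in field.findall("subfield"))


record = ET.fromstring(MARC_XML)
lc_call = call_number(record, "050")     # Library of Congress call number
local_call = call_number(record, "090")  # locally assigned call number
print(lc_call, "|", local_call)
```

Reading both fields side by side makes it easy to see that the two call numbers in this record start with different class letters (QB in 050, QR in 090).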
Self-advancing number stamps
Turning machine-readable books into machine-read books for classification.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Computers force things into lifeless abstractions.
Humans have rich, fuzzy understandings with allowance for uncertainty.
Bureaucracies force things into lifeless abstractions.
Computers (nowadays) have fairly rich, fuzzy understandings with allowance for uncertainty.
Prediction: Short, computer-readable embeddings of collection items will be an increasingly important shared resource for non-consumptive digital scholarship.
Rather than full text, a new method I’m calling “Stable Random Projection”:
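The slide names the method without spelling it out, so the following is only a rough sketch of how a stable, hash-based random projection of a text could work, not the method as actually implemented. Everything specific here is an assumption made for illustration: whitespace tokenization, SHA-1 as the hash, 160 dimensions, and log weighting of word counts. The key property is that each word's vector is derived from a hash of the word itself, so any two people embedding any two collections get comparable vectors without sharing a vocabulary or the full text.

```python
import hashlib
import math
from collections import Counter

DIMS = 160  # illustrative choice: one SHA-1 digest yields exactly 160 bits


def word_signs(word):
    """Map a word to a fixed vector of +1/-1 values taken from its SHA-1 bits."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    bits = int.from_bytes(digest, "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(DIMS)]


def embed(text):
    """Sum each word's sign vector, weighted by log(1 + count)."""
    counts = Counter(text.lower().split())
    vector = [0.0] * DIMS
    for word, n in counts.items():
        weight = math.log1p(n)
        for i, sign in enumerate(word_signs(word)):
            vector[i] += weight * sign
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


logbook = embed("wind nw whaling voyage new bedford")
print(len(logbook), cosine(logbook, embed("whaling voyage from new bedford")))
```

Because the projection depends only on the hash function, the same text always maps to the same short vector, which is what would let such embeddings be shared in place of the text itself.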
Classifier suites:
Re-usable batch training code in TensorFlow.
One-hidden-layer neural networks can help transfer metadata between corpora.
Protocol: 90% training, 5% validation, 5% test.
Books only (no serials).
All languages at once.
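The 90/5/5 protocol above can be sketched as a deterministic split keyed on the volume identifier. This is a minimal illustration under stated assumptions: hashing the identifier with MD5 is a choice made here, not something the slides specify, and `assign_split` is a hypothetical helper name.

```python
import hashlib


def assign_split(volume_id):
    """Assign an identifier to 'train', 'validation', or 'test' at roughly 90/5/5."""
    digest = hashlib.md5(volume_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # approximately uniform over 0..99
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "validation"
    return "test"


for volume_id in ["mdp.39015005002905", "uva.x000423222"]:
    print(volume_id, assign_split(volume_id))
```

Keying the split on a hash of the identifier, rather than on a random shuffle, means a volume lands in the same partition every time the training code is re-run.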
Classifiers trained on Hathi metadata can predict:
Library of Congress Classification
| Instances | Class name (randomly sampled from full population) |
|---|---|
| 461 | AI [Periodical] Indexes |
| 6986 | BD Speculative philosophy |
| 9311 | BJ Ethics |
| 40335 | DC [History of] France - Andorra - Monaco |
| 2738 | DJ [History of the] Netherlands (Holland) |
| 14928 | G GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class] |
| 17353 | HN Social history and conditions. Social problems. Social reform |
| 4703 | JV Colonies and colonization. Emigration and immigration. International migration |
| 23 | KB Religious law in general. Comparative religious law. Jurisprudence |
| 5583 | LD [Education:] Individual institutions - United States |
| 3496 | NX Arts in general |
| 6222 | PF West Germanic languages |
| 68144 | PG Slavic languages and literatures. Baltic languages. Albanian language |
| 157246 | PQ French literature - Italian literature - Spanish literature - Portuguese literature |
| 6863 | RJ Pediatrics |
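The classifiers that assign class codes like these are described in the talk as one-hidden-layer neural networks. The following sketch shows only the shape of such a model's forward pass, in plain Python rather than TensorFlow. The layer sizes, the ReLU activation, the five-label subset, and the random (untrained) weights are all assumptions made for illustration, so the predicted label is meaningless until the weights are trained.

```python
import math
import random

LABELS = ["BD", "BJ", "DC", "PQ", "RJ"]  # a few class codes from the table above


def init_layer(rng, n_in, n_out):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Compute weights @ x + biases for one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, hidden_layer, output_layer):
    """Input features -> one hidden layer (ReLU) -> softmax over class labels."""
    hidden = [max(0.0, h) for h in dense(x, hidden_layer)]
    probs = softmax(dense(hidden, output_layer))
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs


rng = random.Random(0)
N_FEATURES, N_HIDDEN = 8, 16  # hypothetical sizes
hidden_layer = init_layer(rng, N_FEATURES, N_HIDDEN)
output_layer = init_layer(rng, N_HIDDEN, len(LABELS))
x = [rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]  # stand-in for a book embedding
label, probs = predict(x, hidden_layer, output_layer)
print(label, [round(p, 3) for p in probs])
```

The softmax output gives a probability for every class rather than a single hard assignment, which is one concrete sense in which such a classifier carries an allowance for uncertainty.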
Misclassifications
Misclassifications: mdp.39015005002905
Misclassifications: uva.x000423222
Misclassifications
Actual LC Classification: QB63.B5 1927
<record>
<leader>00820nam a22002291 4500</leader>
<controlfield tag="001">006496938</controlfield>
<controlfield tag="003">MiAaHDL</controlfield>
<controlfield tag="005">20130926000000.0</controlfield>
<controlfield tag="006">m d </controlfield>
<controlfield tag="007">cr bn ---auaua</controlfield>
<controlfield tag="008">880505s1927 ksu 00110 eng </controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a"> 27024000</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">sdr-nrlfGLAD17073443-B</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">(OCoLC)6046903</subfield>
</datafield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">DLC</subfield>
<subfield code="c">OKN</subfield>
<subfield code="d">CUY</subfield>
<subfield code="d">ZEPHIR</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2=" ">
<subfield code="a">QB63</subfield>
<subfield code="b">.B5 1927</subfield>
</datafield>
<datafield tag="090" ind1=" " ind2=" ">
<subfield code="a"> QR63</subfield>
<subfield code="b">.B5</subfield>
</datafield>
(etc...)
Classifier online.
Fissures in classification
Can one speak meaningfully of the naturalness of a classifier?
Is there something to be gained from preserving old hierarchical schemes against the flexible schemes that have replaced them?
Is it OK to work with the digital library before we understand what’s in it?