Reordering the digital library

What do 21st-century algorithms make of 19th-century collections?

2018-01-28

Outline

  1. Creating Data.
  2. What is the digital library archive?
  3. Rethinking classifications.

Who’s here?

  1. Humanities
  2. Library Science
  3. Statistics/Data/Methodological

Creating Data

Climate metadata, 1789-c.1860

 

icoads.noaa.gov

Whaling Logbooks

1848 6 1     3723 29038 02 4    10ISABE*_N   1   5                                                           
165 20779701 69 5 0 1                  FFFFFF77AAAAAAAAAAAA     99 0 790044118480601
3714N 6937W                                                                           NW     51 NW     57 NW   
51                                          201A.STEWART       NEW BEDFORD             WHALING V
OYAGE           2620 199

Digitization is a multi-step process.

  1. Filtering
  2. Abstraction
  3. Representation (Visualization, Tabulation)

Matthew Maury

 

Confederate Navy Engraving 1862, from http://www.history.navy.mil/library/online/maury_mat_bene.htm

Abstract Logbooks

 

Harvard University Library

Deck 701, US Maury Collection (1789-c.1865)

Classification

The expansion of whaling

Deck 701

Deck 892, US shipping 1980-1997

Logbook Digitization in the 1920s

 

Wallbrink, H. and F.B. Koek, Data Acquisition And Keypunching Codes For Marine Meteorological Observations At The Royal Netherlands Meteorological Institute, 1854–1968

Digitized logbooks, c. 1930

 

Wallbrink and Koek

ICOADS Deck 720, German weather data, 1876-1914

ICOADS Deck 735, Soviet Research Vessels

Closeup of Deck 735. Soviet Vessels near the coast of South America.

ICOADS Deck 735, Russian Research Vessel (R/V) Digitisation

German Deep Drifter Data

Interactivity

1870 Census atlas, detail

Census frontiers (red) vs county boundaries

Alphabet Cities


 

Murdoch’s album of science, via Scott Weingart

Jefferson’s Classification

Organization of cataloging at the Library of Congress, 1909

Organization of cataloging at the Library of Congress, 1909

Understanding the Library

Google Ngrams

02138

Google Ngrams

Books awaiting deposit, c. 1897

Linnean cards

J. Edgar Hoover, 1916

Young Hoover 

FBI overflow files, 1944

Hoover files 

Getty Images/Mashable

A MARC record

Library Cataloging

Gatsby

Gatsby

Self-advancing number stamps

 

Making the Library Legible

Turning machine-readable books into machine-read books for classification.

Humans have rich, fuzzy understandings with allowance for uncertainty.

Computers force things into lifeless abstractions.

Humans have rich, fuzzy understandings with allowance for uncertainty.

Bureaucracies force things into lifeless abstractions.

Computers (nowadays) have fairly rich, fuzzy understandings with allowance for uncertainty.

Prediction: Short, computer-readable embeddings of collection items will be an increasingly important shared resource for non-consumptive digital scholarship.

Rather than full text, a new method I’m calling “Stable Random Projection”:

  • Turn each book into 1280 numbers based on its words.
  • Random projection of log word counts.
  • Unlike other dimensionality reductions, it can work on all languages simultaneously.
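
The three bullets above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the slides give only the output size (1280) and the idea of randomly projecting log word counts. Deriving each word's +1/-1 direction from a SHA-256 hash, splitting on whitespace, and using log(1 + n) as the weight are all choices made here for the example.

```python
import hashlib
import math
from collections import Counter

DIMS = 1280  # embedding length quoted in the slides


def word_signs(word, dims=DIMS):
    """Derive a fixed +1/-1 vector for a word from SHA-256 (illustrative choice)."""
    signs = []
    block = 0
    while len(signs) < dims:
        digest = hashlib.sha256(f"{block}:{word}".encode("utf-8")).digest()
        for byte in digest:
            for shift in range(8):
                signs.append(1.0 if (byte >> shift) & 1 else -1.0)
        block += 1
    return signs[:dims]


def embed(text, dims=DIMS):
    """Project log-scaled word counts onto `dims` hashed random directions."""
    counts = Counter(text.lower().split())
    vector = [0.0] * dims
    for word, n in counts.items():
        weight = math.log(1 + n)  # assumption: log(1 + n) keeps words seen once
        for i, sign in enumerate(word_signs(word, dims)):
            vector[i] += weight * sign
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


english_a = embed("the whale swam in the sea")
english_b = embed("the whale swam near the sea")
french = embed("le livre est sur la table")
print(len(english_a), cosine(english_a, english_b) > cosine(english_a, french))
```

Because each word's direction in this sketch comes from a hash rather than a fitted vocabulary, texts in any language land in the same 1280-dimensional space without a shared dictionary.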

Zoomable Visualization

Reproducing Classifications

Classifier suites:

  1. Re-usable batch training code in TensorFlow.

  2. One-hidden-layer neural networks can help transfer metadata between corpora.

  3. Protocol: 90% training, 5% validation, 5% test.

  4. Books only (no serials).

  5. All languages at once.
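
The 90% / 5% / 5% protocol in item 3 can be sketched as a seeded shuffle followed by two cuts. The function name, the seed, and the placeholder identifiers are hypothetical. The slides do not say how the actual split was drawn.

```python
import random


def split_90_5_5(item_ids, seed=0):
    """Shuffle ids reproducibly, then cut into train / validation / test."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 90 // 100  # integer arithmetic avoids float rounding
    n_val = len(ids) * 5 // 100
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains, about 5%
    return train, validation, test


# Placeholder identifiers standing in for volume ids.
volumes = [f"vol{i:04d}" for i in range(1000)]
train, validation, test = split_90_5_5(volumes)
print(len(train), len(validation), len(test))
```

Fixing the seed keeps the held-out test set identical across training runs, so reported accuracies stay comparable.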

Classifiers trained on Hathi metadata can predict:

  1. Language
  2. Authorship among the top 1,000 authors with > 95% accuracy (too good to be true).
  3. Presence of multiple subject heading components (e.g., ‘650z: Canada – Quebec – Montreal’) with ~50% precision and ~30% recall.
  4. Year of publication, with a median error of ~4 years.
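
For reference, the metrics quoted above can be computed as follows. This is a minimal sketch with invented toy data. The function names are hypothetical, and the slides do not specify how the figures were aggregated across books.

```python
from statistics import median


def precision_recall(predicted, actual):
    """Precision and recall of a predicted label set against the true labels."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def median_abs_error(predicted_years, true_years):
    """Median of the absolute differences between predicted and true years."""
    return median(abs(p - t) for p, t in zip(predicted_years, true_years))


# Toy example: one of two predicted place components is right, two are missed.
print(precision_recall(["Canada", "Ontario"], ["Canada", "Quebec", "Montreal"]))
# Toy example: errors of 2, 10 and 0 years.
print(median_abs_error([1850, 1900, 1927], [1848, 1910, 1927]))
```

The median is less sensitive than the mean to a handful of badly misdated volumes, which is one reason to report date errors this way.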

Library of Congress Classification

  • Shelf locations of books.
  • Widely used by research libraries in United States.
  • ~220 “subclasses” at first level of resolution.
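
A classifier at this resolution needs the subclass letters as its target label. The sketch below pulls them out of a call number with a simplified pattern (one to three letters, then an optional number). The pattern and the function name are assumptions for illustration and do not cover every form a real call number can take.

```python
import re

# One to three class letters, optional whitespace, optional class number.
CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")


def parse_lc(call_number):
    """Split a call number into main class, subclass letters, and class number."""
    match = CALL_NUMBER.match(call_number.upper())
    if match is None:
        return None
    letters, number = match.groups()
    return {"main_class": letters[0], "subclass": letters, "number": number}


for example in ["QB63.B5 1927", "BF1613", "HS445.A2", "AC 277"]:
    print(example, parse_lc(example))
```

Only the `subclass` field is used as a label at the first level of resolution. The class number is kept for finer-grained comparisons.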

Instances  Class  Name (randomly sampled from full population)
      461  AI     [Periodical] Indexes
     6986  BD     Speculative philosophy
     9311  BJ     Ethics
    40335  DC     [History of] France - Andorra - Monaco
     2738  DJ     [History of the] Netherlands (Holland)
    14928  G      GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class]
    17353  HN     Social history and conditions. Social problems. Social reform
     4703  JV     Colonies and colonization. Emigration and immigration. International migration
       23  KB     Religious law in general. Comparative religious law. Jurisprudence
     5583  LD     [Education:] Individual institutions - United States
     3496  NX     Arts in general
     6222  PF     West Germanic languages
    68144  PG     Slavic languages and literatures. Baltic languages. Albanian language
   157246  PQ     French literature - Italian literature - Spanish literature - Portuguese literature
     6863  RJ     Pediatrics

Chihuahua or Muffin

Misclassifications

Bacteriology 

  • Actual: HV: Social and Public Welfare/Criminology -> Welfare -> Protection, Assistance, Relief -> Special classes.
  • Algorithm: US Law.

Misclassifications: mdp.39015005002905

 

  • Actual LC Classification According to Hathi: AC 277 (Undefined)
  • Algorithm: DC (French History)
  • Shelf Location in Michigan: HC 277 (Economic History, France)

Misclassifications: uva.x000423222

 

  • Actual LC Classification: BF1613 Magic (White and Black). Occult Sciences -> Shamanism. Hermetics. Necromancy -> General Works, German, post-1800
  • Algorithm: BP: Theosophy, etc. BP595: Works by and about Rudolf Steiner.

Misclassifications

 

  • Actual LC Classification: QH1.A43 (Natural History)
  • Algorithm: QE (Geology)

Actual LC Classification: QB63.B5 1927

Bacteriology 

  • QB 63: Astronomy -> Stargazer’s guides.
  • QR 63: Microbiology -> Laboratory manuals

Actual LC Classification: QB63.B5 1927

Bacteriology 

  • BF 1611: Magic (White and Black). Shamanism. Hermetics. Necromancy -> General Works
  • Algorithm Says: HS
  • HS445.A2: Freemasons -> Masonic Law -> By Region or Country -> United States -> By State -> Constitutions.

Classifier online.

http://benschmidt.org/static_class/

Fissures in classification

  1. Can one speak meaningfully of the naturalness of a classifier?

  2. Is there something to be gained from preserving old hierarchical schemes against the flexible schemes that have replaced them?

  3. Is it OK to work with the digital library before we understand what’s in it?