Thanks
Peter Organisciak (HathiTrust Research Center) for code and comments.
HTRC and National Endowment for the Humanities
Link to these slides:
benschmidt.org/Montreal
Github:
github.com/bmschmidt/SRP (python)
Feature counts:
Tables of word counts for documents.
Distributed by HathiTrust, JStor, etc.
For books, feature counts can be long!
The HathiTrust Extracted Features (EF) dataset is several TB.
Bridging the infrastructure for tokenization with the infrastructure for metadata.
Or: how can we distribute word counts as part of metadata profiles?
Prediction: Short, computer-readable embeddings of collection items will be an increasingly important shared resource for non-consumptive digital scholarship.
Embeddings promote access by opening new interfaces and forms of discoverability.
But embeddings deny access to classes of artifacts and communities that don't resemble their training set.
What's the best way to reduce dimensionality?
How do we reduce dimensionality?
- Trivially parallelizable.
- Allows client-server interactions.
- Preserves privacy and sidesteps copyright concerns.
- Reduces storage footprint.
- Allows different corpora to exist in the same space.
- Supports identifying duplicates.
- Supports metadata extensions.
- Makes large datasets accessible for pedagogy.
Minimal, Universal reductions
Classic dimensionality reduction (PCA, etc.).
Starts with a document with W words.
For each word, learn an N-dimensional representation.
Multiply the document vector by a W×N matrix to get an N-dimensional representation of the document.
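A minimal numpy sketch of that multiplication (the vocabulary, counts, and matrix here are made-up stand-ins, not learned weights):

```python
import numpy as np

# Hypothetical toy vocabulary: W = 4 words, reduced to N = 2 dimensions.
vocab = ["whale", "ship", "sea", "captain"]
counts = np.array([12, 5, 9, 3])            # document vector: one count per word

# A W x N matrix of per-word representations (random stand-in for learned weights).
rng = np.random.default_rng(0)
word_matrix = rng.normal(size=(len(vocab), 2))

# The N-dimensional document representation is just the matrix product.
doc_embedding = counts @ word_matrix
print(doc_embedding.shape)                  # (2,)
```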
Example: Wordcounts
Example: word weights
Example: projection
SRP steps
SRP steps
Hash each word with SHA-1 (because SHA-1 is available everywhere).
e.g., "bank" ->
bdd240c8fe7174e6ac1cfdd5282de76eb7ad6815
1011110111010010010000001100100011111110011100010111010011100110101
0110000011100111111011101010100101000001011011110011101101110101101
1110101101011010000001010111101101010000011110100111110111000101001
01101111100111110000100111101100010101101101101110101101101010
SRP steps
Cast the binary hash to [1, -1] (Achlioptas), generating a reproducible quasi-random projection matrix that will retain document-wise distances in the reduced space.
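A rough Python sketch of these two steps, hashing one word and casting the digest's bits to ±1 (160 dimensions is just the length of a single SHA-1 digest; the actual SRP library may draw more bits and differ in details):

```python
import hashlib
import numpy as np

def word_to_signs(word, dims=160):
    """Hash a word with SHA-1 and cast its bits to a +1/-1 vector.
    160 is one digest's worth of bits; the real library may use more."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:dims]
    return bits.astype(np.int8) * 2 - 1      # bit 0 -> -1, bit 1 -> +1

print(word_to_signs("bank")[:10])   # first ten signs for "bank": 1 -1 1 1 1 1 -1 1 1 1
```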
Put formally:
D = document, i = SRP dimension, W = vocabulary size, h = binary SHA hash, w = the list of vocabulary items, c = their counts.
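The equation itself did not survive extraction; reconstructed from the definitions above, the i-th SRP dimension of a document should be, roughly:

```latex
D_i = \sum_{j=1}^{W} \bigl( 2\, h_i(w_j) - 1 \bigr)\, c_j
```

where $h_i(w_j) \in \{0, 1\}$ is the $i$-th bit of the SHA hash of word $w_j$.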
Put informally: each word's count is added to or subtracted from each SRP dimension, depending on the corresponding bit of that word's hash.
A final trick
Word counts are log transformed so common words don't overwhelm dimensions.
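A sketch of the whole per-document computation with that log transform folded in (the log1p weighting and 160 dimensions are assumptions for illustration; the published library may differ):

```python
import hashlib
import numpy as np

def srp_embed(counts, dims=160):
    """Project a {word: count} dictionary into `dims` SRP dimensions.

    Each word's SHA-1 bits are cast to +1/-1 signs (as above), and its
    log-transformed count is added along those signs. The log1p weighting
    here is an assumption for illustration; the published code may differ.
    """
    embedding = np.zeros(dims)
    for word, count in counts.items():
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:dims]
        signs = bits.astype(np.int8) * 2 - 1   # 0 -> -1, 1 -> +1
        embedding += signs * np.log1p(count)
    return embedding

print(srp_embed({"whale": 12, "ship": 5, "sea": 9})[:5])
```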
Denny/Spirling preprocessing.
Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It
Hundreds of ways of pre-processing
Preprocessing for minimal dimensionality reduction.
Regularize as far as possible without making language-specific assumptions.
Do:
Preprocessing for minimal dimensionality reduction.
Regularize as far as possible without making language-specific assumptions.
Don't:
Tokenization
One tokenization rule: Unicode \w+.
"Francois doesn't have $100.00" ->
["francois", "doesn", "t", "have", "###", "##"]
One tokenization rule: Unicode \w+.
Catastrophically bad for languages that don't separate words with whitespace!
Chinese, Thai, and some Devanagari languages.
Unicode code-point based rules might be necessary: vowel splitting in Thai, character shingles in Chinese, etc.
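For instance, character shingling for Chinese could look like this sketch (illustrative only; not what the SRP library actually does):

```python
def char_shingles(text, n=2):
    """Overlapping character n-grams, for scripts without whitespace word boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_shingles("白鲸记是小说"))
# ['白鲸', '鲸记', '记是', '是小', '小说']
```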
Classifier suites:
Re-usable batch training code in TensorFlow.
One-hidden-layer neural networks can help transfer metadata between corpora.
Protocol: 90% training, 5% validation, 5% test.
Books only (no serials).
All languages at once.
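A hedged Keras sketch of the kind of one-hidden-layer classifier and 90/5/5 split described here; the layer sizes, optimizer, and random stand-in data are illustrative assumptions, not the actual training code:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins: 1,000 "documents" of 160-dimensional SRP features, 20 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 160)).astype("float32")
labels = rng.integers(0, 20, size=1000)

# 90% train / 5% validation / 5% test, as in the protocol above.
n = len(features)
train_end, val_end = int(0.90 * n), int(0.95 * n)
x_train, y_train = features[:train_end], labels[:train_end]
x_val, y_val = features[train_end:val_end], labels[train_end:val_end]
x_test, y_test = features[val_end:], labels[val_end:]

# One hidden layer, softmax over the candidate classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(160,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(20, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=5, batch_size=64)
print(model.evaluate(x_test, y_test))
```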
Classifiers trained on Hathi metadata can predict:
Library of Congress Classification
Instances | Class name (randomly sampled from full population) |
---|---|
461 | AI [Periodical] Indexes |
6986 | BD Speculative philosophy |
9311 | BJ Ethics |
40335 | DC [History of] France - Andorra - Monaco |
2738 | DJ [History of the] Netherlands (Holland) |
14928 | G GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class] |
17353 | HN Social history and conditions. Social problems. Social reform |
4703 | JV Colonies and colonization. Emigration and immigration. International migration |
23 | KB Religious law in general. Comparative religious law. Jurisprudence |
5583 | LD [Education:] Individual institutions - United States |
3496 | NX Arts in general |
6222 | PF West Germanic languages |
68144 | PG Slavic languages and literatures. Baltic languages. Albanian language |
157246 | PQ French literature - Italian literature - Spanish literature - Portuguese literature |
6863 | RJ Pediatrics |
Guesses for the text of "Moby Dick"
Class | Probability |
---|---|
PS American literature | 62.658% |
PZ Fiction and juvenile belles lettres | 30.729% |
G GEOGRAPHY. ANTHROPOLOGY. RECREATION | 5.385% |
PR English literature | 1.074% |
Why is "Moby Dick" classed as PR?
Top positive for class PR | Top negative for class PR |
---|---|
0.300% out (538.0x) | -0.294% as (1741.0x) |
0.289% may (240.0x) | -0.292% air (143.0x) |
0.258% an (596.0x) | -0.277% american (34.0x) |
0.239% had (779.0x) | -0.277% its (376.0x) |
0.238% are (598.0x) | -0.250% cried (155.0x) |
0.226% at (1319.0x) | -0.246% right (151.0x) |
0.221% english (49.0x) | -0.241% i (2127.0x) |
0.210% till (122.0x) | -0.231% days (82.0x) |
0.209% blow (26.0x) | -0.227% around (38.0x) |
0.208% upon (566.0x) | -0.205% back (164.0x) |
Classifying dates
Representing time as a ratchet:
To encode 1985 in the range 1980-1990: [0,0,0,0,0,1,1,1,1,1]
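A small sketch of that encoding, assuming one slot per year in the range (the function name is just for illustration):

```python
import numpy as np

def ratchet_encode(year, start=1980, end=1990):
    """One slot per year in [start, end); slots at or after the target year are set to 1."""
    return (np.arange(start, end) >= year).astype(int)

print(ratchet_encode(1985))   # [0 0 0 0 0 1 1 1 1 1]
```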
Classification of Lucretius.
Blue line is classifier probability for each year. Red vertical line is actual date. Outer bands are 90% confidence.