Dimensionality reduction as infrastructure

Ben Schmidt: Northeastern University

August 2017

Thanks

  • Columbia University, SIPA.
  • Peter Organisciak (Hathi Trust Research Center) for code and comments.

  • HTRC and National Endowment for the Humanities

Link to these slides:

benschmidt.org/Montreal

Github:

github.com/bmschmidt/SRP (python)

Outline

  1. Situationally ideal dimensionality reduction vs. universal, minimal dimensionality reduction.
  2. How random projection can enable a universal, minimal dimensionality reduction for texts.
  3. An exploration of 13 million books from the HathiTrust, in reduced form.
  4. Case study: reproducing library classification.

1. Current Methods.

Feature counts:

Tables of word counts for documents.

Distributed by HathiTrust, JSTOR, etc.

For books, feature-count files can be long!

The HathiTrust Extracted Features (EF) dataset is several terabytes.

Bridging the infrastructure for tokenization with the infrastructure for metadata.

Or: how can we distribute word counts as part of metadata profiles?

Prediction: Short, computer-readable embeddings of collection items will be an increasingly important shared resource for non-consumptive digital scholarship.

Embeddings promote access by opening new interfaces and forms of discoverability.

But embeddings deny access to classes of artifacts and communities that don't resemble their training set.

What's the best way to reduce dimensionality?

  1. Principal Components Analysis on the Term-Document matrix
  2. Latent Semantic Indexing
  3. Semantic Hashing, etc.

How do we reduce dimensionality?

  1. PCA on the words we can count.
  2. Top-n words.
  3. Topic models.

Reasons not to use the best methods.

  1. Computationally intractable: the full term-document matrix has trillions of entries; even the sparse representation has billions.
  2. Difficult to distribute; users at home can't embed documents in your space.
  3. Any embedded space is optimal only for the text collection it was trained on.

A Minimal, Universal Dimensionality Reduction for text should:

  1. Be domain-agnostic.
  2. Be language-agnostic.
  3. Be capable of accounting for any vocabulary.
  4. Make only general assumptions about human language.
  5. Be capable of working from existing feature-count datasets.
  6. Be easily implementable across platforms and languages.

Costs

  1. Inefficient use of dimensions (like top words).
  2. Dimensions cannot be cognizable, socially meaningful things.

Benefits

  1. Trivially parallelizable.
  2. Allows client-server interactions:
    • Addressing privacy and copyright concerns.
    • Reducing data footprints.
  3. Allows different corpora to exist in the same space:
    • Identifying duplicates.
    • Metadata extensions.
  4. Makes large datasets accessible for pedagogy.

2. Stable Random Projection.

Minimal, Universal reductions

  • MinHash?
  • The high-dimensional "Hashing Trick"
  • Locality-sensitive hashing

Classic dimensionality reduction (PCA, etc.)

Starts with a document of W words.

For each word, learn an N-dimensional representation.

Multiply the document vector by a W×N matrix to get an N-dimensional representation of the document (see the sketch below).

Example: Wordcounts

Example: word weights

Example: projection
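
A toy numpy sketch of that kind of projection (sizes and matrix values are made up; the random matrix stands in for learned weights such as PCA loadings):

    import numpy as np

    # Toy illustration of classic dimensionality reduction: a learned
    # W x N matrix projects a W-dimensional wordcount vector down to N dimensions.
    rng = np.random.default_rng(0)

    W, N = 5, 3                                 # 5-word vocabulary, 3 output dimensions
    doc_counts = np.array([2., 0., 1., 4., 0.]) # wordcounts for one document
    projection = rng.normal(size=(W, N))        # stands in for learned weights

    embedding = doc_counts @ projection         # the document's N-dimensional representation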

SRP steps

  1. Begin with wordcounts

SRP steps

  2. Take SHA-1 hashes of all words.

(Because SHA-1 is available everywhere).

eg: "bank"

bdd240c8fe7174e6ac1cfdd5282de76eb7ad6815
1011110111010010010000001100100011111110
0111000101110100111001101010110000011100
1111110111010101001010000010110111100111
0110111010110111101011010110100000010101

SRP steps

  3. Cast the bits of the hash to {+1, −1} (Achlioptas), generating a reproducible quasi-random projection matrix that retains document-wise distances in the reduced space.
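
A minimal Python sketch of the hash-and-cast step for a single word (illustrative only; the SRP library also handles requesting more dimensions than the 160 bits a single SHA-1 hash provides):

    import hashlib
    import numpy as np

    def srp_signs(word, n_dims=160):
        """SHA-1 the token and cast its bits to +1/-1 (sketch only)."""
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:n_dims]
        return bits.astype(np.int8) * 2 - 1     # 0 -> -1, 1 -> +1

    signs = srp_signs("bank")                   # one +1/-1 entry per SRP dimension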

Put formally

D_i = Σ_{j=1}^{W} h_i(w_j) · c_j

where D is the document, i is the SRP dimension, W is the vocabulary size, w_j is the j-th vocabulary word, c_j is its count in the document, and h_i(w_j) ∈ {+1, −1} is the i-th bit of the SHA-1 hash of w_j, cast to ±1.

Put informally.

  • Each dimension is the sum of the wordcounts for a random half of the vocabulary, minus the sum of the wordcounts for the other half.
  • Documents that share vocabulary will be closer on all dimensions.

A final trick

Word counts are log-transformed so that common words don't overwhelm individual dimensions.
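
Put together, a rough sketch of the whole reduction for a single document's wordcounts (the log1p weighting and the single-hash assumption are simplifications; github.com/bmschmidt/SRP is the reference implementation):

    import hashlib
    import math

    def srp_embed(counts, n_dims=160):
        """Sketch of SRP for one document.
        `counts` maps each token to its wordcount (e.g. from a feature-count file).
        Assumes log1p weighting and a single SHA-1 hash per word; the reference
        implementation may differ, e.g. in extending past 160 dimensions."""
        doc = [0.0] * n_dims
        for word, count in counts.items():
            weight = math.log1p(count)                     # damp very common words
            digest = hashlib.sha1(word.encode("utf-8")).digest()
            for i in range(n_dims):
                bit = (digest[i // 8] >> (7 - i % 8)) & 1  # i-th bit of the hash
                doc[i] += weight if bit else -weight       # add or subtract the weight
        return doc

    srp_embed({"call": 3, "me": 2, "ishmael": 1})          # a 160-number document vector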

Denny/Spirling preprocessing.

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Hundreds of ways of pre-processing

Preprocessing for minimal dimensionality reduction.

Regularize as far as possible without making language-specific assumptions.

Do:

  • Remove punctuation
  • Lowercase.
  • Replace numeric digits with '#'

Preprocessing for minimal dimensionality reduction.

Regularize as far as possible without making language-specific assumptions.

Don't:

  • Lemmatize
  • Remove stopwords
  • Remove infrequent words
  • Use bigrams.

Tokenization

One tokenization rule: unicode \w+.

  • "Francois doesn't have $100.00" ->

    ["francois","doesn","t","have","###","##"]`

One tokenization rule: unicode \w+.

Catastrophically bad for languages without whitespace tokenization!

Chinese, Thai, and some Devanagari-script languages.

Unicode code-point based rules might be necessary--vowel splitting in Thai, character shingles in Chinese, etc.
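
For example, a sketch of character shingling as a fallback (the shingle width of 2 is an arbitrary choice):

    def char_shingles(text, n=2):
        """Character n-grams ("shingles") as a fallback tokenization for
        scripts without whitespace word boundaries."""
        chars = [c for c in text if not c.isspace()]
        return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

    char_shingles("白鲸记是一部小说")   # -> ['白鲸', '鲸记', '记是', ...]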


3. Uses of the shared space.

13 million books in a scatterplot

Classification accuracy on Underwood/Sellars dataset.

Duplicate Detection

Similarity Search Engine

Classifier suites:

Re-usable batch training code in TensorFlow.

One-hidden-layer neural networks can help transfer metadata between corpora.

Protocol: 90% training, 5% validation, 5% test.

Books only (no serials).

All languages at once.
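
A minimal TensorFlow/Keras sketch of this kind of one-hidden-layer classifier (input width, hidden size, and class count are illustrative assumptions, not the exact architecture used):

    import tensorflow as tf

    # One-hidden-layer classifier over SRP document embeddings.
    N_DIMS, N_CLASSES = 1280, 220       # illustrative: SRP dimensions in, classes out

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(N_DIMS,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_x, train_y, validation_data=(val_x, val_y))  # 90/5/5 split upstream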

Classifiers trained on Hathi metadata can predict:

  1. Language
  2. Authorship among the top 1,000 authors with > 95% accuracy (too good to be true).
  3. Presence of multiple subject-heading components (e.g. '650z: Canada -- Quebec -- Montreal') with ~50% precision and ~30% recall.
  4. Year of publication for books, with median errors of ~4 years.

4. Learning Library Classifications

Library of Congress Classification

  • Shelf locations of books.
  • Widely used by research libraries in United States.
  • ~220 "subclasses" at the first level of resolution.

Randomly sampled from the full population:

Instances | Subclass | Class name
      461 | AI | [Periodical] Indexes
     6986 | BD | Speculative philosophy
     9311 | BJ | Ethics
    40335 | DC | [History of] France - Andorra - Monaco
     2738 | DJ | [History of the] Netherlands (Holland)
    14928 | G  | GEOGRAPHY. ANTHROPOLOGY. RECREATION [General class]
    17353 | HN | Social history and conditions. Social problems. Social reform
     4703 | JV | Colonies and colonization. Emigration and immigration. International migration
       23 | KB | Religious law in general. Comparative religious law. Jurisprudence
     5583 | LD | [Education:] Individual institutions - United States
     3496 | NX | Arts in general
     6222 | PF | West Germanic languages
    68144 | PG | Slavic languages and literatures. Baltic languages. Albanian language
   157246 | PQ | French literature - Italian literature - Spanish literature - Portuguese literature
     6863 | RJ | Pediatrics


Accuracy by language


Model Interpretability

Guesses for the text of "Moby Dick"

Class                                  | Probability
PS American literature                 | 62.658%
PZ Fiction and juvenile belles lettres | 30.729%
G GEOGRAPHY. ANTHROPOLOGY. RECREATION  | 5.385%
PR English literature                  | 1.074%

Why is "Moby Dick" classed as PR?

Top positive for class PR | Top negative for class PR
0.300% out (538.0x)       | -0.294% as (1741.0x)
0.289% may (240.0x)       | -0.292% air (143.0x)
0.258% an (596.0x)        | -0.277% american (34.0x)
0.239% had (779.0x)       | -0.277% its (376.0x)
0.238% are (598.0x)       | -0.250% cried (155.0x)
0.226% at (1319.0x)       | -0.246% right (151.0x)
0.221% english (49.0x)    | -0.241% i (2127.0x)
0.210% till (122.0x)      | -0.231% days (82.0x)
0.209% blow (26.0x)       | -0.227% around (38.0x)
0.208% upon (566.0x)      | -0.205% back (164.0x)

Online Classification Exploration

NYPL classed

Classifying dates

Representing time as a ratchet:

To encode 1985 in the range 1980-1990: [0,0,0,0,0,1,1,1,1,1] (one indicator per year; a year's bit switches on once the publication year is reached, and stays on).
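
A minimal sketch that reproduces this encoding (year range taken from the example above):

    def ratchet_encode(year, start=1980, end=1990):
        """One indicator per year in [start, end); a year's bit turns on
        once the publication year has been reached and stays on."""
        return [1 if y >= year else 0 for y in range(start, end)]

    ratchet_encode(1985)   # -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]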

Classification of Lucretius.

Blue line: classifier probability for each year. Red vertical line: actual date. Outer bands: 90% confidence.