Ben Schmidt
2018-01-28
Benschmidt.org/UNC
Choose 2 or 3 sources.
Post-its
Green: Working away on the task.
Red: Stuck on something; help!
White: If I’m typing, it’s to check the Huffington Post.
Task now: Using the sources at benschmidt.org/UNC, successfully run the first block of code.
library(wordVectors)
politics = read.vectors("politics.bin")
Word embeddings place words into a space that does the following:
Predicting words from context.
Source: Conor McDonald
Two dimensions is not enough! Neither is 3! 3d embeddings (a bad idea)!
Word2Vec uses a few hundred dimensions for each word. Here are some.
word | V1 | V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|---|---|
France | 0.01500453 | 0.018186183 | 0.08340119 | 0.009556154 | 0.016695607 | 0.007973551 |
Germany | 0.06001414 | 0.046422540 | 0.06336951 | 0.002588225 | -0.063688305 | -0.026276992 |
wine | -0.04239146 | -0.031893139 | -0.05403588 | 0.033243816 | 0.002391649 | 0.048410353 |
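To make these numbers less abstract, here is a minimal base-R sketch that compares the three words by cosine similarity, using only the six dimensions shown. The function name `cosine_sim` is made up for this example, and six dimensions out of a few hundred are far too few to draw conclusions from; it only illustrates the arithmetic.

```r
# Six of the few hundred dimensions, copied from the table above.
vecs <- rbind(
  France  = c(0.01500453, 0.018186183, 0.08340119, 0.009556154, 0.016695607, 0.007973551),
  Germany = c(0.06001414, 0.046422540, 0.06336951, 0.002588225, -0.063688305, -0.026276992),
  wine    = c(-0.04239146, -0.031893139, -0.05403588, 0.033243816, 0.002391649, 0.048410353)
)

# Cosine similarity: dot product divided by the product of the vector lengths.
# (Illustrative helper, not a wordVectors function.)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

france_germany <- cosine_sim(vecs["France", ], vecs["Germany", ])
france_wine    <- cosine_sim(vecs["France", ], vecs["wine", ])
print(c(france_germany = france_germany, france_wine = france_wine))
```

On these six columns, France comes out closer to Germany than to wine; the full vectors are what the real comparisons run on.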
Word Embeddings and Topic Models
Topic Models are better than word embeddings because:
Word embeddings are better than topic models because:
library(wordVectors)
news = read.vectors("~/word2vec_models/short_google_news.bin")
Major options are:
word2vec: Fast, general-purpose, laptop-friendly.
GloVe: Better theoretical grounding, better if you’re running many models of different dimensionality on a server. Scales poorly.
But at this point, you could go crazy
Swivel, Poincaré (hierarchical) embeddings, Bernoulli Embeddings, and more. ¯\_(ツ)_/¯
Short Answer: word2vec with skip-gram negative sampling.
Probably using my R package or gensim for Python.
Word2Vec on small (less than 1 million words) corpora
Think again
Try greatly expanding the window: the default is 12, but try 50.
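Why the window matters on a small corpus can be seen by counting training pairs. The sketch below is base R only; `skipgram_pairs` is a hypothetical helper, not part of wordVectors. It lists every (target, context) pair that a symmetric window produces. Real word2vec also shrinks the window at random and subsamples frequent words, which this sketch ignores.

```r
# Hypothetical helper (not part of wordVectors): list every (target, context)
# pair that a symmetric window of the given size produces.
skipgram_pairs <- function(tokens, window) {
  n <- length(tokens)
  pairs <- list()
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(n, i + window)
    for (j in lo:hi) {
      if (j != i) {
        pairs[[length(pairs) + 1]] <- c(target = tokens[i], context = tokens[j])
      }
    }
  }
  do.call(rbind, pairs)
}

# A made-up five-word "corpus".
tokens <- c("the", "professor", "was", "very", "helpful")
narrow <- skipgram_pairs(tokens, window = 1)
wide   <- skipgram_pairs(tokens, window = 2)

# A wider window squeezes more training pairs out of the same text.
print(nrow(narrow))
print(nrow(wide))
```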
Some notes on preparation:
The prep_word2vec function in my package handles all this, somewhat slowly.
Semantic vectors have meaning
“Capitalness” (?)
Creating a relational vector
male_female = model[["man"]] - model[["woman"]]
OR
male_female = model %>% extract_vectors(~ "man" - "woman")
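What such a relational vector does can be sketched with toy numbers in base R. Everything below is invented for illustration: the three-dimensional "embeddings", the word list, and the `cosine_sim` helper are assumptions, not output from a real model.

```r
# Toy 3-dimensional "embeddings", invented for illustration only.
toy <- rbind(
  man   = c( 1, 0, 0.1),
  woman = c(-1, 0, 0.1),
  king  = c( 1, 1, 0),
  queen = c(-1, 1, 0),
  apple = c( 0, 0, 1)
)

# Illustrative helper, not a wordVectors function.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# The relational vector points from "woman" toward "man".
male_female <- toy["man", ] - toy["woman", ]

# Score every word against that direction:
# positive leans toward "man", negative toward "woman".
gender_score <- apply(toy, 1, cosine_sim, b = male_female)
print(round(gender_score, 2))

# Moving "king" against the direction lands nearest "queen" in this toy space.
shifted <- toy["king", ] - male_female
nearest <- names(which.max(apply(toy, 1, cosine_sim, b = shifted)))
print(nearest)
```

In a real model the relationship is never this clean, which is why the nearest neighbours of a relational vector are worth reading as a list, not as a single answer.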
teaching_vectors %>% closest_to("hola")
## hola espanol todos fantastico porque
## -4.440892e-16 5.419217e-01 5.561170e-01 5.582298e-01 5.584173e-01
## interesante buen simpatico muy profesora
## 5.646364e-01 5.702220e-01 5.703893e-01 5.707483e-01 5.709447e-01
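The odd-looking score for "hola" itself (about -4.4e-16) is consistent with these numbers being cosine distances, that is, 1 minus cosine similarity, with a little floating-point noise. That reading is an assumption about the output above. A minimal base-R sketch with made-up vectors and an illustrative `cosine_dist` helper:

```r
# Cosine distance = 1 - cosine similarity (illustrative helper).
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

v <- c(0.3, -1.2, 0.7)    # made-up vector
w <- -v                   # points the opposite way

d_self     <- cosine_dist(v, v)  # zero, up to floating-point rounding
d_opposite <- cosine_dist(v, w)  # two, up to floating-point rounding
print(c(self = d_self, opposite = d_opposite))
```

So smaller means closer: a word is at distance (almost exactly) zero from itself, and the rest of the list is sorted outward from there.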
teaching_vectors %>% closest_to(teaching_vectors[["hola"]]-teaching_vectors[["spanish"]])
## hola hi goodmorning mmmmmkay wassup hello
## 0.2541650 0.6992011 0.7123197 0.7195401 0.7225225 0.7289345
## todos uhm hahahahah quot
## 0.7447951 0.7485577 0.7516170 0.7620057
“Semantics derived automatically from language corpora necessarily contain human biases,” Caliskan-Islam et al.
Translations to English from many gender-neutral languages such as Finnish, Estonian, Hungarian, Persian, and Turkish lead to gender-stereotyped sentences. For example, Google Translate converts these Turkish sentences with genderless pronouns: “O bir doktor. O bir hemşire.” to these English sentences: “He is a doctor. She is a nurse.”
We demonstrate here for the first time what some have long suspected (Quine, 1960)—that semantics, the meaning of words, necessarily reflects regularities latent in our culture, some of which we now know to be prejudiced.
First, our results suggest that word embeddings don’t merely pick up specific, enumerable biases such as gender stereotypes (Bolukbasi et al., 2016), but rather the entire spectrum of human biases reflected in language. […] Bias is identical to meaning, and it is impossible to employ language meaningfully without incorporating human bias.
Alternative: (Schmidt 2015, Bolukbasi et al., 2016):
What’s new here is not that bias has finally been proven, but that we can manipulate the vectors to work around it.
# Remove the he/she and man/woman directions from every vector in the model.
genderless = news %>%
  reject(news[["he"]] - news[["she"]]) %>%
  reject(news[["man"]] - news[["woman"]])
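The geometry behind rejecting a direction is ordinary vector rejection: subtract from each vector its projection onto the direction being removed. The sketch below is base R with made-up numbers; `reject_direction` is a hypothetical stand-in written for this example and is not claimed to match the package's internals.

```r
# Remove from `v` whatever part of it lies along `direction`.
# Hypothetical stand-in for illustration; not the wordVectors implementation.
reject_direction <- function(v, direction) {
  v - (sum(v * direction) / sum(direction^2)) * direction
}

# Made-up 3-dimensional vectors.
he      <- c( 1.0, 0.2, 0.5)
she     <- c(-1.0, 0.2, 0.5)
surgeon <- c( 0.6, 0.9, 0.1)

gender <- he - she
surgeon_genderless <- reject_direction(surgeon, gender)

# What is left has no component along the gender direction (dot product ~ 0).
print(surgeon_genderless)
print(sum(surgeon_genderless * gender))
```

Everything else about the vector is left alone, which is the point: the word keeps its other associations and loses only its position along the one direction.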
Debiased pairs from RateMyProfessors
# [1] "he->she" "hes->shes"
# [3] "himself->herself" "his->her"
# [5] "man->woman" "guy->lady"
# [7] "grandpa->grandma" "dude->chick"
# [9] "wife->husband" "grandfather->grandmother"
# [11] "dad->mom" "uncle->aunt"
# [13] "fatherly->motherly" "brother->sister"
# [15] "actor->actress" "grandfatherly->grandmotherly"
# [17] "father->mother" "genius->goddess"
# [19] "arrogant->snobby" "priest->nun"
# [21] "dork->ditz" "handsome->gorgeous"
# [23] "atheist->feminist" "himmmm->herrrr"
# [25] "kermit->degeneres" "mans->womans"
# [27] "hez->shez" "himmm->herrr"
# [29] "trumpet->flute" "checkride->clinicals"
# [31] "gay->lesbian" "surgeon->nurse"
# [33] "daddy->mommy" "cool->sweet"
# [35] "monsieur->mme" "jolly->cheerful"
# [37] "jazz->dance" "wears->outfits"
# [39] "girlfriends->boyfriends" "drle->gentille"
# [41] "gentleman->gem" "charisma->spunk"
# [43] "egotistical->hypocritical" "cutie->babe"
# [45] "wingers->feminists" "professore->molto"
# [47] "gruff->stern" "demonstrations->activities"
# [49] "goofy->wacky" "coolest->sweetest"
# [51] "architect->interior" "sidetracked->frazzled"
# [53] "likeable->pleasant" "grumpy->crabby"
# [55] "charismatic->energetic" "cisco->cna"
# [57] "masculinity->gender" "girlfriend->boyfriend"
Aligning multiple models.
Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
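Separately trained models end up in arbitrarily rotated spaces, so they have to be aligned before their vectors can be compared. One common approach, and the one used in the paper named above, is orthogonal Procrustes: find the rotation that best maps one model's vectors onto the other's for a shared vocabulary. A minimal base-R sketch on made-up data (`procrustes_rotation` is an illustrative name, and the two "models" are invented):

```r
# Find the orthogonal matrix R that minimizes the Frobenius norm of (A %*% R - B).
# Rows of A and B must be the same words in the same order.
procrustes_rotation <- function(A, B) {
  s <- svd(t(A) %*% B)
  s$u %*% t(s$v)
}

# Made-up "model A": four words in two dimensions.
A <- rbind(c(1, 0), c(0, 1), c(1, 1), c(2, -1))
rownames(A) <- c("w1", "w2", "w3", "w4")

# Made-up "model B": the same space rotated by 30 degrees,
# standing in for a second, separately trained model.
theta <- pi / 6
Q <- rbind(c(cos(theta), -sin(theta)),
           c(sin(theta),  cos(theta)))
B <- A %*% Q

R <- procrustes_rotation(A, B)
aligned <- A %*% R

# After alignment the two models agree, up to floating-point error.
print(max(abs(aligned - B)))
```

With real models the fit is never exact; the words that still sit far apart after alignment are the candidates for semantic change.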
Hansard Corpus