Chapter 3 Training your own model

If you want to train a model on your own text to look at before the workshop, that will take a little additional effort.

These models can take anywhere from fifteen minutes to several hours to run, depending on how large a corpus you have, so I do not recommend training one during the workshop.

If this is too intimidating, remember that you can also wait until after the workshop to follow these instructions, and work with the pre-circulated models during the workshop itself.

As before, you simply cut and paste the R code into your console.

3.1 Building test data

We begin by loading the wordVectors package.

library(wordVectors)

First we build up a test file to train on. As an example, we’ll use a collection of cookbooks from Michigan State University. The code below downloads the zip file from the Internet if it isn’t already present.

If you don’t want to work with cookbooks, don’t download this!

# Download the MSU cookbook corpus only if it isn't already on disk.
if (!file.exists("cookbooks.zip")) {
  download.file("http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip","cookbooks.zip")
}
# Unpack the texts into a folder called "cookbooks".
unzip("cookbooks.zip",exdir="cookbooks")

The instructions below work with a folder called “cookbooks”. If you want to work with your own texts, you have two options:

  1. Change the code below so that the origin folder is something other than ‘cookbooks’ (see the sketch after this list).
  2. Instead, just store your own texts in the folder called ‘cookbooks,’ and remember that you aren’t actually working with cookbooks. If you do this, make sure you first remove all the cookbooks from the folder so that only your texts remain.
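For option 1, the only change is to the origin (and, ideally, destination) arguments of the prep_word2vec call that appears below. As a sketch, assuming your texts live in a folder called “my_texts” (a placeholder name):

prep_word2vec(origin="my_texts",destination="my_texts.txt",lowercase=T,bundle_ngrams=1)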

3.2 Tokenizing and preparing

Then we prepare a single file for word2vec to read in. This does a few things:

  1. Creates a single text file with the contents of every file in the original folder;
  2. Uses the tokenizers package to clean and lowercase the original text;
  3. If bundle_ngrams is greater than 1, joins common bigrams into a single token. For example, “olive oil” may be joined into “olive_oil” wherever it occurs. This is great for individual models, but will make it harder to compare between models.

You can also do this step outside of R: particularly for large files, that will be much faster. (For reference: in a console, perl -ne 's/[^A-Za-z_0-9 \n]/ /g; print lc $_;' cookbooks/*.txt > cookbooks.txt will do much the same thing on ASCII text in a couple of seconds.) If you do this and want to bundle ngrams, you’ll then need to call word2phrase("cookbooks.txt","cookbook_bigrams.txt",...) to build up the bigrams; call it twice if you want 3-grams, and so forth, as sketched below.
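If you go that route, the two-pass version might look something like the following sketch (the output file names here are placeholders, and word2phrase takes additional arguments controlling how aggressively it joins phrases):

word2phrase("cookbooks.txt","cookbook_bigrams.txt")
word2phrase("cookbook_bigrams.txt","cookbook_trigrams.txt")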

prep_word2vec(origin="cookbooks",destination="cookbooks.txt",lowercase=T,bundle_ngrams=1)

To train a word2vec model, use the function train_word2vec. This actually builds up the model. It uses an on-disk file as an intermediary and then reads that file into memory.

3.3 Training

model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=3,negative_samples=5)

A few notes:

  1. The vectors parameter is the dimensionality of the representation. More vectors usually means more precision, but also more random error and slower operations. Reasonable choices usually fall in the range 100-500.
  2. The threads parameter is the number of processors to use on your computer. On a modern laptop, the fastest results will probably be between 2 and 8 threads, depending on the number of cores.
  3. iter is how many times to read through the corpus. With fewer than 100 books, it can greatly help to increase the number of passes; if you’re working with billions of words, it probably matters less. One danger of too low a number of iterations is that words that aren’t closely related will seem to be closer than they are.
  4. Training can take a while. On my laptop, it takes a few minutes to train these cookbooks; larger models take proportionally more time. Because more iterations help reduce noise, don’t be afraid to set things up to require a lot of training time (as much as a day!).
  5. One of the best things about the word2vec algorithm is that it does work on extremely large corpora in linear time.
  6. In RStudio I’ve noticed that this sometimes appears to hang after a while: the percentage bar stops updating. If you check system activity, it actually is still running, and will complete.
  7. If at any point you want to read in a previously trained model, you can do so by typing model = read.vectors("cookbook_vectors.bin").

Now you can try it out to make sure it works. (If the word ‘fish’ doesn’t appear in your corpus, this particular query won’t work; substitute a word that does, as in the sketch below.)

model %>% closest_to("fish")
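If ‘fish’ isn’t in your vocabulary, swap in any word you know appears in your texts. For example (the word here is just a placeholder, and the second argument is the number of nearest words to return):

model %>% closest_to("butter",20)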