A zoomable map of language from news articles

This is an interactive, zoomable visualization of Google's word2vec model of Google News. I've laid out the million most common words using T-SNE. Regions of the plot correspond to distinct vocabulary clusters in news articles. To the far left, you can see the language of baseball; to the far right, the language of American politics. British politicians are towards the middle of the plot. At the very middle is geography. If you don't know what word2vec is, I wrote some description here and here. Some areas that look relatively empty when fully zoomed out (for example, around the word "EU" in the lower middle) are densely populated once you zoom in.

Although word2vec has a lot more dimensions than you see here, I find that this makes it easier to get a sense of what's in the corpus driving a word2vec model, and how well that model is working. It takes a long time to build a million-point T-SNE diagram, though: over a day, even using some shortcuts.

This is kind of an interesting visualization problem as well. You can't just throw a million datapoints over the Internet to someone's browser. And an SVG will choke on that many elements. You could serve the relevant points from a database, but that's too compute-heavy. Here, I've built a static website that loads the needed data on zoom, just the way map tiles work, effectively fitting 1,000,000 points into a single interactive plot. This is neat; it's not something I've seen done before in just this way. Points are drawn on a canvas, and a separate quadtree for each tile finds the point closest to the cursor. This strategy should scale to any number of points, since the browser only ever loads the tiles it needs for the current view. Code (in a very messy state) is on GitHub here.
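To make the map-tile idea concrete, here is a minimal sketch of how points could be split into tiles ahead of time. It is not the code from the repository; it assumes coordinates scaled to the unit square, words sorted most common first, and a fixed per-tile capacity. The names `tile_key`, `build_tiles`, and `tiles_for_view` are made up for illustration.

```python
from collections import defaultdict

def tile_key(x, y, zoom):
    """Return (zoom, col, row) for the tile containing (x, y) in the unit square."""
    n = 2 ** zoom                      # n x n tiles at this zoom level
    col = min(int(x * n), n - 1)       # clamp so x == 1.0 lands in the last tile
    row = min(int(y * n), n - 1)
    return (zoom, col, row)

def build_tiles(points, capacity, max_zoom):
    """Assign each point to the shallowest tile that still has room.

    points: iterable of (word, x, y), assumed sorted most common first,
    so common words show up when zoomed out and rarer ones on zooming in.
    Tiles at max_zoom take whatever is left over, ignoring capacity.
    """
    tiles = defaultdict(list)
    for word, x, y in points:
        for zoom in range(max_zoom + 1):
            key = tile_key(x, y, zoom)
            if len(tiles[key]) < capacity or zoom == max_zoom:
                tiles[key].append((word, x, y))
                break
    return dict(tiles)

def tiles_for_view(x0, y0, x1, y1, zoom):
    """List the tile keys at levels 0..zoom that overlap the viewport."""
    keys = []
    for z in range(zoom + 1):
        _, c0, r0 = tile_key(x0, y0, z)
        _, c1, r1 = tile_key(x1, y1, z)
        keys.extend((z, c, r)
                    for c in range(c0, c1 + 1)
                    for r in range(r0, r1 + 1))
    return keys
```

In a static site, each key could map to one file (say `tiles/<zoom>/<col>/<row>.json`, a hypothetical layout), and the browser would fetch only the files returned by `tiles_for_view` for the current viewport.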

I made this primarily for a different T-SNE visualization I'm not done with yet, so I haven't fully polished it. Still, there are some additional things that would be nice. Search would be particularly useful. Since not all the data is loaded at once, typing in an arbitrary word and getting its location is difficult to do in a static framework. I thought about implementing a hierarchical index using the filesystem to make queries possible. It would be easy to do with a database, of course. But that's no fun.
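One way a hierarchical filesystem index might look is a prefix tree where each node is a small file. The sketch below is only an assumption about how that could work, not something that exists in the project: `build_index` and `lookup` are hypothetical names, and the index is kept as an in-memory dict whose keys stand in for file paths.

```python
def build_index(entries, max_bucket, prefix=""):
    """Split a {word: (x, y)} mapping into buckets keyed by word prefix.

    A bucket with at most max_bucket words is stored whole. A larger one is
    marked as split and its words are pushed down one character deeper;
    a word equal to the prefix itself stays at the split node.
    """
    if len(entries) <= max_bucket:
        return {prefix: {"split": False, "words": dict(entries)}}
    here = {w: xy for w, xy in entries.items() if len(w) == len(prefix)}
    out = {prefix: {"split": True, "words": here}}
    groups = {}
    for w, xy in entries.items():
        if len(w) > len(prefix):
            groups.setdefault(w[:len(prefix) + 1], {})[w] = xy
    for child_prefix, group in groups.items():
        out.update(build_index(group, max_bucket, child_prefix))
    return out

def lookup(index, word):
    """Walk down from the root bucket; each step stands in for one file fetch."""
    prefix = ""
    while True:
        bucket = index.get(prefix)
        if bucket is None:             # no such file: the word is not indexed
            return None
        if word in bucket["words"]:
            return bucket["words"][word]
        if not bucket["split"] or len(prefix) >= len(word):
            return None
        prefix = word[:len(prefix) + 1]
```

Each key in the returned dict would become one small static file, so a search costs a handful of requests (at most one per character of the query) rather than a download of the whole vocabulary.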