Over the last few weeks of class, I want each of you to present one of the algorithms we’ve looked at, applied to data from outside the sample sets we’ve used in class.
As everyone has started to get their hands on their data, you’ve been able to do some useful exploratory analysis of the sort we did in the first class. For this assignment, I want you to take on one of the slightly more complicated algorithms we’re talking about and apply it in R to your own files.
You don’t need to present at exactly the moment in the course when we talk about your algorithm. It is important that we spread the presentations out a bit.
So here’s a summary of the algorithms we’ve looked at, and those we will, so you can choose a time to present.
Choose one of the following. You should be able to explain in layman’s terms what the model does. (You probably won’t understand the math.)
Aim to put up a sketchy blog post with images and excerpts ahead of class; talk us through the post in class, with additional slides or references to your RStudio browser as necessary. Aim to talk for about 8-12 minutes, and include some exploratory exposition of your data before getting into the particular method.
After the presentation, bundle together all of your code and data in a single .Rmd file and send it to me for comment. It is critically important that every image you show be generated by code and data that I am able to run. This is not for my sake, but yours: if you’re going to use any of this work for your final projects, you can’t just be winging it.
Individual methods you may use (grouped by week)
Week 1. Vector-Space Approaches
- Principal Components Analysis (general purpose data: make a chart and read it.)
- Cosine Similarity (vocabulary-specific; amenable to network representations, if you want to do that. Feel free to take the networks into Gephi, though I can show you how to do network viz in R.)
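As a rough starting point for either vector-space method, here is a minimal sketch in base R. The document names, words, and counts are all invented for illustration; substitute your own document-term matrix. It runs a principal components analysis with `prcomp()` and computes cosine similarity by hand.

```r
# Toy document-term matrix: rows are documents, columns are word counts.
# Every name and number here is made up for illustration.
counts <- matrix(
  c(10, 2, 0, 5,
     8, 3, 1, 4,
     0, 9, 7, 1,
     1, 8, 9, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("docA", "docB", "docC", "docD"),
                  c("love", "ship", "whale", "heart"))
)

# Convert counts to relative frequencies so long documents don't dominate.
freqs <- counts / rowSums(counts)

# Principal components analysis with base R's prcomp().
pca <- prcomp(freqs, center = TRUE, scale. = FALSE)

# The chart to read: each document placed on the first two components.
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(freqs))

# Cosine similarity: scale each row to unit length, then take dot products.
unit <- freqs / sqrt(rowSums(freqs^2))
cosine <- unit %*% t(unit)
round(cosine, 2)
```

The `cosine` matrix can be read as a weighted adjacency matrix, which is one way to get from similarities to a network.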
Week 2. Classification and Comparison
There are several possible classification models. These require you to have data with classes that you want to try to expand. One archetypal example we’ll be looking at is: can you tell the difference between poetry and prose based on vocabulary? What words are indicative of one or the other?
Some classification algorithms you can try.
- Naive Bayes
- Logit regression
- K-nearest-neighbor classification
- Support Vector Machines
Feel free to try more than one. (I doubt you’ll make it all the way down to SVMs). The challenge with these will be figuring out exactly how to
You could use these algorithms to find all sorts of other things. Many may work well in conjunction with some of the other algorithms we’ve used, such as principal components analysis to reduce dimensions before running a logit regression.
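To make the combination of principal components analysis and logit regression concrete, here is a hedged sketch using only base R. The "poetry" and "prose" data are simulated random numbers, not real texts, and the choice of five components is arbitrary; treat both as assumptions to replace with your own data and judgment.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for word frequencies: 100 "poems" and 100 "prose"
# passages, 20 features each. Entirely invented for illustration.
n <- 100
n_words <- 20
poetry <- matrix(rnorm(n * n_words, mean = 0.5), nrow = n)
prose  <- matrix(rnorm(n * n_words, mean = 0.0), nrow = n)

x <- rbind(poetry, prose)
label <- factor(rep(c("poetry", "prose"), each = n))

# Reduce 20 dimensions to the first 5 principal components.
pcs <- prcomp(x, center = TRUE)$x[, 1:5]

# Logit regression: model the probability that a text is poetry.
df <- data.frame(is_poetry = as.integer(label == "poetry"), pcs)
fit <- glm(is_poetry ~ ., data = df, family = binomial)

# Classify each text by whether its fitted probability exceeds 0.5.
pred <- ifelse(predict(fit, type = "response") > 0.5, "poetry", "prose")
accuracy <- mean(pred == label)
table(predicted = pred, actual = label)
```

Note that `accuracy` here is measured on the same texts the model was fit to, so it flatters the model. Holding out some texts for testing gives a more honest number.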
Comparison:
- Dunning Log-Likelihood (Read Ted Dunning’s paper or the Monk explanation).
- Mann-Whitney. (Find and read Ted Underwood’s old approach).
- Hand division of group A over group B.
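The three comparison measures listed above can be sketched in a few lines of base R. The log-likelihood function below uses the simplified two-term form common in corpus linguistics rather than the full contingency-table version, so check it against the reading before relying on it. The "hand division" line is my reading of that item as a simple ratio of relative frequencies. The function name `dunning_g2` and all counts are invented for illustration.

```r
# Simplified two-term log-likelihood (G2) for one word across two corpora.
# a, b: counts of the word in corpus A and B; size_a, size_b: total words.
dunning_g2 <- function(a, b, size_a, size_b) {
  expected_a <- size_a * (a + b) / (size_a + size_b)
  expected_b <- size_b * (a + b) / (size_a + size_b)
  # Treat 0 * log(0) as 0 so that absent words don't produce NaN.
  term <- function(obs, expected) {
    ifelse(obs == 0, 0, obs * log(obs / expected))
  }
  2 * (term(a, expected_a) + term(b, expected_b))
}

# A word appearing 120 times in 10,000 words of A, 30 times in 10,000 of B.
g2 <- dunning_g2(120, 30, 10000, 10000)

# Plain division: how many times more frequent is the word in A than in B?
ratio <- (120 / 10000) / (30 / 10000)

# Mann-Whitney works on per-document frequencies rather than corpus totals.
# These relative frequencies of one word in five documents each are made up.
freq_a <- c(0.012, 0.009, 0.015, 0.011, 0.014)
freq_b <- c(0.004, 0.006, 0.003, 0.007, 0.005)
mw <- wilcox.test(freq_a, freq_b)
mw$p.value
```

The difference in inputs matters: log-likelihood can be driven by one document that uses a word constantly, while Mann-Whitney asks whether the word is consistently more frequent across documents.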
Clustering and modeling:
These will require you to have a relatively heterogeneous set of things whose innate similarities and differences you want to explore, and for which you probably don’t have metadata.
- Topic Modeling. The most seductive of the bunch. Good for large collections of text with no metadata.
- K-means clustering. See what the emergent categories are in a set of unclustered items once you’ve designed the parameters. This would work well for long lists of anything.
- Hierarchical clustering. Something we’ve seen a lot of in pieces like Jockers and Alison.
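For the two clustering methods, base R already has what you need. The sketch below runs `kmeans()` and `hclust()` on simulated points; the data, the choice of three clusters, and the Ward linkage are all assumptions made for illustration, not recommendations for your own files.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Invented data: 30 items with two numeric features, in three loose groups.
items <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 4), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)
rownames(items) <- paste0("item", seq_len(nrow(items)))

# K-means: you choose the number of clusters up front.
km <- kmeans(items, centers = 3, nstart = 20)
table(km$cluster)

# Hierarchical clustering: build the full tree, then decide where to cut.
hc <- hclust(dist(items), method = "ward.D2")
plot(hc)  # draws the dendrogram
groups <- cutree(hc, k = 3)

# Compare the two sets of assignments.
table(kmeans = km$cluster, hierarchical = groups)
```

With real data the two methods will often disagree, and looking at which items move between clusters is usually more interesting than the clusters themselves.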
Geography is less an algorithm than a programming skill set, at least as we’ll deal with it here. It will require you to have texts or data that make heavy use of spatial elements. If you are working with texts, you can try running a “named entity extractor” to pull out placenames. This will be easier if you have a list of places that actually have latitude and longitude (like, for example, the data we worked with earlier).
If you have longitude and latitude, you’ll just need some features that can be usefully expressed geographically. Although in many cases it makes more sense to make maps using a GIS package like QGIS, you should stay in the ggplot2 environment for now. Before class we’ll give out some
For geography, no Mercator projections will be allowed. You can split all the infinitives or contract all the words you want in this class. This is my only pet peeve I’ll be enforcing on you.
Be sure to integrate geographical elements with other pieces of metadata about the points you’re plotting.
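Here is one possible shape for such a map in ggplot2, offered as a sketch rather than a template. The `places` data frame, including the `mentions` and `genre` columns, is hypothetical. Note that `coord_map()` defaults to a Mercator projection, so the projection has to be named explicitly; Albers is used here only as an example, and the two standard parallels are a guess suited to the continental United States.

```r
library(ggplot2)  # drawing with coord_map() also needs the mapproj package

# Hypothetical places with coordinates and two pieces of metadata.
places <- data.frame(
  name     = c("Boston", "Chicago", "New Orleans", "San Francisco"),
  lon      = c(-71.06, -87.63, -90.07, -122.42),
  lat      = c(42.36, 41.88, 29.95, 37.77),
  mentions = c(40, 25, 12, 18),
  genre    = c("novel", "novel", "travel", "travel")
)

# Point size and colour carry the metadata; position carries the geography.
p <- ggplot(places, aes(x = lon, y = lat)) +
  geom_point(aes(size = mentions, colour = genre)) +
  geom_text(aes(label = name), vjust = -1.2, size = 3) +
  coord_map(projection = "albers", lat0 = 30, lat1 = 45) +
  labs(x = NULL, y = NULL)
```

Typing `p` at the console draws the map. Mapping size and colour to metadata columns is the simplest way to tie the geography to the other things you know about each point.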