This is an experiment in describing how language works with a walk-through narration. Please scroll down or use the arrow keys to read this story. If you’ve been here before and want to just explore the map, click here.
A couple years ago, I put up an interactive that allows people to explore gendered language in teaching evaluations from the website Rate My Professors. A lot of people have found it interesting or useful for exploring how different expectations about gender structure the way students see their instructors in their classroom.
That site was also–intentionally–an incredibly undirected experience. You have to come up with your own words to search for; as a result, it’s much more useful for people who have already read a lot of teaching evaluations and know what language to search for.
This web site is a supplement to that one; it uses a new visualization method to provide more of a guided tour of a map of discourse that includes over 100,000 words used in evaluations. The goal here is to use some tools from natural language processing and machine learning to provide multiple points of entry into understanding how the discourse of teaching evaluations works.
Those tools give an anchor for exploring the question: what kinds of structures in language treat men and women differently?
At its heart is the chart you see beside this text: an interactive map of the language used in teaching evaluations. There are more than 100 million words in the Rate My Professors set; I first use a method called “word2vec” to make those words more computable, and then cast that model into a two-dimensional space using the LargeVis algorithm.1 Each point represents a single word; words near each other tend to be used in similar contexts. More common words are drawn larger; as you zoom in, you can see the rarer vocabularies in a region.
Hover over any point to see what word it is. You can also click on a word to go to my other site to view the gender breakdown by discipline in that word’s usage. (If you’re on a phone, you might have to wait until the end.)
There are too many words used in evaluations to put them all on the screen at once; this web site lets us zoom in to progressively load more words in any interesting area.
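One way to picture this kind of progressive loading is a quadtree of tiles in which the most common words live in the shallowest tiles and rarer words only appear in deeper ones. The sketch below is an illustration under that assumption, not a description of how this site is actually built; the function names, the `per_tile` limit, and the coordinates and counts in the toy data are all hypothetical.

```python
from collections import defaultdict

def tile_key(x, y, depth):
    """Quadtree tile containing (x, y) in the unit square at a given depth."""
    n = 2 ** depth
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return (depth, col, row)

def build_tiles(points, per_tile=2, max_depth=8):
    """Assign each (word, x, y, count) to the shallowest tile with room.

    Words are visited from most to least frequent, so common words end up
    in shallow tiles and rare words are pushed into deeper ones.
    """
    tiles = defaultdict(list)
    for word, x, y, count in sorted(points, key=lambda p: -p[3]):
        for depth in range(max_depth + 1):
            key = tile_key(x, y, depth)
            if len(tiles[key]) < per_tile:
                tiles[key].append(word)
                break
        else:
            # Every tile on the path is full: store in the deepest one anyway.
            tiles[tile_key(x, y, max_depth)].append(word)
    return tiles

def visible_words(tiles, x, y, zoom):
    """Words to draw for a view centred on (x, y) at an integer zoom level.

    Simplification: the viewport is treated as the single tile under the
    centre point at each depth up to `zoom`.
    """
    words = []
    for depth in range(zoom + 1):
        words.extend(tiles.get(tile_key(x, y, depth), []))
    return words

# Toy data: positions and counts are invented for illustration only.
points = [
    ("teacher", 0.10, 0.10, 900),
    ("class", 0.60, 0.70, 800),
    ("ur", 0.13, 0.12, 300),
    ("skool", 0.12, 0.11, 40),
    ("teachin", 0.11, 0.13, 15),
]
tiles = build_tiles(points, per_tile=2)
for zoom in range(3):
    print(zoom, visible_words(tiles, 0.12, 0.12, zoom))
```

In this toy setup the rarest word only shows up once the view has zoomed two levels into its neighborhood, which mirrors the behavior described above.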
Here’s a simple example of a cluster of language. A certain percentage of students like to spell the word “school” as “skool.” What other words do those students tend to use? The algorithms build us up a pretty cluster; they also love to use words including
because, and all sorts of words with the final “g” dropped–
teachin, and so forth. More common words are bigger; in this cluster
ur are some of the most common, but you need to zoom far in to see
This two-dimensional representation isn’t all of the information captured by a word2vec model, so I’ve exposed just a little bit more here as well. It can show some basic things about the sentiment and genderedness of the words through color. Here, I’m colorizing the words by how strongly they rate on a vector between “good” and “bad.” Red is bad and blue is good; so
urself tend to indicate a bad review, while
lolz tend to show up with good words.2
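The coloring described above can be sketched as a projection of each word vector onto an axis running from “bad” to “good.” This is a simplified illustration: it uses a single word pair rather than the combined axis described in the footnote, and the three-dimensional vectors are toy values invented for the example, not output from the real model. The helper names are hypothetical.

```python
import math

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def sentiment_score(word_vec, good_vec, bad_vec):
    """Cosine between a word and the bad-to-good axis; positive leans 'good'."""
    axis = normalize(subtract(good_vec, bad_vec))
    return dot(normalize(word_vec), axis)

def color_for(score):
    """Map a score to the two colors used on the chart: blue good, red bad."""
    return "blue" if score > 0 else "red"

# Toy vectors, invented for illustration; a real word2vec model would
# typically have a few hundred dimensions.
toy = {
    "good": [1.0, 0.2, 0.0],
    "bad": [-1.0, 0.2, 0.0],
    "helpful": [0.8, 0.5, 0.1],
    "rude": [-0.7, 0.4, 0.2],
}
scores = {w: sentiment_score(v, toy["good"], toy["bad"]) for w, v in toy.items()}
for word, score in scores.items():
    print(word, round(score, 3), color_for(score))
```

The same projection works for the gender coloring by swapping in a different axis.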
Sentiment, it turns out, is incredibly important in the way the full model is placed into two dimensions. Most of the words are in a single, main cluster; when you look at the chart at that scale, the left half of the screen is dominated by good words, and the right half by bad ones.
Gender, on the other hand, is relatively evenly distributed. Here I’ve changed the coloring to show how relatively male (blue) or female (red) a word is. Most words show some gender skew, and many general topics skew towards one gender or another. But as a rule, it seems to be relatively easy for words that describe men to show up right next to similar words describing women. This is good; it means that we can use this overall map to see how gender plays out inside local clusters of meaning.
For example: the default plot on my original site looked at how men are much more likely to be described as “funny” than women are. Is that because funniness in general is usually ascribed to men, or just the word “funny”? If we zoom in on a portion of the chart that includes a lot of more specific comedy-related words, it’s pretty clear that it’s the former.
jokes are all male-biased words; there are a few
grandmas out there, but there’s no question that the overall balance is towards men getting the funny words.
One of the most-searched-for words on my other site is “bossy,” in part because that’s the contrast that Claire Cain Miller seeded people with at the Upshot. It’s not, in a lot of ways, the best word to use, because it’s actually relatively rare in evaluations.
But if we look at the general neighborhood, I think it becomes possible to see that it’s not just an outlier; there’s a wide variety of language in the neighborhood that corresponds to something like unprofessional behavior. Men are called out for being arrogant and pompous; but there’s simply a richer and more heavily deployed vocabulary around behavior by women that makes students feel bad about themselves. (This is pretty much the conclusion that vector space models tend to promote for me.) Even “patronizing,” which etymologically should be about men, is skewed female.
“Unprofessional” is probably the key term in this constellation, and the one that seems to carry water for bias most easily.
There are some other suggestive patterns of bias elsewhere in the negative cluster. Here’s a portion of the graph that describes disorganization and poor lecturing. Unlike unprofessionality, it doesn’t skew overall towards one gender or the other; men ramble, mumble, and are generally incoherent; women contradict themselves, lose the thread, and are just generally annoying.
Just reading the words, though, suggests an interesting overall pattern. As a rule, in this region, men tend to be described in terms of what they do while women are described in terms of what they are. Women are “scatterbrained” or “flighty;” men just ramble or act disorganized. This distinction seems like one that should be possible to pull out of the vector space overall, but it’s one I have a difficult time articulating precisely.
Other areas highlight the things students bring up differently in evaluations. In the section in which students complain about grades, almost every word is skewed female.
Other clusters give a sense of the predominant pattern of use for a word. For example, I initially assumed “children” in a review described complaints about professors talking about their kids. But looking at the larger cluster of vocabulary around the term, you can see that actually, it shows up when evaluators complain that their professors treat them like
kindergarteners (and every possible spelling thereof) and the like. This taps, again, into what strikes me as among the most important dynamics to consider around gender and teaching evaluations: that students challenge or malign the authority of female instructors more often.
Now let’s zoom out to the other side of the chart, which shows how positive language works. Here’s an area of pure praise–these are the sorts of words destined to be pulled off of teaching evaluations straight into cover letters and promotion dossiers. Within the cluster, there’s a mild male-female cleavage in space and a clearer one in meaning. Men are more often
geniuses; women are more often
caring. I’m particularly fascinated by ‘humble,’ because it kind of gives the lie to the idea that these words are simply reflecting actual differences in teaching style. Even positing, for the Larry Summerses of the world, that men simply are more brilliant and women more caring–can you honestly imagine that there are men who display such immense humility in the classroom that students are compelled to record it? What would such a creature look like? Would a student even notice?
The alternate explanation–that students look at men and decide they need to mark them as “arrogant” or “humble”, and at women and decide to mark them as “caring” or “cold”–strikes me as far more compelling.
This is not the only cluster of positive language, of course. To the north is the valley of superlatives; men are the best, smartest, and funniest, while women are the sweetest, warmest, and dearest. Spend some time around this part of the chart, and you’ll see these same patterns again and again.
These gender ratios can tell you things that aren’t just about bias. Check out this cluster about math, for instance; every word is heavily male-skewed, much like mathematics professors themselves, except for the word “fractions,” which isn’t something you should be learning about in a college math class. Remedial math is usually taught by women, this suggests.
So is the situation entirely grim? There is one dog that fails to bark here; at least on Rate My Professors, there doesn’t seem to be much of a difference in language used to describe physical attractiveness. There is, of course, a major prompting effect on that web site, which encourages students to give chili peppers to faculty members they think are hot; but still, I found this one somewhat surprising. Of course there are different languages for men and women (handsome vs beautiful); but in general, there really isn’t much of a difference to talk about.
The same rule of quantity holds for clothing–if anything, it looks to me like there’s comparatively more conversation going on about men’s clothing than women’s. Let me be clear; that doesn’t mean that the way in which descriptions of bodies and clothes are deployed is equal or fair. “Bras” are mentioned far more than is appropriate in evaluations (the appropriate level, by the way, is pretty close to “not at all”). I suspect there’s strong reason to think that the tone of discussions of appearance may be different.
Still, everything I see in my search logs suggests that this isn’t what people think going in, and that’s worth knowing as well. Men in power may be comparatively non-responsive to objections to evaluations on the grounds that they’re overly focused on appearance.
I’ve written much more about word2vec here. On LargeVis, see Tang et al., 2016. You may be more familiar with t-SNE, which is very similar in its outputs; while t-SNE can work on sets the size of this one, I’m doing some other work with variants of this chart that needs 15 million points, more than most t-SNE implementations can handle. I haven’t seen anyone implement zoom-on-load at arbitrary depth with hundreds of thousands of points before, but the basic design is heavily influenced by this Google map of artworks in t-SNE.
The sentiment vector is composed by extracting the principal component of a number of vectors running between pairs of evaluative words: good to bad, best to worst, and so forth. The gender vector is similar. Note that this is just based on words, not on the numeric ratings.
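As a rough sketch of that construction, the code below takes the difference vector for each word pair and finds their shared dominant direction by power iteration. It is an illustration under stated assumptions, not the code behind this site: the difference vectors are left uncentered (subtracting their mean would remove the very direction they share), and the three-dimensional vectors are toy values invented for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dominant_direction(diffs, iterations=100):
    """First principal direction of uncentered vectors via power iteration.

    Repeatedly applies v <- sum_i (d_i . v) d_i and renormalizes, which
    converges to the top eigenvector of sum_i d_i d_i^T as long as the
    starting vector is not orthogonal to it.
    """
    dim = len(diffs[0])
    v = [1.0 / math.sqrt(dim)] * dim  # fixed, deterministic starting point
    for _ in range(iterations):
        w = [0.0] * dim
        for d in diffs:
            c = dot(d, v)
            for i in range(dim):
                w[i] += c * d[i]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    # An eigenvector's sign is arbitrary; orient it along the first pair.
    if dot(v, diffs[0]) < 0:
        v = [-x for x in v]
    return v

# Toy (positive, negative) word vectors, invented for illustration.
pairs = [
    ([1.0, 0.2, 0.0], [-1.0, 0.1, 0.1]),    # good, bad
    ([1.2, -0.1, 0.3], [-0.9, 0.0, 0.2]),   # best, worst
    ([0.9, 0.3, -0.2], [-1.1, 0.2, -0.1]),  # great, terrible
]
diffs = [[p - n for p, n in zip(pos, neg)] for pos, neg in pairs]
axis = dominant_direction(diffs)
print([round(x, 4) for x in axis])
```

Each word can then be scored by its projection onto `axis`, which is less sensitive to the quirks of any single pair than a lone good-minus-bad vector.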