You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Is catalog information really metadata?

Sep 05 2011

We've been working on making a different type of browser using the Open Library books I've been working with to date, and it's raised an interesting question I want to think through here.

I think many people looking at word counts on a large scale right now (myself included) have tended to make a distinction between wordcount data on the one hand, and catalog metadata on the other. (I know I have the phrase "catalog metadata" burned into my reflex vocabulary at this point; I've had to edit it out of this very post several times.) The idea is that we're looking at the history of words or phrases, and the information from library catalogs can help to split or supplement that. So for example, my big concern about the ngrams viewer when it came out was that it included only one form of metadata (publication year) to supplement the word-count data, when it should really have titles, subjects, and so on. But that still assumes that word data/catalog metadata is a useful binary.

I'm starting to think that it could instead be a fairly pernicious misunderstanding.

The argument for this is that words aren't the base unit of measure at all. What we really care about are texts (which aren't necessarily books, but it doesn't really hurt to think of them that way). Thanks to librarians, we have a number of pieces of information about each book: where it was written, how old its author was, etc. And thanks to computers, we can store thousands of new pieces of information about books that relate to their vocabulary: how many times it uses the word "science", how many times it uses any form of the word "evolution", how many words it has overall.
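To make that concrete, here is a minimal sketch of a single book record in which librarian-supplied fields and computed fields sit side by side. Every field name and value is a hypothetical illustration, not taken from Open Library or any real catalog.

```python
from dataclasses import dataclass


@dataclass
class Book:
    # All fields and values below are made up for illustration.
    book_id: str
    pub_year: int         # supplied by a librarian
    author_age: int       # supplied by a librarian
    total_words: int      # computed from the text
    count_science: int    # computed from the text
    count_evolution: int  # computed from the text


book = Book("book-001", 1875, 42, 80_000, 16, 4)

# Share of this book's words that are "science": just one more
# variable derived from the same record.
science_share = book.count_science / book.total_words
```

Nothing in the record marks one group of fields as more fundamental than the other; they differ only in who produced them.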

All these pieces of information are variables in the same data set. You can call them metadata or data, depending on what you think a book is, but it's misleading to call one of them data and the other metadata. Pretending that word counts are data and the rest are metadata could promote at least four significant mistakes:

  1. We end up getting too hung up on word percentages as meaningful in themselves. But a word percentage has a very obscure meaning outside of the book it's in. Treating each year as a long text that has its own percentages for each word is problematic: it might make more sense, for example, to take the average for all books in that year, or some characterization of a beta distribution for the spread, or something else. (I have a post on this from last month in the hopper.) All the people on Twitter who searched for "love" vs. "war" as if it meant something profound about human nature are only the most obvious example of this problem. The jump from a yearly word percentage to "fame", for example, is very problematic, because it doesn't take any of the other metadata into account besides year. To build up a plausible proxy for fame, you'd need either some measure of which books are more popular, which we don't have (although see my third point, below), or at least some sense of how to weight different subject categories against each other, since some subjects breed books like rabbits and others don't.

When we talk about the history of words in history, we're usually using them as imperfect proxies for the history of concepts. And when we talk about the history of concepts, we're really talking about the history of groups of people: what they believed, how that changed, who influenced whom. Word percentages tell us meaningful things about these questions, for the most part, only as far as we can define the groups of people we're interested in, and that's what catalog information is good for. Eliding the actual books from analysis is a bad thing.
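The difference between treating a year as one long text and averaging over its books is easy to see with toy numbers. This is only a sketch: the three books and their counts are invented, and a plain mean stands in for whatever summary (a beta distribution, say) would actually be appropriate.

```python
from statistics import mean

# Hypothetical (count_of_word, total_words) pairs for books from one year.
books = [
    (50, 1_000),     # short book that uses the word heavily
    (10, 100_000),   # long book that barely uses it
    (0, 50_000),     # book that never uses it
]

# Year as one long text: sum the counts first, then divide.
pooled = sum(count for count, _ in books) / sum(total for _, total in books)

# Each book as its own observation: divide first, then average.
per_book = mean(count / total for count, total in books)
```

With these made-up numbers the per-book average is far larger than the pooled figure, because the one short, word-heavy book counts as much as either long one. Neither number is "the" percentage for the year; which one you want depends on what you think the books stand for.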

  2. We keep ourselves from seeing all the ways variables can interact with each other. In some cases, word counts are just another form of subject categorization to go alongside LC subject headings and BISAC codes. Just as I want to look at how often "Lincoln" is used in books that are published in 1875, I might want to look at how often "Grant" is used in books that mention "Lincoln" a lot. Thinking of word counts as a different kind of data can blind us to how well they supplement our existing catalog data.

In the opposite direction, downplaying metadata can also keep us from seeing ways catalog data and word count data can interplay with each other. It took me too long, for example, to figure out how to make author age, publication year, and count data interact with each other; mostly that has to do with dimensionality, but I think it was also because I was stuck seeing the metadata filtering stage as something that needed to be completed before looking at the word count information, rather than seeing the word counts as being just another element in the metadata filtering.
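Here is a rough sketch of what it means for word counts to be just another element in the filtering. The books, the field names, and the 0.001 threshold for "mentions Lincoln a lot" are all hypothetical; the point is only that the same function accepts a catalog condition or a vocabulary condition.

```python
# Hypothetical records mixing catalog fields and word counts.
books = [
    {"book_id": "a", "pub_year": 1875, "total_words": 90_000, "lincoln": 120, "grant": 45},
    {"book_id": "b", "pub_year": 1875, "total_words": 60_000, "lincoln": 0, "grant": 3},
    {"book_id": "c", "pub_year": 1880, "total_words": 100_000, "lincoln": 200, "grant": 60},
]


def rate(book, word):
    """Share of a single book's words that are `word`."""
    return book[word] / book["total_words"]


def word_rate(books, word, keep):
    """Pooled rate of `word` across the books for which keep(book) is true."""
    selected = [b for b in books if keep(b)]
    total = sum(b["total_words"] for b in selected)
    return sum(b[word] for b in selected) / total if total else 0.0


# Filter on a catalog field: "Lincoln" in books published in 1875.
lincoln_1875 = word_rate(books, "lincoln", lambda b: b["pub_year"] == 1875)

# Filter on a word-count field: "Grant" in books that mention "Lincoln" a lot.
grant_in_lincoln_books = word_rate(books, "grant", lambda b: rate(b, "lincoln") > 0.001)
```

Because `keep` is an arbitrary condition on the whole record, it can just as easily combine both kinds of field at once, for example publication year together with a Lincoln threshold.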

  3. We downplay the importance of creating catalog information. This is a more minor point, but still gets at something. Wordcount data's usefulness is limited by how much other data is included in the same series. If we treated it as just another form of catalog information when releasing it, it would always be released tied to some kind of unique book identifier. The more additional data we have about books (what language they're in, what percentage of their words are in a foreign language, etc.), the more useful they are. But if we treat word counts as important public resources but other catalog fields as something we wait for library institutions to create, we'll have less stuff to work with. Since most people still are on the fence about whether even wordcount information is a useful public resource, though, we're a ways from having to worry about this problem. Still, I think we should all be more excited about the possibilities of creating and sharing other forms of derived catalog data from texts.