For the second unit as well, I want to try a kind of group workset. First, we’ll divide into three or four groups of four or so around a specific corpus and problem; then we’ll talk about it.

I also want to emphasize again: this will not necessarily be easy, and much may be completely impossible if you don’t come to office hours! Please do! Don’t wait for your group to meet before you come see me!

The goal here is threefold:

  1. Leverage teamwork so you can experience what it’s like to undertake a text-analysis project, seeing where the major pitfalls and questions are without each person needing to do all the work that might entail on their own.
  2. Engage in a discussion of what some interesting questions might be from a particular corpus.
  3. Start trying out some forms of group collaboration before the project proposals are due in early November.

Instead of e-mailing the workset over the next week, we’ll take forty minutes at the start of class on October 30th (in two weeks) to go over each of the three groups. So this will be a sort of grouped sprint[1] towards a preliminary mini-works-in-progress conference.

The deliverables at the end will be:

  1. Someone should be prepared to talk for 10 minutes, probably with slides. Ten minutes won’t be long enough to describe everything you’ve done, so be succinct!
  2. About 100-150 words on paper describing who did what.

Possible Corpora

Each group should choose a different corpus.

Possible corpora include:

  1. Wordcount data from JSTOR Data for Research
  2. Wordcount data from some subset of the HathiTrust
  3. Texts from the Internet Archive
  4. Newspaper articles from the 19th century United States
  5. Some smaller but fairly comprehensive corpus, like the one Mullen and Funk use.
    • For example, I’ve done some work with the State of the Union Addresses.
    • There are Hansard collections of parliamentary speeches in the UK or Australia over two hundred years.
    • Or build off of some of the documents scanned last week. You could try to generalize about OCR quality, say, using n-gram counts.
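
To make the OCR-quality idea in the last bullet a bit more concrete, here is a minimal Python sketch. It assumes one rough proxy for quality: the share of a page's tokens that appear in a reference word list. The function names, the tiny vocabulary, and the two sample "pages" are all hypothetical and only for illustration; a real project would load the scanned texts and a full dictionary, and might prefer a different measure entirely.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n=1):
    """Count every run of n consecutive tokens, stored as tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def share_in_vocabulary(tokens, vocabulary):
    """Fraction of tokens found in a reference word list (a rough OCR-quality proxy)."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocabulary) / len(tokens)


# Hypothetical toy data; not drawn from any of the corpora listed above.
vocabulary = {"the", "state", "of", "union", "is", "strong"}
clean_page = "The state of the union is strong"
noisy_page = "Tbe statc of the unlon is strong"

for name, page in [("clean", clean_page), ("noisy", noisy_page)]:
    tokens = tokenize(page)
    print(name, round(share_in_vocabulary(tokens, vocabulary), 2))
    print(ngram_counts(tokens, n=2).most_common(3))
```

A lower share for a page suggests more OCR errors, though it will also penalize proper names and archaic spellings, which is exactly the kind of pitfall worth discussing in your presentation.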

Possible methods

Each group should choose a different method unless you’ve got a great justification.

Sample projects:


You’ll have to figure out how to divide stuff up for this, which is why the set is not due for two weeks.

But some potential roles are:

[1] In real life, the only actual grouped sprint in the world is the three-legged race. Ideally you’ll look like the 1-mile relay team handing off perfectly, but don’t worry if this ends up with your whole group face down in the mud, as long as you can explain how.