Reflecting on what we’ve read and what you know from past experience, but without doing any Internet searches, write down four printed texts that you think may have escaped the dragnet of digitization.1 At least two should be books; I’ll be happy if all four are, but you can also choose magazine articles, broadsides, newspaper editions, and so forth.
Check (figuratively and literally) whether your four texts exist in digitized or analogue (physical) forms in the following places. Put, let’s say, “A” for analogue, “D” for digital, and “N” for neither.
Library | Source 1 | Source 2 | Source 3 | Source 4 |
---|---|---|---|---|
Northeastern | ||||
Boston Public | ||||
Any library in eastern MA | ||||
Any library in WorldCat | ||||
HathiTrust | ||||
Google Books | ||||
Internet Archive | ||||
A domain-appropriate engine | ||||
One thing to keep in mind here is that the Internet is absolutely the worst way to search for books that haven’t been digitized.
Instead, you may do well to find a room of books somewhere that seems likely not to have been scanned. If you’re totally stumped, ask around about what other people are finding. There are a lot of different strategies for this.
In my house, for example, I’ve turned up the following books that don’t seem to have been digitized (I haven’t done a totally exhaustive search yet):
–
Take responsibility for one of the texts identified in step 3. You have first dibs on your own. But: identify something not in copyright. No academic books from the last 30 years, say.
Scan it to an image format. If it’s a whole book, you can do just a chapter or two, or split up with someone else. This step will be much easier if you work with others. (Only two people should need to go to the BPL, or Tufts, or wherever.) I’m happy for any three people to share a single book.
Perform OCR on the image. You can use Adobe Acrobat, Bill Turkel’s instructions, or some instructions for OS X from a previous time this class was offered. The easiest way for short sources like this is to use Google Docs, which allows you to upload a PDF and get OCR’ed text back.
Export the OCR’ed text as a .txt file; post it to your blog with a brief bibliographic preamble. (This doesn’t count towards the blog post quota; it’s just a place to store it.)
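If you would rather script the preamble than type it by hand, something like the following would do. This is only a sketch: the field names (title, author, year, source) and the function names `with_preamble` and `save_txt` are my own assumptions, not a required format, and the values in the comment are placeholders.

```python
from pathlib import Path


def with_preamble(ocr_text: str, title: str, author: str, year: str, source: str) -> str:
    """Return the OCR'ed text with a short bibliographic preamble on top."""
    preamble = "\n".join([
        f"Title: {title}",
        f"Author: {author}",
        f"Year: {year}",
        f"Scanned from: {source}",  # e.g. which library or shelf the copy came from
        "-" * 40,                   # visual divider between preamble and text
    ])
    # Strip stray whitespace around the OCR output and end with one newline.
    return preamble + "\n\n" + ocr_text.strip() + "\n"


def save_txt(text: str, path: str) -> None:
    """Write the text as UTF-8 so accented characters survive the export."""
    Path(path).write_text(text, encoding="utf-8")


# Placeholder usage; swap in your own text and details:
# save_txt(with_preamble(raw_ocr, "Example Title", "Example Author", "1900", "my shelf"), "source1.txt")
```

Writing plain UTF-8 keeps the file easy to paste into a blog post and easy to clean up later.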
Clean it up until you’ve fixed 50 mistakes in the main text or the whole document is spotless, whichever comes first. If you want to think about regular expressions, as Bill Turkel does, as a way to programmatically fix problems, you could instead compile a list of some general changes.
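A list of general changes could look something like this in Python. Treat it as a rough sketch, not Bill Turkel’s method: the four rules are hypothetical examples of common OCR slips, and you would replace them with whatever errors actually turn up in your own text. The function also counts substitutions, which is handy for the 50-mistake target.

```python
import re

# Hypothetical rules: (pattern, replacement). Replace with errors you actually see.
RULES = [
    # Rejoin words split across a line break. Caution: this also merges
    # genuinely hyphenated compounds that happen to break at the hyphen.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # "the" misread as "tbe".
    (re.compile(r"\btbe\b"), "the"),
    # A digit 1 sandwiched between lowercase letters is probably an "l".
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
    # Collapse runs of spaces or tabs into a single space.
    (re.compile(r"[ \t]{2,}"), " "),
]


def clean(text: str) -> tuple[str, int]:
    """Apply each rule in order; return the cleaned text and the number of fixes."""
    total = 0
    for pattern, replacement in RULES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total


cleaned, fixes = clean("tbe digi-\ntization of o1d  books")
print(cleaned)  # the digitization of old books
print(fixes)    # 4
```

Rules this blunt will occasionally "fix" things that were right, so it is worth reading the result against the page image before trusting it.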
By printed texts, I mean texts affixed to paper by a printing press of some sort. By “escaped the dragnet,” I mean: physical copies exist in the real world, but there aren’t copies in “the cyber,” whether downloadable or not. If they are online but not accessible, they still count as digitized. I think I’m also OK with typewritten things that are mimeographed, but don’t go nuts with these.