Google Ngrams
Bookworm Project
Institutions
People
HathiTrust grant partners
Acknowledgements
Funders
Partners
Some Bookworm instances at Northeastern and Rice:
benschmidt.org/OL
bookworm.culturomics.org/arxiv
movies.benschmidt.org
Yale University: full run of Vogue Magazine
Medical Heritage Library (80,000 medical books and an increasing number of journals)
http://mhlbookworm.ngrok.io/
Columbia History Lab
http://www.history-lab.org/about-collections
US State Department
Uses of "bicycle", by week, in newspapers
{
"database": "ChronAm",
"plotType": "heatmap",
"search_limits": {
"publish_year": {
"$gte": 1860
},
"word": ["bicycle"]
},
"aesthetic": {
"x": "publish_year",
"y": "publish_week_year",
"color": "WordsPerMillion"
}
}
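A query like the one above is plain JSON, so it can be built and serialized from a script. The sketch below assumes the server accepts the JSON as a single URL parameter; the endpoint address and the parameter name `query` are placeholders, not taken from the slides, and should be swapped for whatever a given Bookworm installation expects.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint: replace with the address of a real Bookworm server.
BOOKWORM_ENDPOINT = "https://bookworm.example.org/api"


def build_query_url(query, endpoint=BOOKWORM_ENDPOINT, param="query"):
    """Serialize a query dict to JSON and attach it as one URL parameter.

    The parameter name is an assumption made for illustration.
    """
    return endpoint + "?" + urlencode({param: json.dumps(query)})


# The bicycle heatmap query from the slide, written as a Python dict.
bicycle_query = {
    "database": "ChronAm",
    "plotType": "heatmap",
    "search_limits": {
        "publish_year": {"$gte": 1860},
        "word": ["bicycle"],
    },
    "aesthetic": {
        "x": "publish_year",
        "y": "publish_week_year",
        "color": "WordsPerMillion",
    },
}

url = build_query_url(bicycle_query)
```

Keeping the query as a dict makes it easy to change one limit (for example the start year) and regenerate the URL.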
Publication places over time of all the public domain volumes in HathiTrust
{
"database": "hathipd",
"plotType": "map",
"method": "return_json",
"search_limits": {
"date_year": {
"$gte": 1800,
"$lte": 1922
}
},
"projection": "albers",
"aesthetic": {
"time": "date_year",
"point": "publication_place_geo",
"size": "TextCount"
}
}
Newspaper flu coverage, 1917-1919, by day
{
"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {
"word": ["flu", "influenza", "pneumonia"],
"publish_year": {
"$gte": 1918,
"$lte": 1920
}
},
"smoothingSpan": 25,
"aesthetic": {
"time": "publish_day",
"point": "placeOfPublication_geo",
"size": "TextPercent"
}
}
benschmidt.org/profGender
Why embed?
There are too many words.
What's the best way to reduce dimensionality?
Reasons not to use the best methods.
Stable random projection (SRP) Process
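The slide names the process without listing its steps. As a rough sketch of the general idea only: each word gets a fixed vector of +1/-1 signs derived from a hash, so the same word maps to the same vector on any machine without a shared vocabulary, and a document is the weighted sum of its words' vectors. The choice of SHA-1, the 160-dimension default, and the log weighting are all assumptions made here for illustration, not details confirmed by the slides.

```python
import hashlib
import math
from collections import Counter


def word_signs(word, dims=160):
    """Derive a fixed vector of +1.0/-1.0 values for a word from SHA-1 bits.

    Extra digests are computed with a numeric suffix when dims > 160.
    """
    signs = []
    counter = 0
    while len(signs) < dims:
        digest = hashlib.sha1(f"{word}_{counter}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1.0 if (byte >> bit) & 1 else -1.0)
        counter += 1
    return signs[:dims]


def srp_embed(tokens, dims=160):
    """Project a token list into dims dimensions.

    Each distinct word contributes its sign vector scaled by log(1 + count);
    the weighting is an assumption of this sketch.
    """
    vec = [0.0] * dims
    for word, n in Counter(tokens).items():
        weight = math.log(1 + n)
        for i, sign in enumerate(word_signs(word, dims)):
            vec[i] += weight * sign
    return vec


doc_vector = srp_embed("the cat sat on the mat".split())
```

Because the signs come from a hash rather than a fitted model, nothing has to be trained or stored, which is the "stable" part of the name.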
Why classify automatically?
The Library of Congress Classification
Success by language
{
"database": "hathipd",
"plotType": "barchart",
"method": "return_json",
"search_limits": {
"languages__id": {"$lte": 10},
"LCC_guess_is_correct": ["True"]
},
"compare_limits": {
"languages__id": {"$lte": 10 },
"LCC_guess_is_correct": ["False", "True"]
},
"aesthetic": {
"x": "TextPercent",
"y": "languages"
}
}
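The query above pairs `search_limits` (correct guesses) with `compare_limits` (all guesses), so the plotted value reads as a success rate per language. The sketch below shows that arithmetic on a handful of invented records; the data and the function name are made up for illustration, and the reading of `TextPercent` as "texts matching the search limits as a percentage of texts matching the compare limits" is my understanding rather than something the slide states.

```python
from collections import Counter

# Invented toy records, for illustration only: (language, guess_is_correct)
toy_records = [
    ("eng", True), ("eng", True), ("eng", True), ("eng", False),
    ("ger", True), ("ger", False),
]


def text_percent(records):
    """Per-language share of texts whose classification guess was correct."""
    matched = Counter(lang for lang, correct in records if correct)
    total = Counter(lang for lang, _ in records)
    return {lang: 100.0 * matched[lang] / total[lang] for lang in total}


rates = text_percent(toy_records)
```

Swapping the numerator to the incorrect guesses gives the error rate instead, which is what the next query does by subclass.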
Error by subclass
{
"database": "hathipd",
"plotType": "barchart",
"method": "return_json",
"search_limits": {
"LCC_guess_is_correct": ["False"]
},
"compare_limits": {
"LCC_guess_is_correct": ["False", "True"]
},
"aesthetic": {
"x": "TextPercent",
"y": "lc_classes"
}
}
Classifying dates
Representing time as a ratchet:
To encode 1985 in the range 1980-1990: [0,0,0,0,0,1,1,1,1,1]
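A minimal sketch of the ratchet encoding, assuming one slot per year from the start of the range up to but not including the end, with a slot set to 1 once its year is at or past the target year. That slot convention is an assumption chosen to reproduce the vector on the slide; the function names are illustrative.

```python
def ratchet_encode(year, start, end):
    """Encode a year as a step vector over the years start..end-1.

    Slots before the target year are 0; the target year and later are 1.
    """
    return [1 if slot_year >= year else 0 for slot_year in range(start, end)]


def ratchet_decode(vector, start):
    """Recover the year from a clean step vector by counting the zeros."""
    return start + len(vector) - sum(vector)


encoded = ratchet_encode(1985, 1980, 1990)
decoded = ratchet_decode(encoded, 1980)
```

Unlike a one-hot encoding, a near miss here disagrees with the truth in only a few slots, so a classifier is penalized in proportion to how far off its date is.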
Classification of Lucretius.
Blue line is classifier probability for each year. Red vertical line is actual date. Outer bands are 90% confidence.
Next steps