Acknowledgements
Institutions
People
Funders
Mapping the Republic of Letters
Data Mining with Criminal Intent/The Old Bailey Online
ICOADS Deck 701, US Maury Collection (1789-c.1865) ICOADS
ICOADS Deck 735, Russian Research Vessel (R/V) Digitization
Bookworm Open Library
1 million books, 91 billion words
Some Bookworm instances at Northeastern and Rice:
bookworm.culturomics.org/arxiv
movies.benschmidt.org
Mentions of US Presidents in Ngrams and the Chronicling America Bookworm
Yale University: full run of Vogue Magazine
Medical Heritage Library (30,000 medical books)
US State Department
How do digital libraries expose their texts?
Bookworm Core Philosophy--infrastructures
A single backend can drive multiple representations.
Bookworm Core Philosophy--data structures
Why a grammar?
Grammar, Part I: Creating a corpus
{
"search_limits": {
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
}
}
}
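As a rough illustration of what `search_limits` expresses, the sketch below applies the same limits to an in-memory list of metadata records. The `matches` helper and the sample records are hypothetical; only the field names and the `$lte`/`$gte` operators come from the query above, and this is not Bookworm's own implementation.

```python
# Hypothetical sketch: evaluate Bookworm-style search_limits against
# plain Python dicts. This is an illustration, not Bookworm's code.

def matches(record, limits):
    """Return True if a metadata record satisfies every limit."""
    for field, condition in limits.items():
        value = record.get(field)
        if isinstance(condition, list):
            # A list means "the field equals any one of these values".
            if value not in condition:
                return False
        elif isinstance(condition, dict):
            # A dict holds range operators such as $lte and $gte.
            if value is None:
                return False
            if "$lte" in condition and not value <= condition["$lte"]:
                return False
            if "$gte" in condition and not value >= condition["$gte"]:
                return False
    return True


search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
}

# Invented sample records, for illustration only.
books = [
    {"title": "A", "publish_country": "United States", "year": 1905},
    {"title": "B", "publish_country": "United States", "year": 1931},
    {"title": "C", "publish_country": "France", "year": 1900},
]

# The corpus is every record that passes all limits.
corpus = [b["title"] for b in books if matches(b, search_limits)]
```

Only record "A" satisfies both the country and the year limits, so the corpus here contains a single title.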
Grammar, Part II: Defining text groups and tokenizations
{
"search_limits": {
"word": ["need to"]
},
"database":"movies",
"aesthetic":{"x":"MovieYear","y":"WordsPerMillion"},
"plotType":"linechart"
}
{
"database": "RMP",
"plotType": "barchart",
"search_limits": {
"word": ["exams"]},
"aesthetic": {
"x": "WordsPerMillion",
"y": "gender"}
}
Exams and essays
{"database":"RMP",
"plotType":"pointchart",
"compare_limits":{"word":["test","tests","exam","exams","quiz","quizzes"],
"department__id":{"$lte":20}},
"search_limits":{"word":["essay","essays","paper","papers"],
"department__id":{"$lte":20}},
"aesthetic":{"x":"WordsRatio","y":"department","color":"gender"}
}
A new interactive site: Student assessments and gender prejudices
Grammar, Part II: Defining texts and tokens
{
"plotType": "heatmap",
"search_limits": {
"word": ["concentrate attention"],"year":{"$lte":1922,"$gte":1850}
},
"aesthetic":{"color":"WordsPerMillion","x":"decade","y":"BenSubject"},
"database":"presidio"
}
Grammar, Part II: Defining texts and tokens
Time interactions.
Library composition
{"search_limits":{
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
},
"word":["natural selection"]
},
"compare_limits":{
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
}
},
"counttype":["WordsPerMillion"]
}
Statistics returnable from comparing text and token counts across two collections:
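The slides do not spell these statistics out, so the following is a minimal sketch of how three of the count types named in the queries (`WordsPerMillion`, `TextPercent`, `WordsRatio`) could be computed from raw counts. The formulas are assumptions inferred from the names, and the function names and numbers are invented for illustration.

```python
# Assumed definitions, inferred from the count-type names in the queries.

def words_per_million(word_count, total_words):
    """Occurrences of the search words per million words in the group."""
    return 1_000_000 * word_count / total_words


def text_percent(texts_with_word, total_texts):
    """Share of texts in the group that contain the search words."""
    return 100 * texts_with_word / total_texts


def words_ratio(search_count, compare_count):
    """Occurrences matching search_limits over those matching compare_limits."""
    return search_count / compare_count


# Invented counts, for illustration only.
wpm = words_per_million(250, 5_000_000)
pct = text_percent(30, 1200)
ratio = words_ratio(40, 10)
```

Under these assumptions, a ratio above 1 means the `search_limits` words are more common in the group than the `compare_limits` words.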
Dunning-characteristic words by gender in negative reviews of history professors, Ratemyprofessor.com
Differing language in two TV shows
{"database": "movies",
"plotType": "wordcloud",
"search_limits": {
"MovieYear":[2009],"medium":"TV show"
},
"compare_limits": {
"MovieYear":[2009],"medium":"movie"
},
"aesthetic": {
"label": "unigram",
"size": "Dunning"
}}
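The `Dunning` size in the query above refers to Dunning's log-likelihood statistic. A minimal sketch of the common two-term form used in corpus comparison follows; the slides do not state whether Bookworm computes it exactly this way, and the function name, the sign convention, and the counts are assumptions.

```python
import math


def dunning(a, b, c, d):
    """Signed Dunning log-likelihood (G2) for a single word.

    a: occurrences in the search set      c: total words in the search set
    b: occurrences in the compare set     d: total words in the compare set
    Assumes c and d are positive.
    """
    # Expected counts if the word were equally frequent in both sets.
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Assumed convention: positive when the word is over-represented
    # in the search set, negative when it favors the compare set.
    return g2 if a / c >= b / d else -g2


# Invented counts: a word seen 30 times in one set and 10 times in
# another, each of 1,000 words.
score = dunning(30, 10, 1000, 1000)
```

A word with the same relative frequency in both sets scores zero; the further its frequency diverges, the larger the magnitude, which is what makes the statistic usable as a word-cloud size.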
Grammar, Part III, alternate:
Simplest Extensions:
How Humanists use topic models badly:
Differing language in two TV shows
{
"database": "movies",
"plotType": "wordcloud",
"search_limits": {
"TV_show": ["Seinfeld"]
},
"compare_limits": {
"TV_show": ["Cheers"]
},
"aesthetic": {
"label": "topic_label",
"size": "Dunning"
}}
Differing language in two TV shows
Comparing in-topic vocabulary:
{"database":"movies","plotType":"wordcloud",
"search_limits":{
"TV_show":["The Wire"],
"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
"compare_limits":{
"TV_show":["Deadwood"],"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
"aesthetic":{"label":"unigram","size":"Dunning"}
}
An interactive tool for exploring changing topic composition
Text Encoding Initiative produces micro-texts
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount"
}}
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount",
"time": "publish_year"}}
Newspaper flu coverage, 1917-1919
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {"word":["flu","influenza","pneumonia"],
"publish_year":{"$lte":1920,"$gte":1918}},
"smoothingSpan":25,
"aesthetic": {
"time":"publish_day",
"point": "placeOfPublication_geo",
"size": "TextPercent"}}
{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
"topic": {
"$gte": 10
}
},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "WordCount",
"time": "topic",
"label": "topic_label"}}
{"database": "boston",
"plotType": "map",
"smoothingSpan": 30,
"method": "return_json",
"search_limits": {
"word": ["Dot Ave","Mass Ave"],
"requested_month": {"$gt": 100}},
"aesthetic": {
"point": "location",
"size": "WordCount",
"label": "service_name"},
"counttype": ["WordCount"],
"groups": ["location","service_name"]}
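The query above lists `counttype` and `groups` explicitly, and both repeat variables already named in `aesthetic`. The sketch below shows one way such lists could be derived from the aesthetic mapping. The `split_aesthetic` helper and the `COUNT_TYPES` set (collected from the count types that appear in these slides) are hypothetical, not part of Bookworm's API.

```python
# Count types that appear in the example queries; assumed, not exhaustive.
COUNT_TYPES = {
    "WordCount", "TextCount", "WordsPerMillion",
    "TextPercent", "WordsRatio", "Dunning",
}


def split_aesthetic(aesthetic):
    """Split aesthetic variables into (counttype, groups) lists.

    Variables naming a count type go to counttype; everything else is
    treated as a metadata field to group by. Order of first use is kept.
    """
    counttype, groups = [], []
    for variable in aesthetic.values():
        target = counttype if variable in COUNT_TYPES else groups
        if variable not in target:
            target.append(variable)
    return counttype, groups


aesthetic = {
    "point": "location",
    "size": "WordCount",
    "label": "service_name",
}
counttype, groups = split_aesthetic(aesthetic)
```

For this aesthetic the helper reproduces the explicit lists in the query: `WordCount` as the only count type, grouped by `location` and `service_name`.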
Code: github.com/bookworm-project
Docs: http://bookworm-project.github.io/Docs
Plugins: github.com/benmschmidt/Bookworm-Mallet
Project Description: bookworm.culturomics.org
Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm
Ben Schmidt: benschmidt.org
These slides: benschmidt.org/slides/2014-11-12_Bookworm