Acknowledgements
Institutions
People
Funders
Partners
Mapping the Republic of Letters
Data Mining with Criminal Intent/The Old Bailey Online
Bookworm Open Library
1 million books, 91 billion words
Some Bookworm instances at Northeastern and Rice:
bookworm.culturomics.org/arxiv
movies.benschmidt.org
Mentions of US Presidents in Ngrams and the Chronicling America Bookworm
Yale University: full run of Vogue Magazine
Medical Heritage Library (30,000 medical books)
US State Department
How do digital libraries expose their texts?
Bookworm Core Philosophy: infrastructures
A single backend can drive multiple representations.
Bookworm Core Philosophy: data structures
Why a grammar?
Grammar, Part I: Creating a corpus
{
"search_limits": {
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
}
}
}
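One way to read these constraints: a list names the allowed values for a field, and `$lte`/`$gte` bound a numeric field. The sketch below is an illustrative Python filter over in-memory records, not Bookworm's own implementation; the `matches` helper and the sample records are hypothetical.

```python
def matches(record, limits):
    """Return True if a metadata record satisfies every constraint in limits."""
    for field, condition in limits.items():
        value = record.get(field)
        if value is None:
            return False
        if isinstance(condition, list):
            # A list means "the value must be one of these".
            if value not in condition:
                return False
        elif isinstance(condition, dict):
            # A dict holds range operators such as $lte and $gte.
            if "$lte" in condition and not value <= condition["$lte"]:
                return False
            if "$gte" in condition and not value >= condition["$gte"]:
                return False
        else:
            raise TypeError(f"unsupported constraint for {field!r}")
    return True


search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
}

# Hypothetical records, invented for illustration only.
books = [
    {"title": "A", "publish_country": "United States", "year": 1905},
    {"title": "B", "publish_country": "United Kingdom", "year": 1905},
    {"title": "C", "publish_country": "United States", "year": 1925},
]

corpus = [book["title"] for book in books if matches(book, search_limits)]
print(corpus)
```

Only record "A" satisfies both the country and the year constraint, so it is the only one kept in the corpus.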
Grammar, Part II: Defining text groups and tokenizations
{
"search_limits": {
"word": ["need to"]
},
"database":"movies",
"aesthetic":{"x":"MovieYear","y":"WordsPerMillion"},
"plotType":"linechart"
}
{
"search_limits": {
"word": ["iPhone"],"date_year":{"$gte":2002,"$lte":2015}
},
"database":"RMP",
"aesthetic":{"x":"date_year","y":"WordsPerMillion"},
"plotType":"linechart"
}
{
"database": "RMP",
"plotType": "barchart",
"search_limits": {
"word": ["exams"]},
"aesthetic": {
"x": "WordsPerMillion",
"y": "gender"}
}
Exams and essays
{"database":"RMP",
"plotType":"pointchart",
"compare_limits":{"word":["test","tests","exam","exams","quiz","quizzes"],
"department__id":{"$lte":20}},
"search_limits":{"word":["essay","essays","paper","papers"],
"department__id":{"$lte":20}},
"aesthetic":{"x":"WordsRatio","y":"department","color":"gender"}
}
Grammar, Part II: Defining texts and tokens
{
"plotType": "heatmap",
"search_limits": {
"word": ["concentrate attention"],"year":{"$lte":1922,"$gte":1850}
},
"aesthetic":{"color":"WordsPerMillion","x":"decade","y":"BenSubject"},
"database":"presidio"
}
Time interactions.
Library composition
{"search_limits":{
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
},
"word":["natural selection"]
},
"compare_limits":{
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
}
},
"counttype":["WordsPerMillion"]
}
Statistics returnable from comparing text and token counts across two collections:
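As a rough illustration of how such statistics could be derived, the sketch below computes three of the count types named in the queries in this deck (`WordsPerMillion`, `WordsRatio`, `TextPercent`) from raw counts. The definitions are assumptions made for illustration: the search set supplies the numerator and the comparison set the denominator. The function names and the numbers are hypothetical.

```python
def words_per_million(search_word_count, compare_total_words):
    """Assumed definition: matching words per million words in the comparison set."""
    return 1_000_000 * search_word_count / compare_total_words


def words_ratio(search_word_count, compare_word_count):
    """Assumed definition: search-set word count divided by comparison-set word count."""
    return search_word_count / compare_word_count


def text_percent(search_text_count, compare_text_count):
    """Assumed definition: share of comparison-set texts that match the search, in percent."""
    return 100 * search_text_count / compare_text_count


# Invented counts, for illustration only.
wpm = words_per_million(50, 2_000_000)
ratio = words_ratio(30, 60)
percent = text_percent(5, 200)
print(wpm, ratio, percent)
```

Under these assumptions, setting `compare_limits` equal to `search_limits` minus the `word` constraint gives an ordinary relative frequency within the selected corpus.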
Dunning-characteristic words by gender in negative reviews of history professors, Ratemyprofessor.com
Grammar, Part III, alternate:
Simplest Extensions:
How Humanists use topic models badly:
Comparing the Vocabulary of two TV shows
{
"database": "movies",
"plotType": "worddiv",
"search_limits": {
"TV_show": ["Seinfeld"]
},
"compare_limits": {
"TV_show": ["Cheers"]
},
"aesthetic": {
"label": "unigram",
"size": "Dunning"
}}
Replacing "words" with "topics"
{
"database": "movies",
"plotType": "worddiv",
"search_limits": {
"TV_show": ["Seinfeld"]
},
"compare_limits": {
"TV_show": ["Cheers"]
},
"aesthetic": {
"label": "topic_label",
"size": "Dunning"
}}
Comparing in topic vocabulary:
{"database":"movies","plotType":"worddiv",
"search_limits":{
"TV_show":["The Wire"],
"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
"compare_limits":{
"TV_show":["Deadwood"],"topic_label":["fucking shit fuck Fuck Shit ass bitch"]},
"aesthetic":{"label":"unigram","size":"Dunning"}
}
An interactive tool for exploring changing topic composition
Text Encoding Initiative produces micro-texts
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount"
}}
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount",
"time": "publish_year"}}
Newspaper flu coverage, 1917-1919
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {"word":["flu","influenza","pneumonia"],
"publish_year":{"$lte":1920,"$gte":1918}},
"smoothingSpan":25,
"aesthetic": {
"time":"publish_day",
"point": "placeOfPublication_geo",
"size": "TextPercent"}}
Code: github.com/bookworm-project
Docs: http://bookworm-project.github.io/Docs
Plugins: github.com/benmschmidt/Bookworm-Mallet
Project Description: bookworm.culturomics.org
Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm
Ben Schmidt: benschmidt.org
These slides: benschmidt.org/slides/2015-02-25_Bookworm