Benjamin Schmidt
Assistant Professor of History, Northeastern University; Core Faculty, NuLab for Texts, Maps, and Networks
Northeastern University/Rice University Cultural Observatory
Erez Lieberman Aiden
Neva Cherniavsky, Martin Camacho, Matt Nicklay, Billy Janitsch, JB Michel.
Bill Lane Center for the American West: http://www.stanford.edu/group/ruralwest/cgi-bin/drupal/visualizations/us_newspapers
Textual Metadata: Correspondence Networks
Mapping the Republic of Letters
Textual Metadata: Trial Lengths
Data Mining with Criminal Intent/The Old Bailey Online
ICOADS Deck 701, US Maury Collection (1789-c.1865) ICOADS
ICOADS Deck 735, Russian Research Vessel (R/V) Digitization
books.google.com/ngrams
Bookworm Open Library
1 million books, 91 billion words
Some Bookworm instances at Northeastern and Rice:
Yale University: full run of Vogue Magazine
http://bookworm.library.yale.edu/collections/vogue/
Medical Heritage Library (30,000 medical books)
US State Department
http://sandbox.htrc.illinois.edu/bookworm/
{"search_limits": {
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
},
"word": ["focus attention"]
},"compare_limits":{
"publish_country":["United States"],
"year": {
"$lte": 1920,
"$gte": 1890
}
}}
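The query above pairs a "search_limits" block with a "compare_limits" block that repeats the same metadata constraints minus the "word" key. A minimal Python sketch of that relationship follows. The helper names (derive_compare_limits, words_per_million) are hypothetical and not part of Bookworm, and the per-million arithmetic is an assumption based on the name of the WordsPerMillion count type, not a statement of how the server computes it.

```python
import copy

# Keys that select words rather than texts (an assumption for this sketch).
WORD_KEYS = ("word",)


def derive_compare_limits(search_limits):
    """Return a copy of search_limits with the word constraint dropped."""
    compare = copy.deepcopy(search_limits)
    for key in WORD_KEYS:
        compare.pop(key, None)
    return compare


def words_per_million(matching_words, total_words):
    """Scale a raw match count to a rate per million words of the comparison set."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1_000_000 * matching_words / total_words


search_limits = {
    "publish_country": ["United States"],
    "year": {"$lte": 1920, "$gte": 1890},
    "word": ["focus attention"],
}
compare_limits = derive_compare_limits(search_limits)
```

Here compare_limits ends up with the same country and year constraints as the slide, so the denominator is every word in US books from 1890 to 1920 and the numerator is only the occurrences of the searched phrase.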
{"search_limits": {
"word": ["need to"]},
"database":"movies",
"aesthetic":{"x":"MovieYear","y":"WordsPerMillion"},
"plotType":"linechart"
}
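The line-chart query above names its fields through "aesthetic", while other queries in these slides spell out "counttype" and "groups" directly. One plausible reading, sketched below in Python, is that a plotting front end splits the aesthetic fields into counts and grouping variables before sending the query. Everything here is an assumption for illustration: the function names, the COUNT_TYPES set (taken only from field names that appear in these slides), the "query" URL parameter, and the placeholder endpoint are hypothetical, and the sketch builds a URL without contacting any server.

```python
import json
from urllib.parse import urlencode

# Count-type names that appear in these slides (assumed to be the full set here).
COUNT_TYPES = {"WordsPerMillion", "WordCount", "TextCount", "TextPercent"}

# Placeholder endpoint: an assumption, not a documented address.
ENDPOINT = "http://localhost/cgi-bin/dbbindings.py"


def to_api_query(plot_query):
    """Split aesthetic fields into counttype and groups (hypothetical translation)."""
    api_query = {
        key: value
        for key, value in plot_query.items()
        if key not in ("aesthetic", "plotType")
    }
    fields = list(plot_query.get("aesthetic", {}).values())
    api_query["counttype"] = [f for f in fields if f in COUNT_TYPES]
    api_query["groups"] = [f for f in fields if f not in COUNT_TYPES]
    api_query["method"] = "return_json"
    return api_query


def build_url(api_query, endpoint=ENDPOINT):
    """Serialize the query as JSON and URL-encode it into one parameter."""
    return endpoint + "?" + urlencode({"query": json.dumps(api_query)})


plot_query = {
    "search_limits": {"word": ["need to"]},
    "database": "movies",
    "aesthetic": {"x": "MovieYear", "y": "WordsPerMillion"},
    "plotType": "linechart",
}
api_query = to_api_query(plot_query)
url = build_url(api_query)
```

Under these assumptions the chart's x field becomes the single grouping variable and its y field becomes the single count type, which matches the shape of the heatmap query that lists "counttype" and "groups" explicitly.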
{
"plotType": "heatmap",
"search_limits": {
"word": ["concentrate attention"]
},
"counttype": ["WordsPerMillion"],
"groups": ["decade", "subject"]}
Bicycles
Library composition
Statistics returnable from comparing text and token counts across two collections:
Dunning-characteristic words by gender in reviews of history professors, RateMyProfessors.com
Dunning-characteristic words by gender in negative reviews of history professors, RateMyProfessors.com
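"Dunning-characteristic" refers to ranking words by Dunning's log-likelihood statistic, which compares a word's count in one collection against its count in another, given the size of each. A minimal Python sketch follows, using the common two-term form of the statistic with a sign added to show which collection favors the word. This is not taken from the Bookworm code base, the function names are hypothetical, and the word counts are invented toy data.

```python
import math
from collections import Counter


def dunning(a, c, b, d):
    """Signed log-likelihood for a word seen a times in c tokens and b times in d tokens."""
    if a + b == 0:
        return 0.0
    # Expected counts if the word were equally frequent in both collections.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    g2 *= 2
    # Positive when the first collection uses the word at a higher rate.
    return g2 if a / c >= b / d else -g2


def characteristic_words(counts_1, counts_2, top=3):
    """Words most characteristic of counts_1 relative to counts_2 (both Counters)."""
    c = sum(counts_1.values())
    d = sum(counts_2.values())
    vocabulary = set(counts_1) | set(counts_2)
    scores = {w: dunning(counts_1[w], c, counts_2[w], d) for w in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Invented toy counts, for illustration only.
reviews_a = Counter({"the": 50, "funny": 10, "boring": 1})
reviews_b = Counter({"the": 50, "funny": 1, "boring": 10})
most_a = characteristic_words(reviews_a, reviews_b, top=1)
```

A word used at the same rate in both collections scores zero, so very common words drop out of the ranking unless their rates actually differ.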
How humanists use topic models badly:
Differing language in two TV shows
{"database": "movies",
"plotType": "wordcloud",
"search_limits": {
"TV_show": ["Seinfeld"]
},
"compare_limits": {
"TV_show": ["Cheers"]
},
"aesthetic": {
"label": "unigram",
"size": "Dunning"
}}
Topics as tokens
{
"database": "movies",
"plotType": "wordcloud",
"search_limits": {
"TV_show": ["Seinfeld"]
},
"compare_limits": {
"TV_show": ["Cheers"]
},
"aesthetic": {
"label": "unigram",
"size": "Dunning"
}}
Deadwood and The Wire, Topics
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount"
}}
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "TextCount",
"time": "publish_year"}}
Newspaper flu coverage, 1917-1919
{"database": "ChronAm",
"plotType": "map",
"method": "return_json",
"search_limits": {"word":["flu","influenza"],
"publish_year":{"$lte":1920,"$gte":1917}},
"smoothingSpan":25,
"aesthetic": {
"time":"publish_day",
"point": "placeOfPublication_geo",
"size": "TextPercent"}}
{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
"chunk": [6]
},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "WordCount",
"time": "date_year"}}
Viral Texts: Ryan Cordell and David Smith, Northeastern University
{"database": "viral",
"plotType": "map",
"method": "return_json",
"search_limits": {
"topic": {
"$gte": 10
}
},
"aesthetic": {
"point": "placeOfPublication_geo",
"size": "WordCount",
"time": "topic",
"label": "topic_label"}}
Code: github.com/bookworm-project
Docs: http://bookworm-project.github.io/Docs
Plugins: github.com/benmschmidt/Bookworm-Mallet
Project Description: bookworm.culturomics.org
Hathi Browser: http://sandbox.htrc.illinois.edu/bookworm
Ben Schmidt: benschmidt.org
These slides: benschmidt.org/slides/Dartmouth