Hathidy

Ben Schmidt

2020-03-12

Slides:

benschmidt.org/slides/hathidy

Package:

https://github.com/HumanitiesDataAnalysis/hathidy

The heart of the ‘tidyverse’

Extracted Features

General vision: word count data should be able to meet scholars where they are.

  • Languages: Sometimes R, sometimes python, sometimes javascript.
  • Data formats: word count tables, term-document matrices, etc.
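
For instance, under this vision the same counts should move freely between formats. A minimal sketch with invented documents and counts, using tidytext's cast_sparse() to turn a tidy word-count table into a term-document matrix:

    library(dplyr)
    library(tidytext)

    # Invented word-count table: one row per (document, token) pair.
    counts <- tibble(
      document = c("doc1", "doc1", "doc2", "doc2"),
      token    = c("whale", "sea", "whale", "ship"),
      count    = c(3L, 1L, 2L, 5L)
    )

    # The same data recast as a sparse term-document matrix, for
    # packages that expect matrix input.
    tdm <- counts %>%
      cast_sparse(token, document, count)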

Teaching Humanities Data Analysis

Teaching with HTRC features

  • “Working with Data,” graduate digital humanities class.
  • Syllabus-being-dismantled: benschmidt.org/WWD20
  • Textbook-in-progress: benschmidt.org/HDA

Principle: counting, joining, and modeling are transferable skills that can be used on any data set.

So students need their own data sets. Thus: 🐘
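
In practice, the same few verbs recur across every project. A toy illustration of counting and joining with dplyr (all names invented):

    library(dplyr)

    # Two toy tables standing in for any student's own data set.
    words <- tibble(
      book  = c("b1", "b1", "b2"),
      token = c("sea", "whale", "sea")
    )
    metadata <- tibble(
      book = c("b1", "b2"),
      year = c(1851, 1922)
    )

    # Counting and joining work the same here as on any other table.
    words %>%
      count(book, token) %>%
      inner_join(metadata, by = "book")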

Rsync is a good way to get everything, but it is intimidating for ordinary users.

HTTP endpoints for files have made cross-platform access much easier.
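
That makes one-off downloads scriptable in a line or two. A sketch in R; the URL below is a placeholder, since the real endpoint and path scheme depend on the features release you are using:

    # Placeholder URL; consult the HTRC documentation for the real
    # endpoint and file-naming scheme.
    url  <- "https://example.org/features/uma.ark+=13960=t0dv2df4m.json.bz2"
    dest <- "uma.ark+=13960=t0dv2df4m.json.bz2"

    # A single HTTP GET replaces a full rsync of the corpus.
    download.file(url, dest, mode = "wb")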

HTRC Feature Reader (python)

  • pandas integration
  • Enables complicated transformations (casing, chunking, etc.) inside the package.
  • Significant performance optimizations
  • Built for iterating across thousands of texts

Hathidy (R)

  • tidyverse integration
  • Uses the established universe of R packages (tidyverse, tidytext, etc.) for all actual analysis.
  • Built for teaching, safe for research.
  • Focused on fast access to token data.
  • Pushes users towards particular storage formats.
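
A sketch of the intended workflow, assuming hathidy's accessor is hathi_counts() and that it returns a page-level tibble with token and count columns (check the package README for the current interface):

    library(dplyr)
    library(hathidy)

    # Fetch (or load from the local cache) page-level counts for one volume.
    counts <- hathi_counts("uma.ark:/13960/t0dv2df4m")

    # All actual analysis happens in ordinary tidyverse code.
    counts %>%
      group_by(token) %>%
      summarize(n = sum(count)) %>%
      arrange(desc(n))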

Core principles for working with extracted features.

  1. You should only ever refer to Hathi books in code by HTID, not files;
  2. You should keep a local copy of every feature you’ve looked at;
  3. Because it is much faster to load than the raw JSON, page-level token data should be cached in csv or parquet.
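
A hypothetical sketch of principles 2 and 3 together, caching by HTID with the arrow package (download_counts() is a stand-in for whatever actually fetches the features):

    library(arrow)

    cached_counts <- function(htid, cache_dir = "features") {
      # The file name is derived from the HTID; the user never names files.
      path <- file.path(cache_dir, paste0(gsub("[:/.]", "+", htid), ".parquet"))
      if (file.exists(path)) {
        return(read_parquet(path))
      }
      counts <- download_counts(htid)  # stand-in for the real downloader
      dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
      write_parquet(counts, path)
      counts
    }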

Fast access means fast to code and fast to load.

Uniform model; you must access by HTID. The user is only dimly aware there are files involved.

But there are, and they are cached on disk.

Currently using pairtree and csv, but that is likely to change to flat paths and parquet.
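
For context, pairtree derives a nested directory from the cleaned HTID. A simplified sketch (the substitution rules follow the pairtree spec; HTRC's exact file naming may differ):

    pairtree_path <- function(htid) {
      # Split "uma.ark:/13960/t0dv2df4m" into library code and volume id.
      dot <- regexpr(".", htid, fixed = TRUE)
      lib <- substr(htid, 1, dot - 1)
      id  <- substr(htid, dot + 1, nchar(htid))
      # Pairtree-clean the id: ":" -> "+", "/" -> "=", "." -> ","
      clean  <- chartr(":/.", "+=,", id)
      # Split the cleaned id into two-character directory segments.
      starts <- seq(1, nchar(clean), by = 2)
      chunks <- substring(clean, starts, pmin(starts + 1, nchar(clean)))
      file.path(lib, "pairtree_root", paste(chunks, collapse = "/"), clean)
    }

    pairtree_path("uma.ark:/13960/t0dv2df4m")
    # "uma/pairtree_root/ar/k+/=1/39/60/=t/0d/v2/df/4m/ark+=13960=t0dv2df4m"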

Example HTID: uma.ark:/13960/t0dv2df4m