I’ve recently been getting pretty far into the weeds about what the future of data programming is going to look like. I use pandas and dplyr in Python and R respectively. But I’m starting to see the shape of something interesting coming down the pike. I’ve been working on a project that involves scatterplot visualizations at massive scale–up to 1 billion points sent to the browser. In doing this, three things have become clear:
- Computers have gotten much, much faster in the last couple decades
- Our languages for data analysis have failed to keep up.
- Things will be weird, but also maybe good?
I tweeted about it once, after I had experimented with binary, serialized alternatives to JSON.
I’m writing about Python and R because they’re completely dominant in the space of data programming. (By data programming, I mean basically ‘data science’; not being a scientist, I have trouble using that term to describe what I do.) Some dinosaurs in economics still use Stata, and some wizards use Julia, but if you want to work with data, that’s basically it. The big problem is that the programming languages we use to work with data run largely on CPUs, and often predominantly on a single core. This has always been an issue in terms of speed; when I first switched to Python around 2011, I searched furiously for ways around the GIL (global interpreter lock) that keeps the language from using multiple cores even across threads. Things have gotten a little better on some fronts–in general, it seems that at least linear algebra routines can make use of a computer’s full resources.
JS/HTML is the low-level UI language for Python and R
I find myself dropping ggplotly into lessons about ggplot for students who start wondering about tooltips on mouseover. ggplot and matplotlib are still kings of the roost for publication-ready plots, but after becoming accustomed to interactive, responsive charts on the web, we are coming to expect exploratory charts to do the same thing; just as select menus and buttons from HTML fill this role in notebook interfaces, JS charting libraries do the same for chart interfaces.
The GPU-laptop interface is an open question
Let me be clear–something I’ll say in the following section is certainly wrong. I’m not fully an expert in what I’m about to say. I don’t know who is! There are some analogies to web cartography, where I’ve learned a lot from Vladimir Agafonkin. Many of the tools I’m thinking about I learned about in a set of communications with Doug Duhaime and David McClure. But the field is unstable enough that I think others may stumble in the same direction I have.
This whole period, GPUs have also been displacing CPUs for computation. The R/Python interfaces to these are tricky. Numba kind of works; I’ve fiddled with gnumpy from time to time; and I’ve never intentionally used a GPU in R, although it’s possible I did without knowing it. The path of least resistance to GPU computation in Python and R is often to use Tensorflow or Torch even for purposes that don’t really need a neural network library–so I find myself, for example, training UMAP models using the neural network implementation rather than the CPU one, even though I’d prefer the latter.
Most of these rely on CUDA to access GPUs. (When I said I don’t know what I’m talking about–this is the core of it.) If you want to do programming on these platforms, you increasingly boot up a cloud server and run heavy-duty models there. CUDA configuration is a pain, and the odds are decent your home machine doesn’t have a GPU anyway. If you want to run everything in the cloud, this is fine–Google just gives away TPUs for free. But for doing a group-by/apply/summarize on a few million rows, this is overkill; and while cloud compute is pretty cheap compared to your home laptop, cloud storage is crazy expensive. Digital Ocean charges me like a hundred dollars a year just to keep up the database backing RateMyProfessor; for the work I do on several terabytes of data from the HathiTrust, I’d be lost without a university cluster and the 12TB hard drive on my desk at home.
But I want these operations to run faster.
Iteration constructs like for ... of ... all work much like you’d expect (unlike the days when I spent a couple hours hunting out a rarely occurring bug in one data visualization that turned out to strike whenever I made visualizations of wordcounts that included the word constructor somewhere in the vocabulary); and many syntactic features like classes, array destructuring, and arrow function notation are far more pleasant than their Python equivalents. (Full disclosure–even after a decade in the language, I still find Python’s whitespace syntax gimmicky and at heart just don’t like the language. But that’s a post for another day.)
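That bug is easy to reproduce. A plain JS object used as a word-count dictionary inherits properties like constructor from Object.prototype, so a membership check succeeds before the word has ever been seen; a Map has no inherited keys. This is a minimal reconstruction of the failure mode, not my original visualization code:

```javascript
// Minimal reconstruction of the bug: a plain object used as a
// word-count dictionary. counts["constructor"] is inherited from
// Object.prototype (it's a function, hence truthy), so the "already
// seen" branch fires the first time "constructor" appears and the
// count is corrupted into a string.
function countWordsBuggy(words) {
  const counts = {};
  for (const word of words) {
    if (counts[word]) {
      counts[word] += 1; // function + 1 => string concatenation
    } else {
      counts[word] = 1;
    }
  }
  return counts;
}

// The fix: a Map starts genuinely empty, so "constructor" behaves
// like any other word.
function countWords(words) {
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

const words = ["constructor", "data", "constructor"];
console.log(typeof countWordsBuggy(words)["constructor"]); // "string" — garbage
console.log(countWords(words).get("constructor")); // 2
```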
In online cartography, protobuffer-based vector tiles do something similar in libraries like deck.gl. The overhead of JSON-based formats for working with cartographic data is hard to stomach once you’ve seen how fast, and how much more compressed, binary data can be.
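To get a back-of-the-envelope sense of that overhead, here’s a sketch comparing JSON text against a packed typed array for float coordinates. The numbers are illustrative only–real vector tiles add delta and varint encoding on top of this–but the basic asymmetry holds:

```javascript
// Rough comparison: JSON text versus a packed binary buffer for
// coordinate data. Illustrative only, not a benchmark of any
// particular tile format.
const n = 100_000;
const coords = new Float32Array(n);
for (let i = 0; i < n; i++) {
  coords[i] = Math.fround(Math.random() * 360 - 180); // lon-like values
}

// JSON: every float becomes a decimal string plus delimiters,
// and must be re-parsed on arrival.
const asJson = JSON.stringify(Array.from(coords));
const jsonBytes = new TextEncoder().encode(asJson).length;

// Binary: exactly 4 bytes per float, and the browser can hand the
// buffer straight to the GPU without any parsing at all.
const binaryBytes = coords.byteLength; // n * 4

console.log(jsonBytes / binaryBytes); // several times larger as JSON
```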
WebGL is hell on rollerskates
In working with WebGL, I’ve seen just how fast it can be. For things like array smoothing, counting of points to apply complicated numeric filters, and group-by sums, it’s possible to start applying most of the elements of the relational algebra on data frames in a fully parallelized form.
But I’ve held back from doing so in any but the most ad-hoc situations, because WebGL is also terrible for data computing. I would never tell anyone to learn it right now unless they absolutely needed to. Attribute buffers can only be floats, so you need to convert all integer types before uploading. In many situations data may be downsized to half-precision floats, and double-precision floats are so difficult that entire rickety structures have been built, at great cost, to support them. Support for texture types varies across devices (Apple ones seem to pose special problems), so people I’ve learned from, like Ricky Reusser, go to great lengths to support various fallbacks. And things that are essential for data programming, like indexed lookup into lists or for loops across a passed array, are nearly impossible. I’ve found writing complex shaders in WebGL fun, but doing so always involves abusing the intentions of the system.
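The float-only attribute buffers bite in a very concrete way: a 32-bit float represents integers exactly only up to 2^24, so row ids above about 16.7 million silently collide after upload. You can see this without any GPU at all, since Math.fround rounds to float32 precision exactly the way an attribute upload does:

```javascript
// Why float-only attribute buffers hurt: Math.fround rounds a number
// to 32-bit float precision, which is what happens when an integer
// column gets uploaded as a WebGL attribute.
const MAX_EXACT_INT_F32 = 2 ** 24; // 16,777,216

console.log(Math.fround(MAX_EXACT_INT_F32) === MAX_EXACT_INT_F32); // true
console.log(Math.fround(MAX_EXACT_INT_F32 + 1)); // 16777216 — the +1 is lost

// So two distinct row ids above 2^24 collide after upload:
const a = 16_777_217;
const b = 16_777_216;
console.log(Math.fround(a) === Math.fround(b)); // true: distinct ids, same float
```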
WebGPU and wasm might change all that
The last domino is a little way off, but could be titanically important. WebGL is slowly dying, but the big tech companies have all gotten together to create WebGPU as the next-generation standard for talking to GPUs from the browser. It builds on top of the existing platform-specific GPU interfaces like Vulkan and Metal, about which I have rigorously resisted learning anything.
WebGPU will replace WebGL for fast in-browser graphics. But the capability to do heavy-duty computation in WebGL is so tantalizing that some lunatics have already begun to do it. The stuff that goes on in Reusser’s work is amazing; check out this notebook about “multiscale Turing patterns” that creates gorgeous images halfway between organic blobs and nineteenth-century endplates.
I haven’t read the draft WebGPU spec carefully, but it will certainly allow a more robust way to handle things. There is already at least one linear algebra (BLAS) library for WebGPU out there. I can only imagine that support for more data types will make many simple group-by-filter-apply functions plausible entirely in GPU-land on any computer that can browse the web.
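For readers who haven’t met BLAS: at its simplest, such a library provides matrix multiplication, C = A × B. A WebGPU implementation fans the inner products out across thousands of GPU threads; this naive single-threaded JS loop is just a sketch of the contract such a library fulfills, not anyone’s actual implementation:

```javascript
// The simplest BLAS-style contract: C = A * B for flat row-major
// matrices. A is n×m, B is m×k, C is n×k. On a GPU, each (i, j)
// inner product runs on its own thread; here it's sequential,
// purely for illustration.
function matmul(a, b, n, m, k) {
  const c = new Float32Array(n * k);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < k; j++) {
      let sum = 0;
      for (let p = 0; p < m; p++) {
        sum += a[i * m + p] * b[p * k + j];
      }
      c[i * k + j] = sum;
    }
  }
  return c;
}

// 2×2 example: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
const product = matmul([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2);
console.log(Array.from(product)); // [19, 22, 43, 50]
```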
When I started in R back in 2004, I spent hours tinkering with SQL backing for what seemed at the time like an enormous dataset: millions of rows giving decades of data about student majors by race, college, gender, and ethnicity. I’d start a Windows desktop cranking out charts before I left the office at night, and come back to work the next morning to folders of images. Now, it’s feasible to send an only-slightly-condensed summary of 2.5 million rows for in-browser work and the whole dataset could easily fit in GPU memory. In general, the distinction between generally available GPU memory (say, 0.5 - 4GB) and RAM (2-16GB) is not so massive that we won’t be sending lots of data there. Data analysis and shaping is generally extremely parallelizable.
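What makes it so parallelizable is that aggregations like group-by/sum decompose: each chunk of rows can be tallied independently (on a core, a worker, or a GPU thread group), and the partial tallies merge associatively. Here’s a single-threaded sketch of that pattern–the shape of the computation, not any particular library’s API:

```javascript
// Why group-by/sum parallelizes so well: chunks aggregate
// independently and the partial tallies merge associatively.
// Single-threaded here, purely to show the structure.
function groupBySum(rows) {
  const totals = new Map();
  for (const { key, value } of rows) {
    totals.set(key, (totals.get(key) ?? 0) + value);
  }
  return totals;
}

function mergeTotals(a, b) {
  const merged = new Map(a);
  for (const [key, value] of b) {
    merged.set(key, (merged.get(key) ?? 0) + value);
  }
  return merged;
}

// Chunked "parallel" plan: aggregate halves separately, then merge.
const rows = [
  { key: "cs", value: 10 },
  { key: "history", value: 5 },
  { key: "cs", value: 7 },
  { key: "history", value: 3 },
];
const partial1 = groupBySum(rows.slice(0, 2));
const partial2 = groupBySum(rows.slice(2));
const merged = mergeTotals(partial1, partial2);
console.log(merged.get("cs")); // 17
console.log(merged.get("history")); // 8
```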
JS and WebGPU will stick together
Once these things start getting fast, the insane overhead of parsing CSV and JSON, and the loss of strict type definitions that comes with them, will be far more onerous. Something–I’d bet on parquet, but there are other possibilities involving Arrow, HDF5, ORC, protobuffer, or something else–will emerge as a more standard binary interchange format.
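The type-definition problem is not just an annoyance. JSON has exactly one number type, float64, so a 64-bit integer id above 2^53 is silently corrupted in transit; a binary column with a declared int64 type (the kind of thing Arrow or parquet gives you) round-trips exactly. A sketch using plain typed arrays as a stand-in for a real columnar format:

```javascript
// JSON's single number type (float64) versus an explicitly typed
// binary column. The id below exceeds 2^53, the largest range in
// which float64 represents every integer.
const bigId = 9007199254740993n; // 2^53 + 1

// JSON round-trip: the value quietly becomes 2^53.
const viaJson = JSON.parse(JSON.stringify(Number(bigId)));
console.log(viaJson === 9007199254740992); // true — off by one, silently

// Binary round-trip with an explicit int64 column type:
const column = new BigInt64Array([bigId]);
const bytes = new Uint8Array(column.buffer); // what would go over the wire
const decoded = new BigInt64Array(bytes.buffer);
console.log(decoded[0] === bigId); // true — exact
```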
Why bother with R and Python?
So–this is the theory–the data programming languages R and Python are going to come to rely on all this. Just as they wrap Altair, and wrap HTML click elements, you’ll start finding more and more that the package/module that seems to just work, and quickly–the one the 19-year-olds gravitate towards–runs on the JSVM. There will be strange Stack Overflow questions in which people realize that they have an updated version of V8 installed which needs to be downgraded for some particular package. There will be Python programs that work everywhere but mysteriously fail on some low-priced laptops using a Chinese startup’s GPU. And there will be things that almost entirely avoid the GPU because they’re so damned complicated to implement that the Rust ninjas don’t do the full implementation, and which–compared to the speed we see from everything else–come to be unbearable bottlenecks. (From what I’ve seen, Unicode regular expressions and non-spherical map projections seem to be likely candidates here.)
But I’ve already started sharing elementary data exercises for classes using observablehq, which provides a far more coherent approach to notebook programming than Jupyter or RStudio. (If you haven’t tried it–among many, many other things, it parses the dependency relations between cells in a notebook, evaluates them topologically, and so avoids the incessant state errors that infect expert and–especially–novice programming in Jupyter or RStudio.) And if you want to work with data rather than write code, it is almost as refreshing as the moment in computer history it tries to recapitulate: the shift from storing business data in COBOL to running it in spreadsheets. The tweet above that forms the germ of this rant has just a single, solitary like on it; but it’s from Mike Bostock, the creator of D3 and co-founder of Observable, and that alone is part of the reason I bothered to write this whole thing up. The Apache Arrow platform I keep rhapsodizing about is led by Wes McKinney, the creator of pandas, who views it as the germ of a faster, better