Billion-point scatterplots

Billions of points

The European Space Agency has just released version three of the Gaia mission’s data about the location of stars in the sky. With many different parameters defined for about 1.8 billion stars, this is the largest dataset with x and y positions for points I know of. And so it’s a great opportunity to show off some of the features of the deepscatter library I’ve written for exploring large collections.

Edit Code
max_points: 100000
point_size: 3
alpha: 7.25
source_url: "https://files.benschmidt.org/tiles/gaia"
background_color: "#221133"
zoom:
  bbox: {"x":[-2,2],"y":[-1, 1]}
encoding:
  position: literal
  jitter_radius:
    constant: 1
    method: spiral
  jitter_speed:
    constant: .01
  color: 
    field: bp_rp
    range: rdbu
    domain: [-5, 5]

We’ll start off by looking at the 50,000 stars that appear brightest from Earth. Here they are flattened out into the shape of the sky using the Hammer projection.

50,000 points is a lot. It’s more than enough to count; and it’s enough to start to strain many traditional ways of building charts.
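For readers who have not met the Hammer projection, here is a minimal sketch of the standard formula in Python. The function names are hypothetical, and the rescaling by 1/sqrt(2) is an assumption: it happens to give the [-2, 2] by [-1, 1] extent used in the bbox settings on this page, but the actual preprocessing is not shown here.

```python
import math


def hammer(lon_deg, lat_deg):
    """Standard Hammer equal-area projection of a longitude/latitude pair.

    Longitude is wrapped into [-180, 180) so the map is centred on 0.
    Returns (x, y) with x in [-2*sqrt(2), 2*sqrt(2)], y in [-sqrt(2), sqrt(2)].
    """
    lon = math.radians((lon_deg + 180.0) % 360.0 - 180.0)
    lat = math.radians(lat_deg)
    # The denominator is always >= 1 because cos(lon / 2) >= 0 after wrapping.
    denom = math.sqrt(1.0 + math.cos(lat) * math.cos(lon / 2.0))
    x = 2.0 * math.sqrt(2.0) * math.cos(lat) * math.sin(lon / 2.0) / denom
    y = math.sqrt(2.0) * math.sin(lat) / denom
    return x, y


def hammer_unit(lon_deg, lat_deg):
    """Same projection divided by sqrt(2): x spans [-2, 2], y spans [-1, 1].

    This rescaling is an assumption made to match the bbox on this page.
    """
    x, y = hammer(lon_deg, lat_deg)
    return x / math.sqrt(2.0), y / math.sqrt(2.0)
```

The centre of the map lands at (0, 0), the poles at y = ±1, and the left edge of the ellipse at x = -2.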

Edit Code
max_points: 50e3
alpha: 20
point_size: 4
duration: 12000
encoding:
  position: literal
  jitter_radius:
    method: null
zoom:
  bbox: {"x":[-2,2],"y":[-1, 1]}

But 50,000 is not enough to see the structure of something like the Milky Way. Even showing the 100,000 brightest stars, as here, barely makes the outline visible.

Edit Code
duration: 5000
point_size: 3
max_points: 100e3

Only at half a million points can you really start to see the real structure here–the middle band of the Milky Way (here represented in [l, b] coordinates oriented along the galactic plane).

At this point you may be seeing squares of data flashing into the display. I’ll give some info at the end about the data presentation strategy here. Suffice it for now to say that each star is truly represented as a data point, not just an image, and that I’m using WebGL shaders that allow fairly comfortable rendering of millions of points at once on most modern machines.

This opens up interesting possibilities for exploring all sorts of datasets, including this one.

Edit Code
max_points: 5e5
alpha: 8.25
point_size: 2
encoding:
  size: 2
  x:
    field: x
    transform: literal
  y:
    field: y
    transform: literal

We can really push the envelope here. Since I don’t know if you’re using mobile data, I’ll leave it to you to decide if you want to play with the sliders below that load up to 3 million points onto your screen–about 300MB of data, which compresses down to about 200MB that we actually have to send. (Most datasets include text or categorical data, and so compress much better than this one does.)

But be warned for the rest of this essay–we’re going to load more data as we go, so you might want to bookmark this and come back later if you don’t want to clobber your mobile data limits.

At this point, I’m shipping about 60 MB of data over the wire in a couple hundred files.

Edit Code
encoding:
  position: literal
  color: 
    field: bp_rp
    range: rdbu
    domain: [-5, 5]

This isn’t an image, just a serial drawing of points to the screen. The basic appearance takes three parameters:

Number of points
Global opacity: how dark or light the screen ought to be
Point size: how many pixels at default zoom for each individual star

Because we’re plotting these as actual data points, any elements of the aesthetics can be configured on the fly.

Here, for example, I start you off with the points colored using D3’s ‘blues’ scales according to their magnitude seen from Earth.

But changing the API call means that each of these points can be displayed according to a different scheme.

Edit Code
duration: 500

Color

blues
viridis
magma
rainbow
oranges
purples
reds
cool
warm
plasma
turbo

Likewise magnitude can be encoded as size, so that brighter stars are larger.

Circle size for brightest stars.

0.1
0.5
1
2
5

And we can also change the variables encoded by that color.

Apparent brightness
Absolute brightness
X position
Y position

But Gaia includes more than just brightness information. An especially important part of the project’s data is that it creates parallax estimates for most of the stars it has observed, which measure how much each star’s apparent position shifts when viewed from opposite sides of the Earth’s orbit.

High parallax means that a star appears to move a lot and is close; low parallax means that it barely moves and is far away.
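The link between parallax and distance is a one-line formula. Here is a minimal sketch, assuming the parallax column is in milliarcseconds (the unit Gaia publishes) and using hypothetical function names. Inverting a parallax directly like this is only a rough estimate for noisy measurements, which can even come out zero or negative.

```python
LIGHT_YEARS_PER_PARSEC = 3.2616


def parallax_to_parsecs(parallax_mas):
    """Naive distance in parsecs from a parallax in milliarcseconds.

    Returns None for zero or negative parallaxes, which carry no usable
    distance under this simple inversion.
    """
    if parallax_mas <= 0:
        return None
    return 1000.0 / parallax_mas


def parallax_to_light_years(parallax_mas):
    """Same estimate converted to light years."""
    parsecs = parallax_to_parsecs(parallax_mas)
    return None if parsecs is None else parsecs * LIGHT_YEARS_PER_PARSEC
```

A star with a parallax of 50 comes out at 20 parsecs, or about 65 light years; a parallax of 1 puts it 1,000 parsecs away.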

Scaling back to 500K points, I’ll change the color encoding to represent parallax angles.

Edit Code
max_points: 5e5
encoding:
  alpha: .5
  filter: {}
  color:
    field: parallax
    domain: [0, 10]
    range: viridis

Parallax is actually motion in the sky; so we can represent this somewhat more naturally–if more confusingly, because it doesn’t match the visual vocabulary we’re used to from the printed page–as a circular jitter.

Now each star is rotating around its central point with a distance proportional to how much it actually moves in the sky as the Earth orbits the Sun. (I don’t actually try to trace the path it would take in the projected space here–that takes a bit more trigonometry than I’d like to throw in here.)

You can dynamically filter the points to high- or low-parallax values using the slider below.

Filter to parallax: 5
Exaggeration of parallax: 0
Edit Code
jitter: circle
encoding:
  jitter_speed: .01
  color:
    field: parallax
    domain: [0, 10]
    range: viridis
  jitter_radius:
    method: circle
    field: parallax
    domain: [0, 100]
    range: [0, .05]
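To make the jitter_radius settings above concrete, here is a rough Python sketch of a circular jitter: a linear scale maps parallax from the domain [0, 100] to a radius in [0, .05], and each point is then offset along a circle of that radius. This is not deepscatter's shader code. The clamping, the meaning of the speed value (taken here as revolutions per time unit), and the function names are all assumptions.

```python
import math


def linear_scale(value, domain, output_range):
    """Map value linearly from domain to output_range, clamped at both ends."""
    d0, d1 = domain
    r0, r1 = output_range
    t = (value - d0) / (d1 - d0)
    t = min(1.0, max(0.0, t))  # clamping is an assumption, not confirmed
    return r0 + t * (r1 - r0)


def circular_jitter(x, y, parallax, time, speed=0.01, phase=0.0):
    """Return the displayed position of a point at a given time.

    The point circles its true position (x, y) at a radius set by parallax.
    """
    radius = linear_scale(parallax, (0.0, 100.0), (0.0, 0.05))
    angle = phase + 2.0 * math.pi * speed * time
    return x + radius * math.cos(angle), y + radius * math.sin(angle)
```

In practice each point would probably get its own phase so the stars do not all swing in lockstep; the sketch leaves that as a parameter.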

But I promised you a billion points. While Arrow and WebGL allow us to comfortably display tens of millions of points in the browser, schlepping gigabytes of data directly to your browser is unreasonable. Deepscatter waits until you want to zoom into a region to load the individual points on demand using a customized quadtree implementation.

Some parts of the Gaia set are outside the Milky Way proper; here is the Large Magellanic Cloud, where all of the stars are too far off (roughly 160,000 light years) to see individually with the naked eye.

Edit Code
duration: 10000
jitter: null
point_size: 1
encoding:
  filter: {}
  jitter_radius: 0
  color:
    field: bp_rp
    range: rdbu
    domain: [5, -5]

zoom:
  bbox:
    {"x":[1.03,1.13],"y":[0.33,0.38]}

In some areas, there are huge numbers of stars visible in tiny parts of the sky. In this portion of the Magellanic Cloud lurk some stars numbered 1.7 billion or higher in the set.

Edit Code
duration: 10000
max_points: 5e5
encoding:
  alpha: 1
  size: 1
  color:
    field: bp_rp
    range: rdbu
    domain: [-5, 5]
zoom:
  bbox: {"x":[1.129,1.1295],"y":[0.368,0.3683]}

As we zoom back out, you can see just how tiny this portion of the sky is.

The portion of the sky we’re going to is 3x larger than the moon: the area covered by the Andromeda Galaxy. Since its individual stars are not visible, you may not be able to see anything in it at first…

Edit Code
duration: 20000
max_points: 1e6
zoom:
  bbox: {"x":[-1.5780744616098814,-1.5450523115250894],"y":[0.3505699535775513,0.3721224679810988]}

… but deepscatter allows you to change the target number of points displayed. With the scale bumped up to 3,000,000 points, the rings of the Andromeda galaxy become clearly visible.

Edit Code
duration: 2000
max_points: 3e6

The Triangulum Galaxy (M33) is relatively near Andromeda, both in the sky and in the Local Group; its definition is even clearer.

Edit Code
duration: 9000
zoom:
  bbox: 
    {"x":[-1.378083191427632,-1.3514810172803868],"y":[0.5654862401010377,0.582848638975793]}

Within our own galaxy are areas of intense stellar concentration as well. Here is Omega Centauri, a globular cluster with 10 million stars in a ball only 150 light-years wide–stars average only 0.1 light years apart.

These points may be overplotted–if so, you may want to adjust the zoom scaling parameter, which [I explained more here].

Edit Code
duration: 12000
max_points: 2e5
zoom:
  bbox: 
    {"x":[0.8319706197989606,0.8394665796325281],"y":[-0.14647132479462066,-0.14157894856028852]}

Here’s the full scale: Change the zoom balance and scroll back a panel if you wish.

Edit Code
encoding:
  position: literal
zoom:
  bbox: {"x":[-2, 2],"y":[-1, 1]}

Passing individual data also means that Deepscatter is able to arbitrarily change positions just like any other aesthetic.

Here’s a fairly straightforward plot: the x axis shows the parallax of a star with a log transform, and the y axis shows the apparent magnitude from Earth. The color is the absolute magnitude.

The diagonal lines here show stars of a fixed absolute magnitude: the farther away they are (x axis), the dimmer they look from Earth (y axis).

Edit Code
max_points: 1e6
point_size: 2
duration: 2500
encoding:
  filter: {}
  y:
     field: phot_g_mean_mag
     transform: linear
     domain: [15, 2]
     range: [-3, 1]
  x:
     field: parallax
     domain: [300, .001]
     range: [-1, 1]
     transform: log
  color:
    field: abs_mag
    range: magma
    domain: [5, -5]

This particular relationship is an identity in this dataset–the absolute magnitudes are calculated directly from the parallax and g-band magnitude. If you filter by absolute magnitude, you get straight lines.
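The identity in question is the standard distance modulus. Here is a minimal sketch, assuming parallax in milliarcseconds and ignoring dust extinction; the exact calculation behind the abs_mag column is not shown on this page, so treat the function names and details as hypothetical.

```python
import math


def absolute_magnitude(phot_g_mean_mag, parallax_mas):
    """Absolute magnitude M = m + 5 * log10(parallax_mas) - 10.

    Equivalent to M = m - 5 * log10(d / 10 pc) with d = 1000 / parallax_mas.
    Returns None when the parallax is not positive.
    """
    if parallax_mas <= 0:
        return None
    return phot_g_mean_mag + 5.0 * math.log10(parallax_mas) - 10.0


def apparent_magnitude(abs_mag, parallax_mas):
    """Inverse relation: the apparent magnitude of a star of fixed abs_mag.

    It is linear in log10(parallax), which is why filtering to one absolute
    magnitude traces a straight line on a log-parallax axis.
    """
    return abs_mag - 5.0 * math.log10(parallax_mas) + 10.0
```

A star at exactly 10 parsecs (a parallax of 100) has the same apparent and absolute magnitude, and every tenfold drop in parallax makes a star of fixed absolute magnitude look 5 magnitudes fainter.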

Show stars with an absolute magnitude around: 0

But other two-dimensional representations from Gaia can show actual relationships. A plot common in astronomy is the Hertzsprung-Russell diagram, which shows absolute magnitude plotted against color. The primary line visible here runs from upper-left to lower-right, and shows main sequence stars, which follow standard relationships of mass, age, color, and luminosity; the points off the line to the right are giants which can have substantially larger magnitudes and burn farther in the red spectrum.

(I think? I’m no astronomer.)

Edit Code
max_points: 1e6
point_size: 2
duration: 2500
encoding:
  filter: {}
  y:
     field: abs_mag
     domain: [-10, 15]
     range: [-1, 1]
     transform: linear
  x:
     field: bp_rp
     domain: [-5, 5]
     range: [-1, 1]
     transform: linear
  color:
    field: bp_rp
    range: rdbu
    domain: [5, -5]

This is an interesting case of data shaping because so many of the stars appear to be giants; but that’s only because intrinsically bright stars are by definition easier to see when they’re far away.

If we filter to just stars with a parallax over fifty, only the nearest stars remain.

Edit Code
alpha: 100
point_size: 20
encoding:
  filter:
    field: parallax
    op: gt
    a: 50
Show stars w/ parallax above: 30

Technical Details

Enough about stars, about which I know little, and time for some talk about data and tiling.

Formats

Lots of pages have a long loading bar when you open them up. That’s fine if you can push all the data at the start; but with more than a few million points, we need to push more or less instantaneously or else we’ll get blocking lag every time we add more data. The easiest cycle–which I’ve used before–is to send things as CSV, parse them into JavaScript numbers, and then draw from those with canvas. Converting those into single-precision floats before sending to the GPU makes rendering fast once they’re on the GPU, but it does nothing for the period up until then.

Luckily, a number of very talented people have been thinking hard about the future of data serialization in an age of GPU computation. I’m building off of what I find the most attractive of these projects by using the Apache Arrow format to send data to the browser. Arrow uses a standardized binary representation that imports directly into typed arrays in JavaScript. If you store columns as 4-byte floats, this means that you can push the data straight to the GPU without having to parse it even once. Apache Arrow files can be gzipped before sending; the browser expands them straight into float32 typed arrays, which can be written straight to contiguous blocks of pre-allocated buffers on the GPU. This means that although this site uses JavaScript in many ways, the big core memory objects are essentially being written on the server using fast C++ code called from Python, and then mapped byte-for-byte to the GPU without having to actually manipulate them as native structures in either Python or JavaScript.
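The difference between the two routes can be shown with nothing but the Python standard library. This sketch is not Arrow itself; it only illustrates the principle that a gzipped column of fixed-width 4-byte floats can be reinterpreted in place, while a text format has to be parsed value by value. The sample values are made up.

```python
import gzip
from array import array

values = [0.5, -1.25, 3.0, 2.75]  # made-up sample column

# Text route: every number is serialized as characters and parsed back.
csv_payload = ",".join(str(v) for v in values).encode("ascii")
parsed = [float(token) for token in csv_payload.decode("ascii").split(",")]

# Binary route: the column is a contiguous block of 4-byte floats.
column = array("f", values)
wire = gzip.compress(column.tobytes())  # what would travel over the network

# Receiving side: decompress, then reinterpret the bytes as float32.
# The cast does no per-value parsing and makes no copy.
raw = gzip.decompress(wire)
view = memoryview(raw).cast("f")  # native byte order; Arrow is little-endian
```

Here `view` plays the role that a Float32Array plays in the browser: a typed window onto bytes that are already in the layout the GPU wants.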

To store this much data in a scatterplot, you need to load on demand. I use a quadtree structure where each tile has up to four children. The metadata of an Arrow file can contain the names of which children are actually loaded; when you zoom in to an area, only the points that are actually needed get loaded.
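The on-demand part can be sketched as a short recursive walk. This is an illustration rather than deepscatter's actual loader: the (depth, column, row) tile keys, the root extent, and the idea of a known set of existing tiles are assumptions. In the real system the existence of children would be discovered from each parent's metadata as it arrives, and the loader would also stop once it had enough points.

```python
def tile_bounds(tile, root_bounds):
    """Bounding box (x0, y0, x1, y1) of a (depth, column, row) tile."""
    depth, col, row = tile
    x0, y0, x1, y1 = root_bounds
    width = (x1 - x0) / 2 ** depth
    height = (y1 - y0) / 2 ** depth
    return (x0 + col * width, y0 + row * height,
            x0 + (col + 1) * width, y0 + (row + 1) * height)


def intersects(a, b):
    """True when two (x0, y0, x1, y1) boxes overlap with non-zero area."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]


def tiles_to_load(viewport, existing, root_bounds, tile=(0, 0, 0)):
    """List the existing tiles that overlap the viewport, parents first."""
    if tile not in existing:
        return []
    if not intersects(tile_bounds(tile, root_bounds), viewport):
        return []
    depth, col, row = tile
    needed = [tile]
    for dy in (0, 1):
        for dx in (0, 1):
            child = (depth + 1, 2 * col + dx, 2 * row + dy)
            needed += tiles_to_load(viewport, existing, root_bounds, child)
    return needed
```

With a viewport zoomed into the upper-right corner, only the root and the chain of tiles covering that corner are requested; tiles elsewhere are never touched.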

It’s necessary to use quadtrees instead of just using ordinary map tiles because some areas of this chart are much more sparsely populated than others. At high levels of zoom, this saves the browser from having to request thousands of files with just a few points in each; instead, this way, we can ensure that all tiles have about 65,000 points. The underlying Python code to create these quad tiles is part of the repo for this library.
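One way to get tiles with a capped number of points is sketched below. It is not the tiler from the repo. It assumes the points arrive sorted by priority (for example, brightest first), keeps the first `capacity` of them in the current tile, and pushes the overflow down into whichever quadrant each point falls in. The function name, the tile keys, and the depth guard are hypothetical.

```python
def build_tiles(points, capacity, bounds, key=(0, 0, 0), tiles=None, max_depth=20):
    """Split (x, y) points into quadtree tiles of at most `capacity` points.

    Returns a dict mapping (depth, column, row) keys to lists of points.
    Sparse regions simply never get child tiles.
    """
    if tiles is None:
        tiles = {}
    depth, col, row = key
    if depth >= max_depth:
        tiles[key] = list(points)  # guard against endless splitting of duplicates
        return tiles
    tiles[key] = list(points[:capacity])
    overflow = points[capacity:]
    if not overflow:
        return tiles
    x0, y0, x1, y1 = bounds
    x_mid, y_mid = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    quadrants = {}
    for point in overflow:
        dx = 1 if point[0] >= x_mid else 0
        dy = 1 if point[1] >= y_mid else 0
        quadrants.setdefault((dx, dy), []).append(point)
    for (dx, dy), members in quadrants.items():
        child_bounds = (x_mid if dx else x0, y_mid if dy else y0,
                        x1 if dx else x_mid, y1 if dy else y_mid)
        child_key = (depth + 1, 2 * col + dx, 2 * row + dy)
        build_tiles(members, capacity, child_bounds, child_key, tiles, max_depth)
    return tiles
```

With the real capacity of roughly 65,000, dense regions like the galactic plane would split many levels deep while empty stretches of sky stop at the first level or two.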

Edit Code
encoding:
  position: literal
  filter: {}
alpha: 6
point_size: 1
zoom:
  bbox: {"x":[-2, 2],"y":[-1, 1]}

Click the button below to see the outlines of the currently loaded tiles. If you’ve been zooming around, you should see a few big rectangles covering the full area, and much smaller ones in just the areas we (or you) have looked at.

Show loaded quads

If you have questions, comments, or ideas, contact me through e-mail or on Twitter, or join the Deepscatter Slack.