The European Space Agency has just released version three of the Gaia mission’s data about the location of stars in the sky. With many different parameters defined for about 1.8 billion stars, this is the largest dataset with x and y positions for points I know of. And so it’s a great opportunity to show off some of the features of the deepscatter library I’ve written for exploring large collections.
max_points: 100000
point_size: 3
alpha: 7.25
source_url: "https://files.benschmidt.org/tiles/gaia"
background_color: "#221133"
zoom:
bbox: {"x":[-2,2],"y":[-1, 1]}
encoding:
position: literal
jitter_radius:
constant: 1
method: spiral
jitter_speed:
constant: .01
color:
field: bp_rp
range: rdbu
domain: [-5, 5]
We’ll start off by looking at the 50,000 stars that appear brightest from Earth. Here they are flattened out into the shape of the sky using the Hammer projection.
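For readers who want to see what that flattening involves, here is a minimal sketch of the Hammer projection in Python. The function name `hammer` is hypothetical, and dividing the textbook formulas by the square root of two (so that x spans [-2, 2] and y spans [-1, 1], matching the bounding boxes used in the panels) is an assumption rather than something this essay states.

```python
import math
from typing import Tuple


def hammer(lon_rad: float, lat_rad: float) -> Tuple[float, float]:
    """Project longitude/latitude (in radians) with the Hammer projection.

    lon_rad is expected in [-pi, pi] and lat_rad in [-pi/2, pi/2].
    The standard formulas are divided by sqrt(2) so that the full sky
    fits in x = [-2, 2], y = [-1, 1]; that rescaling is an assumption.
    """
    denom = math.sqrt(1.0 + math.cos(lat_rad) * math.cos(lon_rad / 2.0))
    x = 2.0 * math.cos(lat_rad) * math.sin(lon_rad / 2.0) / denom
    y = math.sin(lat_rad) / denom
    return x, y


# The centre of the map, the right-hand edge, and the north pole.
centre = hammer(0.0, 0.0)
right_edge = hammer(math.pi, 0.0)
north_pole = hammer(0.0, math.pi / 2.0)
```

Sky maps often mirror the x axis so that longitude increases to the left; the sketch does not attempt that.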
50,000 points is a lot. It’s more than enough to count; and it’s enough to start to strain many traditional ways of building charts.
max_points: 50e3
alpha: 20
point_size: 4
duration: 12000
encoding:
position: literal
jitter_radius:
method: null
zoom:
bbox: {"x":[-2,2],"y":[-1, 1]}
But 50,000 is not enough to see the structure of something like the Milky Way. Even showing the 100,000 brightest stars, as here, barely makes the outline visible.
duration: 5000
point_size: 3
max_points: 100e3
Only at half a million points can you start to see the real structure here–the middle band of the Milky Way (here represented in [l, b] coordinates oriented along the galactic plane).
At this point you may be seeing squares of data flashing into the display. I’ll give some info at the end about the data presentation strategy here. Suffice it for now to say that each star is truly represented as a data point, not just an image; and that I’m using WebGL shaders that allow fairly comfortable rendering of millions of points at once on most modern machines.
This opens up interesting possibilities for exploring all sorts of datasets, including this one.
max_points: 5e5
alpha: 8.25
point_size: 2
encoding:
size: 2
x:
field: x
transform: literal
y:
field: y
transform: literal
We can really push the envelope here. Since I don’t know if you’re using mobile data, I’ll leave it to you to decide if you want to play with the sliders below that load up to 3 million points onto your screen–about 300MB of data, which compresses down to about 200MB that we actually have to send. (Most datasets include text or categorical data, and so compress much better than this one does.)
But be warned for the rest of this essay–we’re going to load more data as we go, so you might want to bookmark this if you don’t want to clobber your mobile data limits.
At this point, I’m shipping about 60 MB of data over the wire in a couple hundred files.
encoding:
position: literal
color:
field: bp_rp
range: rdbu
domain: [-5, 5]
This isn’t an image, just a serial drawing of points to the screen. The basic appearance takes three parameters:
Because we’re plotting these as actual data points, any elements of the aesthetics can be configured on the fly.
Here, for example, I start you off with the points colored using D3’s ‘blues’ scales according to their magnitude seen from Earth.
But changing the API call means that each of these points can be displayed according to a different scheme.
duration: 500
Likewise magnitude can be encoded as size, so that brighter stars are larger.
And we can also change the variables encoded by that color.
But Gaia includes more than just brightness information. An especially important part of the project’s data is that it creates parallax estimates for most of the stars it has observed, which measure how much their location shifts when viewed from different sides of the Earth’s orbit.
High parallax means that a star appears to move a lot and is close; low parallax means that it barely moves and is far away.
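The relationship between parallax and distance is simple enough to sketch. Gaia reports parallax in milliarcseconds, and a star’s distance in parsecs is the reciprocal of its parallax in arcseconds. The helper names below are hypothetical, and the naive inversion is only a rough guide: Gaia’s noisy measurements include zero and negative parallaxes, for which this estimate is undefined.

```python
LIGHT_YEARS_PER_PARSEC = 3.2616


def distance_parsecs(parallax_mas: float) -> float:
    """Naive distance estimate from a parallax given in milliarcseconds."""
    if parallax_mas <= 0:
        raise ValueError("distance is undefined for a non-positive parallax")
    return 1000.0 / parallax_mas


def distance_light_years(parallax_mas: float) -> float:
    """The same estimate converted from parsecs to light years."""
    return distance_parsecs(parallax_mas) * LIGHT_YEARS_PER_PARSEC


# A parallax of 50 mas puts a star 20 parsecs away.
nearby = distance_parsecs(50.0)
```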
Scaling back to 500K points, I’ll change the color encoding to represent parallax angles.
max_points: 5e5
encoding:
alpha: .5
filter: {}
color:
field: parallax
domain: [0, 10]
range: viridis
Parallax is actually motion in the sky; so we can represent this somewhat more naturally–if more confusingly, because it doesn’t match the visual vocabulary we’re used to from the printed page–as a circular jitter.
Now each star is rotating around its central point with a distance proportional to how much it actually moves in the sky as the Earth orbits the Sun. (I don’t actually try to trace the path it would take in the projected space here–that takes a bit more trigonometry than I’d like to throw in here.)
You can dynamically filter the points to high- or low-parallax values using the slider below.
jitter: circle
encoding:
jitter_speed: .01
color:
field: parallax
domain: [0, 10]
range: viridis
jitter_radius:
method: circle
field: parallax
domain: [0, 100]
range: [0, .05]
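The circular jitter described above can be sketched in a few lines. This is an illustration of the idea rather than deepscatter’s shader code: the function names are hypothetical, and treating the angle as `speed * t` radians (with `t` as some running clock) is an assumption. The domain, range, and speed defaults are the ones in the panel settings.

```python
import math


def linear_scale(value, domain, range_):
    """Map value linearly from domain to range_ (no clamping)."""
    d0, d1 = domain
    r0, r1 = range_
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


def circular_jitter(x, y, parallax, t, speed=0.01,
                    domain=(0.0, 100.0), range_=(0.0, 0.05)):
    """Offset (x, y) along a circle whose radius grows with parallax."""
    radius = linear_scale(parallax, domain, range_)
    angle = speed * t  # assumed: angle advances linearly with the clock
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


# A star with parallax 50 always sits 0.025 units from its true position.
px, py = circular_jitter(1.0, 2.0, 50.0, t=123.0)
offset = math.hypot(px - 1.0, py - 2.0)
```

A star with zero parallax has zero radius and stays put, which is why distant stars appear still while nearby ones visibly circle.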
But I promised you a billion points. While Arrow and WebGL allow us to comfortably display tens of millions of points in the browser, schlepping gigabytes of data directly to your browser is unreasonable. Deepscatter waits until you want to zoom into a region to load the individual points on demand using a customized quadtree implementation.
Some parts of the Gaia set are outside the Milky Way proper; here is the Large Magellanic Cloud, where all of the stars are too far off (roughly 160,000 light years) to see.
duration: 10000
jitter: null
point_size: 1
encoding:
filter: {}
jitter_radius: 0
color:
field: bp_rp
range: rdbu
domain: [5, -5]
zoom:
bbox:
{"x":[1.03,1.13],"y":[0.33,0.38]}
In some areas, there are huge numbers of stars visible in tiny parts of the sky. In this portion of the Magellanic Cloud lurk some stars numbered 1.7 billion or higher in the set.
duration: 10000
max_points: 5e5
encoding:
alpha: 1
size: 1
color:
field: bp_rp
range: rdbu
domain: [-5, 5]
zoom:
bbox: {"x":[1.129,1.1295],"y":[0.368,0.3683]}
As we zoom back out, you can see just how tiny this portion of the sky is.
The portion of the sky we’re going to is 3x larger than the Moon: the area covered by the Andromeda Galaxy. Since its individual stars are not visible, you may not be able to see anything in it at first…
duration: 20000
max_points: 1e6
zoom:
bbox: {"x":[-1.5780744616098814,-1.5450523115250894],"y":[0.3505699535775513,0.3721224679810988]}
… but deepscatter allows you to change the target number of points displayed. Bumping up the scale to 3,000,000 points, the rings of the Andromeda Galaxy become clearly visible.
duration: 2000
max_points: 3e6
The Triangulum Galaxy (M33) is relatively near Andromeda, both in the sky and in the Local Group; its definition is even clearer.
duration: 9000
zoom:
bbox:
{"x":[-1.378083191427632,-1.3514810172803868],"y":[0.5654862401010377,0.582848638975793]}
Within our own galaxy are areas of intense stellar concentration as well. Here is Omega Centauri, a globular cluster with 10 million stars in a ball only 150 light-years wide–stars average only 0.1 light years apart.
These points may be overplotted–if so, you may want to adjust the zoom scaling parameter, which [I explained more here].
duration: 12000
max_points: 2e5
zoom:
bbox:
{"x":[0.8319706197989606,0.8394665796325281],"y":[-0.14647132479462066,-0.14157894856028852]}
Here’s the full scale: Change the zoom balance and scroll back a panel if you wish.
encoding:
position: literal
zoom:
bbox: {"x":[-2, 2],"y":[-1, 1]}
Passing individual data also means that Deepscatter is able to arbitrarily change positions just like any other aesthetic.
Here’s a fairly straightforward plot: the x axis shows the parallax to a star with a log transform, and the y axis shows the apparent magnitude from Earth. The color is the absolute magnitude.
The diagonal lines here show stars of a fixed absolute magnitude: the farther away they are (x axis), the dimmer they look from Earth (y axis).
max_points: 1e6
point_size: 2
duration: 2500
encoding:
filter: {}
y:
field: phot_g_mean_mag
transform: linear
domain: [15, 2]
range: [-3, 1]
x:
field: parallax
domain: [300, .001]
range: [-1, 1]
transform: log
color:
field: abs_mag
range: magma
domain: [5, -5]
This particular relationship is an identity in this dataset–the absolute magnitudes are calculated directly from the parallax and g-band magnitude. If you filter by absolute magnitude, you get straight lines.
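To see why the lines come out straight, here is a sketch using the standard distance-modulus relation. The essay does not say exactly how `abs_mag` was computed, so treating it as this formula (with parallax in milliarcseconds and no correction for dust extinction) is an assumption, and the function names are hypothetical.

```python
import math


def absolute_magnitude(apparent_mag: float, parallax_mas: float) -> float:
    """M = m + 5 * log10(parallax in mas) - 10, valid for positive parallax."""
    if parallax_mas <= 0:
        raise ValueError("absolute magnitude needs a positive parallax")
    return apparent_mag + 5.0 * math.log10(parallax_mas) - 10.0


def apparent_magnitude(absolute_mag: float, parallax_mas: float) -> float:
    """The same relation solved for the apparent magnitude."""
    return absolute_mag - 5.0 * math.log10(parallax_mas) + 10.0


# For a fixed absolute magnitude, each tenfold drop in parallax
# adds exactly 5 magnitudes: a straight line on a log-parallax axis.
step = apparent_magnitude(0.0, 1.0) - apparent_magnitude(0.0, 10.0)
```

Because the apparent magnitude is linear in the logarithm of the parallax, any set of stars sharing one absolute magnitude falls on a line with the same slope in this plot.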
But other two-d representations from Gaia can show actual relationships. A plot common in astronomy is the Hertzsprung-Russell diagram, which shows absolute magnitude plotted against color. The primary line visible here runs from upper-left to lower-right, and shows main sequence stars, which follow standard relationships of mass, age, color, and luminosity; the points off the line to the right are giants, which can have substantially greater luminosities and burn farther in the red spectrum.
(I think? I’m no astronomer.)
max_points: 1e6
point_size: 2
duration: 2500
encoding:
filter: {}
y:
field: abs_mag
domain: [-10, 15]
range: [-1, 1]
transform: linear
x:
field: bp_rp
domain: [-5, 5]
range: [-1, 1]
transform: linear
color:
field: bp_rp
range: rdbu
domain: [5, -5]
This is an interesting case of data shaping because so many of the stars appear to be giants; but that’s only because high magnitude stars are definitionally easier to see when they’re farther away.
If we filter to just stars with a parallax over fifty,
alpha: 100
point_size: 20
encoding:
filter:
field: parallax
op: gt
a: 50
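A filter specification like the one above reads naturally as "keep rows where `field` compared with `a` by `op` is true". The sketch below interprets such a specification over plain Python dictionaries. It is not deepscatter’s own code: `apply_filter` is a hypothetical helper, and only `gt` appears in this essay, so the `lt` entry is added purely for illustration.

```python
import operator

# Comparison operators keyed by the "op" name in the filter spec.
OPS = {"gt": operator.gt, "lt": operator.lt}


def apply_filter(rows, spec):
    """Return the rows that pass spec; an empty spec keeps everything."""
    if not spec:
        return list(rows)
    compare = OPS[spec["op"]]
    return [row for row in rows if compare(row[spec["field"]], spec["a"])]


stars = [
    {"name": "near", "parallax": 120.0},
    {"name": "middling", "parallax": 50.0},
    {"name": "far", "parallax": 0.4},
]
kept = apply_filter(stars, {"field": "parallax", "op": "gt", "a": 50})
```

With parallax in milliarcseconds, a threshold of 50 corresponds to stars within 20 parsecs, or roughly 65 light years.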
Enough about stars, about which I know little, and time for some talk about data and tiling.
Lots of pages have a long loading bar when you open them up. That’s fine if you can push all the data at the start; but with more than a few million points, we need to push more or less instantaneously or else we’ll get blocking lag every time we add more data. The easiest cycle–which I’ve used before–is to send things as CSV, parse them into JavaScript numbers, and then draw from those with canvas. Converting those into single-precision floats before sending to the GPU makes rendering fast once they’re on the GPU, but it does nothing for the period up until then.
Luckily, a number of very talented people have been thinking hard about the future of data serialization in an age of GPU computation. I’m building off of what I find the most attractive of these projects by using the Apache Arrow format to send data to the browser. Arrow uses a standardized binary representation that imports directly into typed arrays in JavaScript. If you store columns as 4-byte floats, this means that you can push the data straight to the GPU without having to parse it even once. Apache Arrow files can be gzipped before sending; the browser expands them straight into float32 Typed Arrays, which can be written straight to contiguous blocks of pre-allocated buffers on the GPU. This means that although this site uses JavaScript in many ways, the big core memory objects are essentially being written on the server using fast C++ code called from Python, and then mapped byte-for-byte to the GPU without having to actually manipulate them as native structures in either Python or JavaScript.
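The difference between the two routes can be shown with nothing but the Python standard library. This is a stand-in for the idea, not Arrow itself: it uses no Arrow file format or Arrow library, and the sample values are made up (chosen to be exactly representable as 4-byte floats).

```python
import array
import csv
import io

values = [0.5, -1.25, 3.0, 42.0]  # made-up sample column

# Text route: every number is formatted as text, then parsed again on arrival.
csv_text = ",".join(str(v) for v in values)
parsed = [float(cell) for cell in next(csv.reader(io.StringIO(csv_text)))]

# Binary route: pack the column once as 4-byte floats...
payload = array.array("f", values).tobytes()

# ...and on arrival reinterpret the same bytes as floats, with no parsing
# step and no copy of the underlying buffer.
view = memoryview(payload).cast("f")
first_value = view[0]
```

The `memoryview` plays the role that a float32 typed array plays in the browser: a typed window onto bytes that already have the right layout.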
To store this much data in a scatterplot, you need to load on demand. I use a quadtree structure where each tile has up to four children. The metadata of an Arrow file can contain the names of which children are actually loaded; when you zoom in to an area, only the points that are actually needed get loaded.
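As a rough illustration of loading on demand, the sketch below walks a quadtree from the root and descends only into children that overlap the current viewport. It is not deepscatter’s implementation: the `(depth, column, row)` keys, the `children` dictionary (standing in for the per-file metadata mentioned above), and the `max_depth` cutoff are all hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int, int]               # (depth, column, row)
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

ROOT_EXTENT: Box = (-2.0, -1.0, 2.0, 1.0)


def tile_extent(key: Key, root: Box = ROOT_EXTENT) -> Box:
    """Bounding box of a tile, found by halving the root extent per depth."""
    depth, col, row = key
    x0, y0, x1, y1 = root
    width = (x1 - x0) / 2 ** depth
    height = (y1 - y0) / 2 ** depth
    return (x0 + col * width, y0 + row * height,
            x0 + (col + 1) * width, y0 + (row + 1) * height)


def intersects(a: Box, b: Box) -> bool:
    """True when two boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def tiles_to_load(children: Dict[Key, List[Key]], viewport: Box,
                  max_depth: int) -> List[Key]:
    """Collect every existing tile that overlaps the viewport."""
    needed = []
    stack = [(0, 0, 0)]
    while stack:
        key = stack.pop()
        if not intersects(tile_extent(key), viewport):
            continue  # skip this tile and everything beneath it
        needed.append(key)
        if key[0] < max_depth:
            stack.extend(children.get(key, []))
    return needed


# A tiny made-up tree, queried with the Large Magellanic Cloud viewport.
children = {
    (0, 0, 0): [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)],
    (1, 1, 1): [(2, 3, 2), (2, 2, 3)],
}
needed = sorted(tiles_to_load(children, (1.03, 0.33, 1.13, 0.38), max_depth=2))
```

Skipping a tile also skips its whole subtree, which is what keeps a deep zoom from touching most of the dataset.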
It’s necessary to use quadtrees instead of just using ordinary map tiles because some areas of this chart are much more sparsely populated than others. At high levels of zoom, this saves the browser from having to request thousands of tiles with just a few points in each; instead, this way, we can ensure that all tiles have about 65,000 points. The underlying Python code to create these quad tiles is part of the repo for this library.
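One simple way to get tiles of bounded size is to drop each point into the shallowest tile that contains it and still has room. The sketch below does that. It is a toy version and not the tiling code in the repo: `build_tiles` is a hypothetical name, the input is assumed to be pre-sorted so the most important points (say, the brightest stars) come first, and a real run would use a capacity near 65,000 rather than 2.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int, int]  # (depth, column, row)


def build_tiles(points: Iterable[Point], capacity: int,
                extent=(-2.0, -1.0, 2.0, 1.0)) -> Dict[Key, List[Point]]:
    """Assign each point to the shallowest non-full tile that contains it."""
    tiles: Dict[Key, List[Point]] = {}
    x0, y0, x1, y1 = extent
    for x, y in points:
        depth = 0
        while True:
            n = 2 ** depth  # tiles per side at this depth
            # Clamp so points on the far edge land in the last tile.
            col = max(0, min(int((x - x0) / (x1 - x0) * n), n - 1))
            row = max(0, min(int((y - y0) / (y1 - y0) * n), n - 1))
            bucket = tiles.setdefault((depth, col, row), [])
            if len(bucket) < capacity:
                bucket.append((x, y))
                break
            depth += 1  # tile is full: push the point one level down
    return tiles


sample = [(-1.5, -0.5), (1.5, 0.5), (1.6, 0.6), (1.7, 0.7), (-1.4, -0.6)]
tiles = build_tiles(sample, capacity=2)
```

Dense regions end up with deep stacks of full tiles while sparse regions stop after a level or two, so no tile is ever nearly empty unless it is a leaf.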
encoding:
position: literal
filter: {}
alpha: 6
point_size: 1
zoom:
bbox: {"x":[-2, 2],"y":[-1, 1]}
Click the button below to see the outlines of the currently loaded tiles. If you’ve been zooming around, you should see a few big rectangles covering the full area, and much smaller ones in just the areas we (or you) have looked at.
If you have questions, comments, or ideas, contact me through e-mail or on Twitter, or join the Deepscatter Slack.