Binning big data so you can visualise it later
This article is originally published at https://robertgrantstats.wordpress.com
Sometimes I find myself writing a program to count big data (i.e. too big to open in RAM all at once) into bins, or to get some statistic for each bin. Histograms are the simplest example of this, where the data are counted into bins along one dimension at a time. Nowadays, we all love hexagonal bins as a way of showing density when there are too many points to draw clearly; they are histograms in two dimensions. Binning is the quintessential big data activity; cf. Wickham's bin-summarise-smooth.
Last year Nick Cox pointed out to me that the only regular (i.e. all sides the same length, all angles the same) two-dimensional shapes that tessellate (i.e. fill up the plane without leaving gaps or overlapping) are those with 3, 4 and 6 sides: the equilateral triangle, the square and the regular hexagon.
Then I started thinking about this in the big data context. Suppose I have to reduce my data to make it vizable, you know, so that I, the feeble human, can explore it and see what's going on. That is time-consuming and hard to program, so I want to do it once only if possible. How should I bin the data to keep my options open for later dataviz?
Here’s an example of what I mean. If I have two variables, like latitude and longitude of NYC taxi pickups, and I count them on a fine square grid, I can store that matrix of counts locally. Even if it is a big grid like 10,000 by 10,000, that will still be 100,000,000 numbers, which is quite manageable. Later, maybe I want to draw it on a 1000 by 1000 grid, so I just add together the counts in adjacent groups of 100 small squares (each 10-by-10 block) to make one big square. That runs quickly.
// pseudocode: aggregate points into a square grid of counts
int[n_rows,n_cols] count_matrix // initialised to zeros
for i in 1:n_data {
    int rownum=floor(y[i]/row_height)
    int colnum=floor(x[i]/col_width)
    count_matrix[rownum,colnum]=count_matrix[rownum,colnum]+1
}
// pseudocode: aggregate into a coarser square grid
int reduction=10 // each new square is the sum of a 10x10 block of the old grid
int new_rows=n_rows/reduction // assuming n_rows is a multiple of reduction
int new_cols=n_cols/reduction
int[new_rows,new_cols] new_matrix
for i in 0:new_rows-1 { // if our language is zero-indexed
    for j in 0:new_cols-1 {
        int top_corner=i*reduction // recompute the block's corner for each (i,j)
        int left_corner=j*reduction
        new_matrix[i,j]=sum(count_matrix[top_corner:(top_corner+reduction-1),left_corner:(left_corner+reduction-1)])
    }
}
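If your language has numerical arrays, both pseudocode steps above collapse to a few vectorised lines. Here is a minimal sketch in Python with NumPy; the uniform random coordinates are a toy stand-in for the taxi pickups, and all the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=100_000)  # toy stand-in for longitudes
y = rng.uniform(0, 100, size=100_000)  # toy stand-in for latitudes

# fine grid: 1000 x 1000 squares covering [0, 100) in each direction
n_rows = n_cols = 1000
row_height = col_width = 100 / n_rows
rownum = np.floor(y / row_height).astype(int)
colnum = np.floor(x / col_width).astype(int)
counts = np.zeros((n_rows, n_cols), dtype=int)
np.add.at(counts, (rownum, colnum), 1)  # unbuffered add copes with repeated indices

# later, coarsen by a factor of 10: reshape, then sum each 10x10 block
reduction = 10
coarse = counts.reshape(n_rows // reduction, reduction,
                        n_cols // reduction, reduction).sum(axis=(1, 3))
```

The reshape trick does the corner arithmetic from the pseudocode for you: each 10x10 block becomes one cell of the coarse grid, and no count is lost or double-counted.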
So I was thinking: can you combine these shapes easily? This called for some geometry, which was never my thing. Here we go.
Triangles can be combined in sets of four to make bigger triangles, or in sets of six to make a hexagon. Squares combine in sets of four to make bigger squares, and so on. Hexagons don’t combine to make any of these shapes. So, what can I conclude? Bin your data in two dimensions using small squares or triangles, bearing in mind that triangles will give you hexbins if you want them, but there is no crossing from tri-hex to square or back again. You could aggregate triangles into a rhombus, but not into a square.
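For concreteness, here is one way to allocate a point to a triangle bin; this helper is my own illustrative sketch, not part of the original binning code. Map the point into skewed coordinates along the lattice basis vectors (1, 0) and (1/2, √3/2), take the containing rhombus cell, and check which of the cell's two triangles (upward- or downward-pointing) the point falls in:

```python
import math

SQRT3 = math.sqrt(3.0)

def triangle_bin(x, y, side=1.0):
    """Index of the equilateral triangle (side length `side`) containing (x, y).

    Returns (a, b, points_up): the rhombus cell in skewed lattice
    coordinates, plus a flag for which half of the cell the point is in."""
    beta = 2.0 * y / (side * SQRT3)          # coordinate along (1/2, sqrt(3)/2)
    alpha = x / side - y / (side * SQRT3)    # coordinate along (1, 0)
    a, b = math.floor(alpha), math.floor(beta)
    points_up = (alpha - a) + (beta - b) < 1.0  # lower-left half points up
    return (a, b, points_up)
```

Aggregating the counts of the six triangles that share a lattice vertex then gives a hexbin, and sets of four adjacent triangles give bigger triangles, as described above.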
Now, what about higher-dimensional binning? It seems that the only regular “space-filling polyhedron” in three dimensions is the cube (cf. https://en.wikipedia.org/wiki/Honeycomb_(geometry)). There are some other shapes that fill space by themselves but have a mixture of face shapes; you probably don’t want to tangle with those, because of the difficulty of determining whether a point is in this polyhedron or that polyhedron. There are also mixtures of polyhedra which together fill space, but that’s also unsatisfying for this application, because you want flexibility in aggregating them to larger polyhedra, and consistency when taking slices through them for visualisation. So, use cubes. I suppose if you really wanted a hexbin (and it’s a good visual format!), you should do 2-D triangle or hexagon bins from the outset; in 3-D these could be stacked right prisms (think of the Giant’s Causeway) which later get aggregated for a marginal plot or filtered for a conditional slice.
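Once the counts are in a 3-D array of cube (or prism) bins, the marginal plot and the conditional slice just mentioned are a sum over one axis and an index into it, respectively. A minimal sketch, with toy Poisson counts standing in for real binned data:

```python
import numpy as np

rng = np.random.default_rng(1)
counts3d = rng.poisson(2, size=(40, 40, 20))  # toy 3-D grid of bin counts

marginal_xy = counts3d.sum(axis=2)  # marginal plot: collapse the z dimension
slice_xy = counts3d[:, :, 5]        # conditional slice: keep one z layer only
```

Both operations touch only the stored counts, never the raw data, which is the whole point of binning once up front.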
More than three dimensions eluded my non-existent powers of geometrical thinking, but it seems to me that hypercubes always pay off. Not only can you aggregate them however you choose, but it’s also easy to allocate points to hypercubes: you just add more lines (those with the floor() function) to the code above, and more dimensions to the array count_matrix.
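A sketch of the k-dimensional version: NumPy’s histogramdd does exactly this floor-and-count allocation for any number of dimensions, and the same reshape trick coarsens the grid later. The Gaussian data and bin settings here are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(50_000, 4))  # toy 4-D data

# fine hypercube grid: 20 bins per dimension over (-4, 4)
counts, edges = np.histogramdd(data, bins=20, range=[(-4, 4)] * 4)

# later, coarsen every dimension by a factor of 2, without revisiting the raw data
coarse = counts.reshape(10, 2, 10, 2, 10, 2, 10, 2).sum(axis=(1, 3, 5, 7))
```

Note that points falling outside the stated range are dropped by histogramdd, so in practice you would choose the range to cover the data.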
Bottom line: for k-dimensional data, bin in k-hypercubes. But if you know you want triangles or hexbins in 2-D projections or conditional slices, then you’ll have to do that from the outset.