Revolution #5: The Pain of Projections

Projections, in geographic nomenclature, are the methods used to take things that exist on a 3D object (i.e., “The Earth”) and present them in a 2D space (i.e., “Every satellite image ever taken”).  They are the bane of introductory courses, and computationally they can be a beast to get right.

Take, for example, the seemingly simple scenario where you have a square, and you want to get the mean value of temperature measurements inside that square:

18 19
21 22

where each measurement (18, 19, 21, 22) is a satellite-derived estimate of the average temperature in each image pixel.  So: four pixels, one box I’m trying to estimate the overall average temperature for.  Simple, right?:

(18 + 19 + 21 + 22) / 4 = 20

In a perfect world, we’d be done – but the projected world makes this a bit more challenging.  In practice within GeoQuery (and most satellite-based data retrieval systems), we use a latitude and longitude system to define our grid.  For example, the grid cell with the “18” in it might be defined as starting at 37.27N, 76.71W in its upper-left corner, with a resolution of 0.1 decimal degrees – i.e., its upper-right corner is then at 37.27N, 76.61W.

So, why does this matter?  The short of it is that not all decimal degrees are created equal – depending on where you are in the world (i.e., how close you are to the poles), the ground distance spanned by 0.1 decimal degrees of longitude changes.
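To put numbers on that, here is a minimal Python sketch (the `haversine_km` helper and the sample latitudes are my own illustration, not GeoQuery code) that uses the haversine formula to measure how much ground 0.1 decimal degrees of longitude actually covers at different latitudes:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (haversine formula)."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# The same 0.1-degree step in longitude (76.71W to 76.61W), at different latitudes:
for lat in (0.0, 37.27, 60.0, 80.0):
    d = haversine_km(lat, -76.71, lat, -76.61)
    print(f"0.1 deg of longitude at {lat:5.2f}N spans {d:.2f} km")
```

Running this, you can watch the east–west span of the same 0.1-degree step shrink from roughly 11 km at the equator to a fraction of that near the poles – which is exactly why the four cells in the box above do not represent equal areas.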

So, think back to our box.  In practice, we can’t just average our four numbers, because the areas represented by the cells in the box differ.  Clever folks have derived a common solution to this, which is to “reproject” the raster data – effectively creating, removing, or otherwise warping raster grid cells to ensure every cell is the same size (referred to as an “equal-area projection”).  However, this is both computationally expensive and requires imputation / interpolation, as you’re changing the source data itself.

Because we’re dealing with massive sets of data, the computational challenge alone was enough to make reprojecting the source data a non-starter.  As an alternative, we apply what we call a Haversine Transformation.  In effect, based on the latitude of each of our cells, we’re able to create a weights matrix that “rescales” each cell to its true size.  So, the true values become something like:

K1 * 18 K1 * 19
K2 * 21 K2 * 22

where K1 and K2 are the haversine coefficients we apply to each row to rescale them to reflect their real size, not their decimal-degree size.  And, voilà!  We can now take a realistic weighted mean for a given area.  We go into more detail on this in our GeoQuery user guide, but if you were ever wondering why we don’t just take an average and be done with it, now you know.
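Pulling the example together, here is a minimal sketch of that weighted mean.  The cell-center latitudes (37.22N and 37.12N) are hypothetical values consistent with the 0.1-degree grid above, and I use cos(latitude) as a stand-in for the K1 / K2 row coefficients – a standard approximation for how cell area shrinks with latitude; the post doesn’t give the exact formula GeoQuery uses:

```python
import math

# Grid from the example: two rows of pixels, each row sharing a
# (hypothetical) cell-center latitude, 0.1 degrees apart.
rows = [
    (37.22, [18.0, 19.0]),  # (cell-center latitude, temperatures)
    (37.12, [21.0, 22.0]),
]

# Weight each row by cos(latitude): the east-west width of a fixed
# decimal-degree cell, and hence its area, shrinks by roughly cos(lat)
# as you move toward the poles.
weighted_sum = 0.0
weight_total = 0.0
for lat, temps in rows:
    k = math.cos(math.radians(lat))  # the K1 / K2 coefficient for this row
    for t in temps:
        weighted_sum += k * t
        weight_total += k

print(round(weighted_sum / weight_total, 4))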

About the author: Daniel Runfola

Dan's research focuses on the use of quantitative modeling techniques to explain the conditions under which aid interventions succeed or fail using spatial data. He specializes in computational geography, machine learning, quasi-observational experimental analyses, human-int data collection methods, and high performance computing approaches. His research has been supported by the World Bank, USAID, Global Environmental Facility, and a number of private foundations and donors. He has published 34 books, articles, and technical reports in support of his research, and is the Project Lead of the GeoQuery project. At William and Mary, Dr. Runfola serves as the director of the Data Science Program, and advises students in the Applied Science Computational Geography Ph.D. program.