Revolution #18: “Seemingly Simple Task”

About 2 years ago, I stumbled on a StackOverflow post that really summed up much of the experience we’re having with GeoQuery. After 2 years and 4 months (as of this post), no accepted answer has been posted.

I am trying to do a seemingly simple task — extract mean pixel values from a raster based on a polygon overlay in PostGIS…
Based on help files and the aforementioned posts/questions, I arrived at the following PostGIS query:

SELECT id, (SUM((ST_SummaryStats(a.rast, 1, true)).sum)/SUM((ST_SummaryStats(a.rast, 1, true)).count)) as mean
FROM imagery.l8_2015_09_12 AS a, analysis_results.zonal_stats_test as b
WHERE ST_Intersects(b.geom,a.rast)
GROUP BY id;

The results are:

1: 9902.49
2: 10079.68
3: 12355.90
However, when I test the results against ArcGIS’s “zonal statistics to table” tool, I get the following results:

1: 8089.61
2: 7527.62
3: 12290.05
Close, but not identical.

So, to summarize: a user wants the mean value of a raster, summarized to arbitrary boundaries.  They use two common tools – PostGIS and ArcGIS – to do so.  And they get two (in my opinion, very) different answers.

This type of question is a common one, so how is it possible that powerhouse groups like ESRI and PostGIS disagree on the fundamental solution?  After all, there is a right answer: there is only one true mean value for (for example) nighttime lights within a given boundary for a given time point; there is only one true sum.

The Technical Answer

The process of aggregating raster data to vector boundaries is at the core of GeoQuery, and is commonly called “zonal statistics”.  Nearly every software platform you can imagine – QGIS, ESRI products, PostGIS, and more – has mechanisms built in to conduct zonal stats (a short sketch of what that looks like in practice follows the list below).  With that power come two enormous limitations, however:

(1) Zonal stats is expensive in both computation and memory, increasingly so as the resolution of your source data gets higher.

(2) Zonal stats requires either the programmer or user to be exceptionally knowledgeable about a lot of esoteric details when it comes to spatial data (projections in particular).
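
To make the discussion concrete, here is a minimal sketch of what a zonal statistics call looks like in one common open-source stack (Python with the rasterstats package).  The file names are hypothetical placeholders, and the defaults the library chooses – how it rasterizes each polygon, whether partially covered edge cells count – are exactly the kinds of assumptions discussed below.

# Minimal zonal statistics sketch; "districts.shp" and "nightlights.tif"
# are hypothetical placeholder files.
from rasterstats import zonal_stats

# Mean, sum, and count of raster values within each polygon.  The library's
# defaults (polygon rasterization rule, treatment of edge cells) are baked
# into the result -- different tools make different choices here.
stats = zonal_stats("districts.shp", "nightlights.tif",
                    stats=["mean", "sum", "count"])

for i, s in enumerate(stats, start=1):
    print(i, s["mean"])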

We have an entire blog post on #2, but the short of it is that even knowledgeable GIS practitioners have to make choices regarding accuracy when they reproject data.  If you’re reading this and you know what “reprojection” is, this comes down to the deep, dark interpolation mechanics behind reprojection – stretching and warping a raster always involves some form of assumption, even if it’s not explicit.  This can result in slightly different answers to zonal statistics questions, and thus lead to confusion such as that in the StackOverflow post.
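
To make that point concrete, here is a small, hedged sketch (plain Python/NumPy with made-up numbers, not a real map reprojection): when a warp collapses several source pixels into one cell of the new grid, a nearest-neighbor rule and an averaging rule produce different values from the same data, and any zonal statistic computed afterwards inherits that difference.

import numpy as np

# A made-up 2x2 block of source pixels that a warp will collapse into a
# single cell of a coarser, reprojected grid (a stand-in for real warping).
source_block = np.array([[10.0, 12.0],
                         [40.0, 44.0]])

# Nearest-neighbor style resampling: the new cell takes one source pixel's value.
nearest_value = source_block[0, 0]    # -> 10.0

# Averaging (bilinear-like) resampling: the new cell blends the contributing pixels.
averaged_value = source_block.mean()  # -> 26.5

print("nearest-neighbor cell value:", nearest_value)
print("averaged cell value:", averaged_value)

Neither value is “wrong” – they are different assumptions about how to fill the new grid, usually chosen by a default setting the user never sees.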

On number 1, the memory and computation challenges come down to two fundamental steps required in zonal statistics: (a) creating a matrix of every cell that belongs to a given polygon, and (b) calculating the weight each of those cells should have in a sum or mean calculation.  Both of these steps can be hugely challenging.

Imagine this scenario: you have an Excel spreadsheet with 4 billion cells.  Separately from that, you have a second spreadsheet with 140 cells identified with a “1”.  How do you identify the 140 cells in the first spreadsheet you should be using?  Then, once you’ve done that, how do you handle cells that should only partially be counted (e.g., a boundary where half of a cell falls inside the city, but the other half falls outside of it)?

While these are simple concepts, calculating them at scale becomes costly for most generic algorithms, which hold the entirety of spreadsheet #1 in memory to perform manipulations on it.  The ways algorithms handle these steps (sometimes by just crashing!) can change the output of a zonal statistics approach, and are generally predicated on many assumptions.  Behind the scenes, GeoQuery does all sorts of things to mitigate this (I won’t go into detail here, but rest assured it’s very painful); so, while we can provide accurate solutions that lack the limitations of some generic desktop environments, the approaches we use are simply not suitable for most non-HPC environments.
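
As a hedged illustration of those two steps, the sketch below uses plain NumPy and made-up numbers: a tiny raster, a 0–1 “coverage fraction” grid standing in for how much of each cell falls inside the polygon, and two defensible-but-different ways of turning that into a mean.  Real tools compute the coverage fractions from actual geometries; the point here is the contrast between the two answers.

import numpy as np

# A tiny, made-up raster (e.g., nighttime-lights values).
values = np.array([[ 5.0, 10.0, 12.0],
                   [ 8.0, 20.0, 25.0],
                   [ 3.0,  9.0, 30.0]])

# Step (a): which cells belong to the polygon at all, and
# step (b): what fraction of each cell falls inside the boundary.
# (In a real tool these fractions come from geometry intersections.)
coverage = np.array([[0.0, 0.3, 0.0],
                     [0.5, 1.0, 0.2],
                     [0.0, 0.9, 0.0]])

# Approach 1: all-or-nothing -- any touched cell counts fully.
mask = coverage > 0
simple_mean = values[mask].mean()

# Approach 2: area-weighted -- each cell is weighted by its coverage fraction.
weighted_mean = (values * coverage).sum() / coverage.sum()

print("all-or-nothing mean:", round(simple_mean, 2))
print("area-weighted mean:", round(weighted_mean, 2))

Both approaches are reasonable, and different tools (and different default settings within the same tool) land on different versions of them – which is one way two “zonal statistics” buttons can return two different means from the same data.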

The Organizational Answer

GIS is a scattered discipline – innovation in GIS comes from a huge range of actors.  Computer Scientists, GIScientists, Computational Geographers (like me!), and even economists, archaeologists, hydrologists, city planners, and a huge variety of others all contribute to GIScience.  Different groups have very different approaches to dealing with “fuzzy” data measurement issues like those noted above.  Computer Scientists may be more concerned about code runtimes, memory management, and efficiency, and be willing to trade off precision for those things (frequently behind the scenes, or buried in help documents).  City planners may be more concerned about getting exceptionally accurate line or polygon data, so questions about how to appropriately summarize a raster to a polygon simply don’t get as much attention.  Economists broadly look at country-level summaries of data (though subnational analysis is becoming more common), so these types of errors amount to very little for them.

All of these differing views of and needs for GIS mean that it sometimes takes a while – sometimes a very long while – for “seemingly simple tasks” to become… well, simple.  We’ll be the first to say that, today, this task still isn’t simple – memory management, physical disk space requirements, hugely variable spatial datatypes, and many more issues still make Cliff’s question a hard one to answer and implement a solution for.

About the author: Daniel Runfola

Dan's research focuses on the use of quantitative modeling techniques to explain the conditions under which aid interventions succeed or fail using spatial data. He specializes in computational geography, machine learning, quasi-observational experimental analyses, human-int data collection methods, and high performance computing approaches. His research has been supported by the World Bank, USAID, Global Environmental Facility, and a number of private foundations and donors. He has published 34 books, articles, and technical reports in support of his research, and is the Project Lead of the GeoQuery project. At William and Mary, Dr. Runfola serves as the director of the Data Science Program, and advises students in the Applied Science Computational Geography Ph.D. program.