Revolution #7: Big Data & Spatial Information

Geographers like to say “everything happens somewhere,” and while there may be a few exceptions to the rule, it generally holds true.  Because of this, geography becomes a great equalizer when it comes to integrating across very disparate datasets – e.g., temperature, voting rates, health statistics, and perceptions of governance are all measured in very different ways, but they are all measured somewhere.

A very practical example: say you’re a researcher and you want to know if there are correlations between voting rates and obesity.  The obvious unit of observation for this is the individual – i.e., you survey hundreds of individuals, ask them if they voted or not (good luck with IRB), and then measure whether they are obese.  Both resource and ethical limitations may inhibit your ability to conduct this analysis at the individual level.  However, if you turn to a geographic unit of analysis – e.g., a census block, county, or state – you can integrate across aggregate measures of obesity and voting rates, overcoming both challenges (surveys likely already exist on both topics, and aggregate metrics avoid the ethical concerns of asking individuals if they voted).  This example glosses over some details, but it’s a reasonably quick summary of why adding spatial location to information can be so powerful.
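To make the idea concrete, here is a minimal sketch of that kind of integration in pandas. The county FIPS codes and rates are invented for illustration; in practice each table would come from an independent published survey, and the shared geographic key is what lets them be joined:

```python
import pandas as pd

# Hypothetical aggregate survey data, keyed by county FIPS code.
# The two tables have nothing in common except geography.
voting = pd.DataFrame({
    "fips": ["51095", "51199", "51710"],
    "turnout_rate": [0.62, 0.58, 0.49],
})
health = pd.DataFrame({
    "fips": ["51095", "51199", "51710"],
    "obesity_rate": [0.27, 0.31, 0.36],
})

# The shared geographic key lets two unrelated surveys be
# integrated without touching any individual-level data.
merged = voting.merge(health, on="fips", how="inner")
print(merged["turnout_rate"].corr(merged["obesity_rate"]))
```

No survey respondent is ever asked both questions; the join happens entirely at the level of places, which is exactly the move the paragraph above describes.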

This is getting folks excited – you see “location analytics” popping up as a buzzword, and practitioners in new fields (notably economics and government analysis) starting to leverage spatial data in fairly exciting ways.

Enter GeoQuery – for better or worse, we’re trying to blow the lid off of spatial data and open it up to anyone who wants to try their hand at using this amazing data trove.  But, it’s not all daisies and roses – spatial data comes with many unique challenges, doubly so when you’re integrating information.  Spatial uncertainty (i.e., a lack of knowledge of exact measures at every location) compounds as you integrate more datasets.  The unit of observation you choose can fundamentally change your analysis (the well-studied Modifiable Areal Unit Problem, or MAUP).  Events that result in “spatial spillover” can contaminate the independence of measurements across space, violating underlying assumptions of many statistical models.  The list of methodological concerns goes on and on, and so the risks associated with “opening up spatial data” to non-expert users are very real.
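The MAUP is easy to demonstrate with a toy example. In the sketch below, all unit IDs and values (rainfall and crop yield per zone) are invented: the same underlying data, re-aggregated into a different zonal scheme, flips the sign of the correlation entirely:

```python
import pandas as pd

# Toy illustration of the Modifiable Areal Unit Problem (MAUP).
# Four fine-grained zones, each with two measured variables;
# the "coarse_unit" column assigns them to one possible re-zoning.
fine = pd.DataFrame({
    "fine_unit":   [0, 1, 2, 3],
    "coarse_unit": [0, 1, 0, 1],
    "rainfall":    [1, 2, 3, 4],
    "crop_yield":  [2, 1, 4, 3],
})

# Correlation measured on the fine zones.
corr_fine = fine["rainfall"].corr(fine["crop_yield"])

# The same values, averaged up into the coarse zones.
coarse = fine.groupby("coarse_unit")[["rainfall", "crop_yield"]].mean()
corr_coarse = coarse["rainfall"].corr(coarse["crop_yield"])

print(corr_fine, corr_coarse)  # positive at one scale, negative at the other
```

Nothing about the world changed between the two calculations – only the boundaries drawn around the measurements – which is why the choice of unit of observation deserves so much care.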

So, shouldn’t we turn GeoQuery off?  We argue that the benefits of opening these data sources up far outweigh the risks of misuse – while some users will inevitably overstate confidence in their findings, we argue that’s better than the alternative of having no – or very little – data at all.  We’re also hard at work publishing tools, documentation on appropriate use, and extensive metadata to mitigate these risks.  Finally, we retain all extracts users request for the purposes of research replication, an important consideration in this somewhat uncharted terrain.

About the author: Daniel Runfola

Dan's research focuses on the use of quantitative modeling techniques to explain the conditions under which aid interventions succeed or fail using spatial data. He specializes in computational geography, machine learning, quasi-experimental analyses, human-int data collection methods, and high performance computing approaches. His research has been supported by the World Bank, USAID, the Global Environment Facility, and a number of private foundations and donors. He has published 34 books, articles, and technical reports in support of his research, and is the Project Lead of the GeoQuery project. At William and Mary, Dr. Runfola serves as the director of the Data Science Program, and advises students in the Applied Science Computational Geography Ph.D. program.