Revolution #10: System Resilience

One ongoing challenge we face is preparing for large amounts of system demand – which can be very spiky.  Because of the system design of GeoQuery, we effectively have three potential points of failure (a rough health-check sketch follows the list):

  • Front-end server – The website people see (geoquery.org and the data front-ends).
  • Database server – What the front-end queries to (for example) display boundary files and the datasets available to a user.
  • Backend (HPC) – Where jobs are actually run.
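To make that dependency chain concrete, here's a minimal health-check sketch in Python. The database hostname, port, and scheduler command below are hypothetical placeholders (only geoquery.org comes from this post) – it's an illustration of the three tiers, not our actual monitoring.

```python
# Minimal sketch of a three-tier health check. Hostnames, ports, and the
# scheduler command are hypothetical placeholders, not GeoQuery's real setup.
import socket
import subprocess
import urllib.request

def check_frontend(url="https://geoquery.org"):
    """Front-end server: can users reach the website at all?"""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

def check_database(host="db.example.internal", port=5432):
    """Database server: can the front end open a connection to it?"""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def check_backend(queue_cmd=("qstat",)):
    """Backend (HPC): does the batch scheduler respond to a status query?"""
    try:
        return subprocess.run(queue_cmd, capture_output=True, timeout=15).returncode == 0
    except Exception:
        return False

if __name__ == "__main__":
    for name, ok in [("front end", check_frontend()),
                     ("database", check_database()),
                     ("backend", check_backend())]:
        print(f"{name:10s}: {'up' if ok else 'DOWN'}")
```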

When we talk about GeoQuery outside of our technical group, it’s frequently assumed our primary limitation is our back-end processing infrastructure.  We talk about the number of processor hours we’ve used (about 30 years’ worth over the last couple of months), memory (up to 256 gigs for some jobs), and the terabytes of disk space we occupy (~100, all told), but what really matters is peak demand.  Looking at some of our nodes right now:

where each blank dot represents a core we’re not using, and letters/colors represent a running job. So, if you’re up at around 6 AM EST (as I write this sentence) on Friday, December 2nd, the back-end won’t be what holds you up.  Try again on Wednesday, December 7th, when I’m presenting GeoQuery in training sessions at the World Bank and the GEF, and you may be waiting a day or two for your results as dozens of folks get ahead of you in line.
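If you haven't seen a node map like that before, here's a purely illustrative sketch of how that dot-and-letter view can be built from per-core job assignments. The node data is made up; a real cluster would pull this from its scheduler.

```python
# Illustrative rendering of a node/core occupancy map, in the spirit of the
# figure above. The node data is invented; a real cluster would get this
# from scheduler output (e.g. pbsnodes/qstat).

# Each node maps its cores to either None (idle) or a single-letter job ID.
nodes = {
    "node01": ["A", "A", "A", "A", None, None, "B", "B"],
    "node02": [None] * 8,
    "node03": ["C", "C", "C", "C", "C", "C", "C", "C"],
}

def render(nodes):
    for name, cores in sorted(nodes.items()):
        row = "".join(core if core else "." for core in cores)  # "." = idle core
        used = sum(c is not None for c in cores)
        print(f"{name}  [{row}]  {used}/{len(cores)} cores busy")

render(nodes)
# node01  [AAAA..BB]  6/8 cores busy
# node02  [........]  0/8 cores busy
# node03  [CCCCCCCC]  8/8 cores busy
```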

So, while we can be core-constrained (and frequently are), it’s not a 24/7 issue. We don’t want to buy hundreds of cores that sit idle 12 hours a day waiting for a job; on the flip side, we don’t want to buy a tremendously capable front-end server just to have 10 visitors on it at any given moment.

“Why not the cloud?” you say.  Unfortunately, our processing/compute jobs are expensive to run – memory requirements can reach 256 gigs, and processing times can run into the hours; if we started paying AWS (or another provider) for our infrastructure, we would have to shut down in a month.  Maybe a week.
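For a rough sense of scale, here's a back-of-envelope sketch. The instance size and hourly rate are assumptions for illustration, not quoted AWS prices; the compute volume is loosely the "30 years of processor hours over the last couple of months" mentioned above.

```python
# Back-of-envelope cloud cost estimate. The instance size and hourly rate are
# assumptions for illustration, not actual AWS pricing.

YEARS_OF_CPU_TIME = 30                      # processor-hours used, expressed in years
CORE_HOURS = YEARS_OF_CPU_TIME * 365 * 24   # ~262,800 core-hours
CORES_PER_INSTANCE = 32                     # assumed size of a 256 GB memory-optimized node
INSTANCE_HOURLY_RATE = 3.00                 # assumed $/hour -- purely illustrative

instance_hours = CORE_HOURS / CORES_PER_INSTANCE
cost = instance_hours * INSTANCE_HOURLY_RATE
print(f"~{instance_hours:,.0f} instance-hours -> roughly ${cost:,.0f}")
# Under these assumptions, a couple of months of compute lands in the
# tens of thousands of dollars -- untenable for a free service.
```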

We’re constantly working to bring more back-end capacity online to offset these peak challenges, but the downside of being able to offer these services for free is that, in addition to other GeoQuery users, we’re also sharing our nodes with various departments here at William and Mary.  That’s right – your spatial data analysis is running right alongside a physics simulation of neutrino particles I couldn’t even begin to name.  So, while we have some dedicated capacity, the majority of our additional capacity is being rolled out in these shared environments.
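For a sense of what "shared environment" means in practice, here's a hypothetical sketch of submitting a GeoQuery-style job to a PBS/Torque-style shared queue. The queue name, resource requests, and script path are placeholders, not our production configuration – the point is simply that our jobs wait in the same line as everyone else's.

```python
# Hypothetical sketch of submitting a GeoQuery-style extract job to a shared
# PBS/Torque queue. Queue name, resource requests, and script path are
# placeholders, not our production configuration.
import subprocess

job_script = """#!/bin/bash
#PBS -N geoquery_extract
#PBS -q shared            # same queue other departments' simulations land in
#PBS -l nodes=1:ppn=16    # cores requested
#PBS -l mem=256gb         # large-memory request for big rasters
#PBS -l walltime=04:00:00
python /path/to/extract_job.py
"""

# qsub reads the job script from stdin and prints the assigned job ID.
result = subprocess.run(["qsub"], input=job_script, text=True, capture_output=True)
print("submitted as:", result.stdout.strip())
```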

The short of it?  Next time your process request takes 4 hours, blame a physicist – but know we’re working on it!

About the author: Daniel Runfola

Dan's research focuses on the use of quantitative modeling techniques and spatial data to explain the conditions under which aid interventions succeed or fail. He specializes in computational geography, machine learning, quasi-experimental observational analyses, human-int data collection methods, and high-performance computing approaches. His research has been supported by the World Bank, USAID, the Global Environment Facility, and a number of private foundations and donors. He has published 34 books, articles, and technical reports in support of his research, and is the Project Lead of the GeoQuery project. At William and Mary, Dr. Runfola serves as the director of the Data Science Program and advises students in the Applied Science Computational Geography Ph.D. program.