GovernorHub ran into a few stumbling blocks back in March. Users might have noticed that the system ran more slowly than usual or, worse, for a handful of people, that it fell over altogether (technical term!).
The GovernorHub team wanted to let you know what actually happened for those two weeks in March, why it happened and what they're doing about it.
Get ready for words like ‘heavy load’, ‘front end requests’ and ‘error clogging’ as we get the lowdown from GovernorHub lead developer, Kyle Selman.
I’m afraid pretty much anyone who used GovernorHub at peak times during this two-week period was affected. Due to the nature of the problem, the issues ramped up with an influx of users (heavy load), which meant that problems occurred just after lunch and again at 6pm, when most governing board meetings take place.
Our users would have found that pages loaded very slowly - and what do you do when a page takes ages to load? You keep refreshing it, which adds even more load to the system. We ended up with a snowball effect: once someone was affected, the problem grew as people refreshed the page over and over again.
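The snowball effect described above can be sketched with a toy model. This is only an illustration under made-up assumptions, not GovernorHub's actual code or figures: the capacity, arrival rate and refresh factor are hypothetical numbers, and `simulate_backlog` is a name invented for this example.

```python
def simulate_backlog(arrivals_per_tick, capacity, refresh_factor, ticks):
    """Toy model of a service under load (all numbers are hypothetical).

    Each tick, new requests arrive and the service answers up to
    `capacity` of them. Every request left unanswered prompts extra
    refresh requests, in proportion to `refresh_factor`.
    Returns the backlog size at the end of each tick.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick          # new page loads arrive
        backlog -= min(backlog, capacity)     # the service answers what it can
        backlog += int(backlog * refresh_factor)  # impatient users refresh
        history.append(backlog)
    return history


# Below capacity, nothing queues up, so nobody needs to refresh.
print(simulate_backlog(90, 100, 0.5, 5))   # [0, 0, 0, 0, 0]
# Slightly over capacity with no refreshing: the backlog grows steadily.
print(simulate_backlog(110, 100, 0.0, 5))  # [10, 20, 30, 40, 50]
# Same load, but half of the waiting requests trigger a refresh.
print(simulate_backlog(110, 100, 0.5, 5))  # [15, 37, 70, 120, 195]
```

In this sketch, being only 10% over capacity is enough for the backlog to grow, and refreshing makes it grow nearly four times as large after five ticks.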
It’s actually very tricky with this kind of slowdown. GovernorHub is now a very complicated system that’s running at a large scale. When you see this kind of problem, you first start to look at the logging systems to see what kind of errors users are running into. What we saw was that the system was struggling to serve requests for resources (for things like bringing up a noticeboard post) and there was some sort of overload in the system. When something comes out of the blue like this, it’s usually quite a complex problem to solve and there are often quite a few red herrings to catch you out along the way.
Well, the way GovernorHub works is that we’ve essentially got lots of different services - let’s call them different computers. Each computer has a different job to do. One might be dealing with all of the processing that comes with governor training booking, for example. All of these computers need to be able to talk to each other. Sometimes you get bottlenecks - if you ask one computer (service) for some information (a front-end request) and it’s already busy doing other work, it’s going to take longer to get that information back.
The more people that start using a service, the more chance there is of it getting clogged up. It then starts to slow down, which has a knock-on effect on other parts of the system.
We’ve been gradually moving features over from older systems to newer ones and often adding new features. However, this means more requests. So where a certain page load might have needed, say, 10 requests in the past, that’s risen to 15.
The icing on the cake in March was having the busiest period we’ve ever had (1 in 2 governors in the country visited GovernorHub at some point during this period). More users than ever before will stress test a system further. Every system has an imaginary hill, as I like to describe it, where the system might fall down due to heavy load but it’s always unclear where that hill is until you reach it. In March, we hit our hill because of the rise in users and also the rise in requests.
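To see why more users and more requests per page compound each other, here is a rough back-of-the-envelope calculation. The 10 and 15 requests per page load are the illustrative figures quoted above; the page-load counts (1,000 and 1,300) are invented purely for this example and are not real GovernorHub traffic numbers.

```python
def total_requests(page_loads, requests_per_page):
    """Total requests the services have to answer for a given number of page loads."""
    return page_loads * requests_per_page


# Hypothetical "before" and "after" figures, for illustration only.
before = total_requests(page_loads=1000, requests_per_page=10)
after = total_requests(page_loads=1300, requests_per_page=15)

growth = after / before
print(before, after)                      # 10000 19500
print(f"{growth:.2f}x the load")          # 1.95x the load
```

In this made-up scenario, 30% more page loads combined with 50% more requests per page nearly doubles the total load, because the two increases multiply rather than add.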
We use Google Cloud Platform for all of our logging and it’s got some really useful metrics about our services which gave us some good graphs to illustrate the issues.
The grey peaks in the image (above) illustrate the number of requests we received and the red line of peaks illustrates the errors where things were getting bottlenecked. As you can see, we were gradually able to reduce the overall number of requests and also the number of errors to a completely manageable level.
The root cause was some old code in a variety of places that wasn’t doing what it should - it was adding unnecessary requests to the system, so we went through to explore each one and update anything that wasn’t working as it should.
Well, I find it hard to sleep when there’s a problem until I’ve solved it. You end up completely messing up your sleep cycle, so that did happen to some extent as I tried to get to the bottom of the issues.
No. I never sleep like a baby. That’s just the nature of software engineering. Your mind probably works in a certain way and you can’t easily rest if something isn’t working. There is always work to do to improve the system.
We’re still working to scale the system effectively. We’re always growing and there are always new things to add, so I can’t make any absolute promises, but our users can rest assured that we’re working really hard to develop the system to its full potential and we’ll do our best to keep that imaginary hill well out of sight.