First and foremost - we are sorry. The team worked hard all weekend and into Monday to keep services up and running, but we simply couldn't locate the root causes quickly enough. We let you down.
Starting Friday (4/24) at around 7pm, we noticed a traffic spike. By about 8pm that spike had overwhelmed the core server cluster, and the traffic was still growing. Our servers have a reliable method for detecting system errors and triggering a reboot. Unfortunately, every running server, sensing that it was operating at 1,000% of its nominal capacity, decided to restart at the same time - triggering the first outage. As the servers rebooted, each coming back online at a different moment, the first server up would begin to serve all of the traffic. That drove it back to 1,000s of times its normal capacity, which, of course, caused it to reboot - putting the servers into a never-ending cycle of restarting.
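A toy model makes it clear why that cycle can never converge on its own. The numbers and function names below are ours, invented for illustration - they are not our production values:

```python
# Toy model of the restart loop: total traffic far exceeds cluster capacity,
# and any server pushed past its per-server limit reboots (goes offline).
# Servers come back at different moments, so whichever few are online at any
# instant absorb all the traffic, overload immediately, and reboot again.

def share_per_server(total_traffic, servers_online):
    """Traffic each online server must absorb, split evenly."""
    return total_traffic / servers_online if servers_online else float("inf")

def will_reboot(total_traffic, servers_online, per_server_capacity):
    """A server reboots whenever its share exceeds its capacity."""
    return share_per_server(total_traffic, servers_online) > per_server_capacity

# Hypothetical numbers: 10 servers, 100 req/s each, 10,000 req/s incoming.
TRAFFIC, CAPACITY, CLUSTER = 10_000, 100, 10

# Even with the whole cluster online, the load is 10x capacity...
assert will_reboot(TRAFFIC, CLUSTER, CAPACITY)
# ...and the cycle repeats no matter how many machines are back so far:
for online in range(1, CLUSTER + 1):
    assert will_reboot(TRAFFIC, online, CAPACITY)
```

This is why the fix was to bring every server back at once (and add capacity), rather than letting them trickle back online one at a time.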
Finally, around 8:30pm, we disabled our primary load balancer to allow all servers to come back online at once. We also added several very large, high-capacity servers to the mix to handle the additional traffic. This seemed to solve the problem - for a bit, at least.
Saturday started without issue, but around 1pm traffic started to kick back up. It again overwhelmed the core servers, but the high-capacity servers we added the previous day were able to handle the extra traffic.
Fortunately, we also had the processed report logs from the previous day's incident. From those we were able to see that, at one point, over 85% of all of our traffic was coming from a single IP address. Tracing that address back to its source, we determined that it was an over-active search indexer at one client's university. Since it was Saturday afternoon and we were unable to reach their campus IT staff to stop the search indexer, we were left with one option - block the indexer.
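A finding like that falls out of a simple per-IP tally over the processed logs. Here's a minimal sketch of the idea - the log format and the addresses are fabricated for illustration, not taken from our systems:

```python
from collections import Counter

def top_talkers(log_lines, threshold=0.85):
    """Return (ip, share) pairs for IPs above `threshold` of all requests.

    Assumes each log line starts with the client IP, as in common/combined
    log formats; real log parsing would need to match the actual format.
    """
    hits = Counter(line.split()[0] for line in log_lines if line.strip())
    total = sum(hits.values())
    return [(ip, n / total) for ip, n in hits.most_common() if n / total > threshold]

# Fabricated sample: one indexer IP dominates the traffic.
sample = ["10.0.0.7 GET /article/123"] * 90 + ["192.0.2.5 GET /"] * 10
print(top_talkers(sample))  # -> [('10.0.0.7', 0.9)]
```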
Blocking it on both our front-end cache server and the core web servers seemed to do the trick. Everything calmed right back down - even to the point that we were able to shut down the high-capacity servers.
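In spirit, the block is a one-line rule at each layer: refuse the offending address before any real work happens. The sketch below shows the idea as a WSGI-style wrapper - the address and names are hypothetical, and our actual block lived in the cache and web server configuration, not application code:

```python
BLOCKED_IPS = {"10.0.0.7"}  # the runaway indexer's address (hypothetical value)

def ip_block_middleware(app):
    """Wrap a WSGI app so blocked clients are rejected before it runs."""
    def wrapped(environ, start_response):
        client = environ.get("REMOTE_ADDR", "")
        if client in BLOCKED_IPS:
            # Refusing here is cheap: no page render, no database query.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapped
```

Applying the same rule at the cache layer matters because it stops the flood before it ever reaches a web server.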
We felt confident that the problem had been identified and at least stopped for now. We continued to monitor the infrastructure all afternoon and late into the evening looking for spikes. There were no significant spikes for 24 hours.
You can then imagine our surprise when a handful of sites started to report connectivity issues. Collating the reports, we found one commonality - all of the affected sites shared a database server (of which we have three). Checking the server, we found it was at 600% of its normal capacity, and at the very limit of its ability to function.
It appeared, initially, that a traffic spike had caused a few servers to start up to deal with the load and then shut down without properly closing their connections to that database. The remaining "sleeping" connections were consuming valuable resources. After failing to shut those connections down manually, we ended up rebooting the system.
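Identifying those leftover connections amounts to scanning the database's connection list for entries that are asleep and have been idle too long. A sketch of that filter, using a made-up snapshot shaped like MySQL's SHOW PROCESSLIST output (the field names and threshold are illustrative, not our production values):

```python
def stale_sleepers(processlist, max_idle_seconds=300):
    """Pick out connections that are asleep and have been idle too long.

    Each row is (connection_id, command, seconds_idle), loosely mimicking
    what something like MySQL's SHOW PROCESSLIST reports.
    """
    return [conn_id for conn_id, command, idle in processlist
            if command == "Sleep" and idle > max_idle_seconds]

# Fabricated snapshot: two sleepers left behind by servers that shut down
# without closing their connections, plus one legitimately active query.
snapshot = [(101, "Sleep", 4200), (102, "Query", 3), (103, "Sleep", 3900)]
print(stale_sleepers(snapshot))  # -> [101, 103]
```

In our case, killing the flagged connections one by one didn't work, which is what forced the reboot.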
This appeared to solve the problem. The server returned to normal and we continued to monitor it throughout the afternoon and evening, without issue. While it seems odd that we'd experience two issues so close together, the two were completely unrelated.
Again, we were satisfied that the issue was resolved.
By 8:30am Monday morning, our alarm systems were going off again. The servers were being overwhelmed, unable to cope with the traffic. That problem database from the previous day was under extremely heavy strain again.
The problem was, there wasn't the traffic to back it up. It was a normal Monday - yes, a few clients had a bit more traffic than normal, but a few others had a bit less. It usually all levels out on average.
We simply couldn't figure out the problem. In the end, we decided to start up the high-capacity servers again. That solved the problem of the servers crashing and restarting, but left us with one very overworked database. We split the clients on that database across an additional database server, and that seemed to calm things down quite a bit.
Continuing to monitor the systems and view logs as they were processed (side note: processed logs are viewable after an hour; it's not possible to view them in real time), we discovered two things. The university search indexer from Friday and Saturday was continuing to hammer away - though it wasn't getting any data. And one site was consuming an inordinate amount of database resources while serving far too little traffic to account for that usage.
Poking around on the affected site led to a bit of a chance discovery.
Preview links. Those URLs Gryphon generates when you want to view an article before it's published. They have a very specific trick - in addition to allowing you to see unpublished content, they also set a cookie that the cache server recognizes and immediately passes your request directly to the web server. So, after clicking that link, you are viewing your site completely uncached.
The cache server is super important. Without it, our web servers can serve 10s of requests per second. With it, they can serve 1,000s of requests per second.
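The mechanics of the preview cookie can be sketched in a few lines. This is our own simplified model of the decision the cache front end makes - the cookie name and function names are hypothetical, not Gryphon's actual code:

```python
def serve(request_cookies, cache, url, fetch_from_origin):
    """Minimal model of the cache front end's decision.

    A recognized preview cookie sends the request straight to the web
    server (and the database behind it); everything else is served from,
    and fills, the cache.
    """
    if "preview_bypass" in request_cookies:   # cookie name is hypothetical
        return fetch_from_origin(url)         # uncached: hits the database
    if url not in cache:
        cache[url] = fetch_from_origin(url)   # first visitor warms the cache
    return cache[url]                         # cached: origin untouched
```

One visitor carrying that cookie costs an origin hit on every page view, which is exactly how a handful of shared preview links translated into database strain.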
Large news events + uncached pages = servers not being able to handle traffic demands.
Clicking around on that client site, we noticed several of the links were preview URLs for articles (they look like this: http://statenews.com/article/preview/XXXX-XXXX-XXXX-XXXX). That meant that anyone who clicked on one of those links had an uncached view of the site. (Because they weren't logged into Gryphon, though, they couldn't see unpublished content.) Then every subsequent view by that person was also uncached.
The end result of all of those uncached page views? One very overworked database server.
We quickly updated the URLs for the client site and changed the cookie name that bypasses the cache server. That took care of the first problem - now everyone was back to cached pages.
That left another problem: how to stop this from happening again. We quickly updated Gryphon to generate single-use, unique preview URLs - ones that, if shared, would be completely unusable. It works like this:
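The flow, sketched under assumed names (this is our simplified illustration, not Gryphon's actual implementation): the preview link carries a random one-time token; the first visit consumes the token, sets the cache-bypass cookie, and redirects to the article's normal URL; any later visit with the same token fails.

```python
import secrets

_tokens = {}  # token -> article id; a real system would persist this

def make_preview_url(article_id):
    """Issue a one-time preview link for an unpublished article."""
    token = secrets.token_urlsafe(16)
    _tokens[token] = article_id
    return f"/article/preview/{token}"

def visit_preview(token):
    """First visit consumes the token, sets the cache-bypass cookie, and
    redirects to the article's normal URL; any later visit fails."""
    article_id = _tokens.pop(token, None)  # pop = token is spent forever
    if article_id is None:
        return ("404 Not Found", None)
    return (f"redirect to /article/{article_id}", "set bypass cookie")
```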
That final URL is now safe to share - feel free to copy and paste it anywhere (remembering that only logged-in Gryphon users can see unpublished content). The intermediary URL that bypasses the cache server works exactly once, so even if it's accidentally shared, it simply won't work a second time - eliminating the problem.
The first problem from Friday, though, is a bit trickier. It requires identifying the source of high traffic volumes and stopping the flood. That generally requires a human and can't easily be done in software, so we're left trying to absorb the traffic spike until a human can step in and stem the flow.
First, we're increasing the capacity of our "load based" servers. These are the machines that kick on automatically in times of high traffic. We're also increasing the time they stay alive after they've detected that traffic has slowed.
Next, we're continuing to normalize our database loads. By analyzing traffic and resource usage we're better able to pair lower volume clients with higher volume ones - evening out traffic loads across the system. It's an ongoing process we started some time ago and are always fine-tuning.
The SNworks Operations Team
April 28, 2015