Network under heavy load

Incident Report for SNworks

Postmortem

SNworks attempted a system upgrade early this morning. While the database updates occurred without incident, the switchover to the new server node did not.

We quickly realized that we were not able to get the new servers responding properly and made the decision to roll back the upgrade. SNworks maintains parallel infrastructure during upgrades, so it was a simple matter to switch the network endpoints back.

Unfortunately, because of the time that had passed, our caches were cold and the entire system needed to warm them back up. Couple this with the dramatic increase in traffic we’ve seen over the last week and you have a recipe for failure.

We were able to quickly expand our server fleet to take some of the excess load and things returned to normal. Total downtime is estimated to be less than 20 minutes.

We sincerely apologize for any trouble this issue caused this morning and are holding this upgrade back until we can determine exactly what happened.

Posted Mar 18, 2020 - 11:32 EDT

Resolved

This incident has been resolved.

Posted Mar 18, 2020 - 11:17 EDT

Monitoring

Cache systems are properly handling load now. We're continuing to monitor the infrastructure.

Posted Mar 18, 2020 - 11:10 EDT

Identified

SNworks was forced to reverse an upgrade this morning, resulting in heavy load on our infrastructure. We're working to mitigate as quickly as possible.

Posted Mar 18, 2020 - 10:50 EDT

This incident affected: Legacy Front End and Guides Application.