System-wide outage
Incident Report for SNworks
Postmortem

First, I personally apologize for the outage on September 6th. Front-end sites were unavailable for about 20 minutes, and CEO was unavailable for about 30 minutes. While some downtime is unavoidable, an outage of more than 5 minutes is unacceptable. We will do better.

The primary cause of the outage was a misconfiguration in our automated deployment system. The error caused a cascading failure across the entire front-end and CEO server fleet. Because replacement nodes are provisioned from the same configuration source, the misconfiguration propagated to each new server node as the system attempted to auto-recover.

While we were able to correct the error quickly, the server infrastructure was stuck in a boot, configure, fail, reboot loop. To break the cycle, we forced the entire fleet into a “cold start.” The front-end servers recovered from the cold start in about 3.5 minutes; the CEO nodes took about 10 minutes longer.

Correction and Mitigation

First, we corrected the configuration error. We are also adding checks to our build and deploy pipeline that will flag errors like this one before they propagate; a rough sketch of the idea follows.
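To give a sense of what such a check looks like, here is a minimal sketch of a pre-deploy configuration gate. This is an illustration only, not our actual tooling: the file name (deploy.json), the required keys, and the validation rules are all hypothetical.

    #!/usr/bin/env python3
    """Hypothetical pre-deploy configuration gate.

    Sketch only: the file name, required keys, and rules below are
    illustrative assumptions, not SNworks' actual deploy tooling.
    """
    import json
    import sys

    # Hypothetical keys every deploy config must define.
    REQUIRED_KEYS = {"app_env", "db_host", "cache_nodes"}

    def validate(path: str) -> list[str]:
        """Return a list of problems; an empty list means the config passes."""
        try:
            with open(path) as f:
                config = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            return [f"could not load {path}: {exc}"]
        if not isinstance(config, dict):
            return [f"{path} must contain a JSON object"]

        problems = []
        missing = REQUIRED_KEYS - config.keys()
        if missing:
            problems.append(f"missing required keys: {sorted(missing)}")

        # An empty node list would leave auto-recovered servers with
        # nothing to join, so treat it as a hard failure.
        if "cache_nodes" in config and not config["cache_nodes"]:
            problems.append("cache_nodes must list at least one node")

        return problems

    if __name__ == "__main__":
        errors = validate(sys.argv[1] if len(sys.argv) > 1 else "deploy.json")
        for error in errors:
            print(f"CONFIG ERROR: {error}", file=sys.stderr)
        sys.exit(1 if errors else 0)  # a non-zero exit blocks the deploy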

We are also mitigating the impact of similar issues by moving clients into auto-scaling “pods,” so an issue with a single deployment does not impact all clients. The sketch after this paragraph illustrates the idea.
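For illustration, here is a minimal sketch of why pods contain this kind of failure. It assumes a hypothetical pod-to-client mapping and stubbed deploy and health-check functions; none of this is our actual orchestration code. The point is that a rollout proceeds one pod at a time and halts on the first failed health check, so clients in untouched pods keep running the previous, working build.

    """Hypothetical pod-by-pod rollout.

    Sketch only: pod names, client mapping, and the deploy/health_check
    stubs are illustrative, not SNworks' actual orchestration code.
    """

    # Each pod serves an isolated slice of clients, so a bad deploy can
    # only affect the pods it has already reached.
    PODS = {
        "pod-a": ["client-1", "client-2"],
        "pod-b": ["client-3", "client-4"],
        "pod-c": ["client-5", "client-6"],
    }

    def deploy(pod: str) -> None:
        """Stub: would push the new build to every node in one pod."""
        print(f"deploying to {pod} ({len(PODS[pod])} clients)")

    def health_check(pod: str) -> bool:
        """Stub: would poll the pod's health endpoint after the deploy."""
        return True

    def rollout() -> None:
        for pod in PODS:
            deploy(pod)
            if not health_check(pod):
                # Halt here: the remaining pods keep the old, working build.
                print(f"{pod} failed its health check; halting rollout")
                break

    if __name__ == "__main__":
        rollout()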

Again, I apologize for the trouble and outsize impact this may have had on your publication. If you have any questions or comments, please do not hesitate to email me directly at mike@getsnworks.com or SNworks in general at howdy@getsnworks.com.

~mike
09/06/2023

Posted Sep 06, 2023 - 13:03 EDT

Resolved
All sites and CEO became unresponsive at approximately 11:00 AM on Wednesday, September 6th, 2023. Postmortem to come shortly.
Posted Sep 06, 2023 - 11:00 EDT