At approximately 8:15pm EST, CEO suffered an outage related to its primary database. Engineers were notified by automated alerting at 8:21pm EST. The issue was identified and a fix was implemented. Service recovery began at approximately 8:45pm EST. A post-mortem follows below.
On the evening of February 17th, at about 8:15pm EST, the main CEO API, which powers both CEO2 and CEO3, suffered a severe outage. Service began recovering around 8:45pm EST and was fully restored by 9pm EST. In total, CEO incurred about 24 minutes of downtime.
Engineers were alerted to the outage within 5 minutes by automated alerting systems.
The core of the issue is that the main CEO database ran out of storage space. Normally, when available database storage drops below 20%, an automatic process resizes the storage to add capacity. All of this happens in the background with little, if any, service interruption.
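For illustration, here is a minimal sketch of that kind of background capacity check, assuming a periodic job that measures free space and asks the storage layer to grow once it drops below 20%. The data path and the request_storage_resize() helper are hypothetical placeholders, not CEO's actual tooling.

```python
import shutil
import time

RESIZE_THRESHOLD_PERCENT = 20    # resize once free space drops below this
CHECK_INTERVAL_SECONDS = 300     # how often the background job runs
DATA_PATH = "/var/lib/database"  # hypothetical data directory


def free_storage_percent(path: str) -> float:
    """Return free space at `path` as a percentage of total capacity."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def request_storage_resize() -> None:
    """Placeholder for the provider-specific call that grows the volume."""
    print("storage resize requested")


def capacity_watchdog() -> None:
    """Check capacity on an interval and resize when it falls below 20%."""
    while True:
        if free_storage_percent(DATA_PATH) < RESIZE_THRESHOLD_PERCENT:
            request_storage_resize()
        time.sleep(CHECK_INTERVAL_SECONDS)
```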
The issue last night was caused by a required security certificate update for database connectivity. While database storage is being resized, no other modifications can be made. So when the database attempted to expand its storage capacity, the certificate attempted to update itself at the same time, resulting in a failed update.
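The sketch below illustrates that constraint, assuming the database allows only one modification at a time. The lock and the run_maintenance() helper are illustrative; the real database enforces this restriction itself, which is why the overlapping certificate update failed.

```python
import threading
import time

# One lock per database: only one maintenance operation may run at a time.
maintenance_lock = threading.Lock()


def run_maintenance(name: str, operation) -> bool:
    """Run `operation` only if no other maintenance task is in progress."""
    if not maintenance_lock.acquire(blocking=False):
        print(f"{name}: another modification is in progress, try again later")
        return False
    try:
        operation()
        print(f"{name}: completed")
        return True
    finally:
        maintenance_lock.release()


if __name__ == "__main__":
    # Simulate the overlap: a slow storage resize is in flight when the
    # certificate update tries to run, so the second change is turned away.
    resize = threading.Thread(
        target=run_maintenance, args=("storage resize", lambda: time.sleep(2))
    )
    resize.start()
    time.sleep(0.5)
    run_maintenance("certificate update", lambda: None)
    resize.join()
```

If both changes were serialized this way, the second caller would simply be asked to retry rather than colliding with the in-flight resize.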
Since the primary and secondary are exact copies of each other, the same issue impacted both databases at the same time.
Upon identifying the core issue, we determined a back-channel method for forcing the database to resize without requiring further changes. Once the primary was recovered, we re-enabled connections and recovered the secondary.
We’ve updated our storage capacity alerting to provide progressively “louder” warnings when database storage drops below 20%, then 15%, then 10%.
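As a rough sketch of how tiered thresholds like these behave, the helper below maps the current free-space percentage to an alert severity. The severity names and sample values are illustrative only, not our actual alerting configuration.

```python
# Thresholds checked from most to least severe; severity names are illustrative.
THRESHOLDS = [
    (10, "critical"),
    (15, "major"),
    (20, "warning"),
]


def storage_alert(free_percent):
    """Return the severity to alert with, or None when storage is healthy."""
    for threshold, severity in THRESHOLDS:
        if free_percent < threshold:
            return severity
    return None


if __name__ == "__main__":
    for sample in (25, 18, 12, 8):
        print(f"{sample}% free -> {storage_alert(sample)}")
```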
If you have any specific questions, please do not hesitate to reach out to support@getsnworks.com.