Unplanned Outage for TranscribersOfReddit
Incident date: 2021/08/12
Summary:
u/TranscribersOfReddit was unresponsive for approximately 12.5 hours, bringing all volunteer work to a halt. General instability on Reddit.com triggered a cascading failure: a submission was not created properly in Blossom, and the edge-case code designed to catch this situation was not set up correctly.
User Impact:
Complete stoppage of work; no transcriptions were completed during the outage.
Incident Response Analysis:
At approximately 3:45am UTC, a volunteer reported that u/TranscribersOfReddit (the bot) was not responding to a message. At roughly 4am UTC, a second volunteer raised similar concerns. This was concerning because Bugsnag, our error reporting tool, had correctly alerted us to the Reddit instability an hour earlier, but did not report the repeated crashes that began during this window. Using our ability to inspect server logs through Slack, we observed the bot starting, pausing for a moment, crashing immediately, and then restarting; no other information was available. The subreddit was closed after it became apparent that the issue would not resolve itself, though this did not happen until nearly eleven hours after the issue was first reported. The delay was largely due to a deficiency in our incident response handbook, which has since been remedied.
Roughly twelve hours after the issue was first reported, we began debugging in production and isolated the problem to a specific process. A fix was written approximately fifteen minutes later and immediately deployed to production.
Post-Incident Analysis:
A number of factors contributed to this outage, with varying degrees of fixability. For example, staffing issues are generally considered unfixable in this context because we are a volunteer organization, while tooling issues are generally considered fixable.
- Bugsnag did not alert us to the crashes
  - The repeated crashes should have generated pings to the #botstuffs channel on Slack and emails; no error reports were generated (see the sketch after this list)
- No staff availability (not addressable)
- Inadequate testing failed to catch the edge-case instability
- Communication about the incident was slow; volunteers were still attempting transcriptions well into the outage period
- Existing tooling (remote log viewing and deployment systems) aided tremendously in rapid identification and resolution, but was not enough to pinpoint the root cause without debugging in production
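
One concrete takeaway from the Bugsnag failure: the bot should report fatal crashes to Bugsnag explicitly rather than relying on the default exception hook, which a supervisor-driven restart loop can bypass. Below is a minimal sketch in Python, assuming a hypothetical run_bot() entry point; the real project's configuration and API key handling live elsewhere.

```python
import bugsnag

# Assumed configuration; the real API key and settings live in the project's config.
bugsnag.configure(
    api_key="placeholder-api-key",
    release_stage="production",
)


def run_bot():
    """Hypothetical stand-in for u/TranscribersOfReddit's main processing loop."""
    ...


def main():
    try:
        run_bot()
    except Exception as exc:
        # Report explicitly instead of relying on sys.excepthook, which can be
        # bypassed when a supervisor restarts the process after a crash.
        bugsnag.notify(exc, severity="error")
        raise  # preserve the crash so the supervisor still restarts the bot


if __name__ == "__main__":
    main()
```

Re-raising after notify() keeps the existing restart behavior while guaranteeing that every crash produces an error report.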
Timeline:
All times in UTC.
- 3:45: issue raised
- 3:50: issue acknowledged, manual triage begins
- 4:07: issue shown affecting others
- 7:11: announcement made on Discord
- 12:49: suggestion made to close sub
- 14:39: subreddit closed with notice of issue directing people to Discord
- 16:00: debugging begins
- 16:17: fix deployed
- 16:23: subreddit reopened
Contributing factors:
- Tooling failure
- Documentation inadequacy
- Lack of adequate testing for edge cases
- Staffing availability
- Outdated procedure for dealing with outages
Lessons Learned:
- Just because the tooling is automated doesn't mean that it will always work.
- Procedure documentation has an expiration date. In our case, its expiration date was approximately one complete system and 3000 volunteers ago.
- Developers and staff without production access still need as much visibility as possible when isolating issues -- the more hands and minds coming up with possible ideas, the faster we solve problems.
- Just because it works the first time doesn't mean it's actually right; the edge case passed initial testing because the test data string was too short, and the issue only surfaced when the improper data exceeded the database column's maximum length (see the sketch below).
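
A minimal sketch of the boundary test that would have caught this, assuming a hypothetical MAX_TITLE_LENGTH column limit and a validate_submission_data() helper; both names are illustrative, not the project's actual API.

```python
import pytest

# Hypothetical column limit; the real value lives in Blossom's schema.
MAX_TITLE_LENGTH = 300


def validate_submission_data(title: str) -> str:
    """Illustrative validator: reject data that would overflow the column."""
    if len(title) > MAX_TITLE_LENGTH:
        raise ValueError("title exceeds database column length")
    return title


@pytest.mark.parametrize(
    "length",
    [1, MAX_TITLE_LENGTH, MAX_TITLE_LENGTH + 1, MAX_TITLE_LENGTH * 10],
)
def test_validation_at_column_boundary(length):
    title = "x" * length
    if length <= MAX_TITLE_LENGTH:
        assert validate_submission_data(title) == title
    else:
        with pytest.raises(ValueError):
            validate_submission_data(title)
```

Parametrizing at, just over, and well past the column limit exercises the overflow path instead of only short, well-formed strings.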
Action items:
- Update procedure documentation
- Investigate why Bugsnag did not fire
- Investigate any additional tooling that can be added to Slack through existing chatbots
- Expand logging in u/ToR
- Investigate and potentially expand logging on other services to match