Summary of impact:
From approximately 17:15 UTC on Thursday 5th January 2023 until 01:45 UTC on Friday 6th January 2023, a small number of email sends temporarily stalled before they were successfully completed.
Root Cause:
We made a change in our APAC (R3) region to improve email link tracking. We monitored this change after deployment and observed it to be running as desired. It was subsequently released to our European (R1) and North American (R2) regions. Our initial observations and monitoring verified the change was behaving as expected.
After those initial positive signs, we noticed our application logs were showing that our sending services had been periodically stopping and restarting, which had caused some sends to stall. We’re continuing our investigations into what caused these services to restart.
Mitigation:
The timeline (in UTC) for resolving this issue was:
- 22:00 - Our monitoring system alerted us to some email sends that hadn’t completed as expected and we raised an incident.
- 22:33 - We identified from our application logs that the stalled sends were being caused by our sending services periodically restarting.
- 22:40 - We rolled back the earlier change in our European region (R1) and observed improvements.
- 22:50 - We rolled back the earlier change in our North American (R2) and APAC (R3) regions and confirmed our sending services had returned to normal.
- From 23:00 Thu 5th Jan to 01:45 Fri 6th Jan: We worked on identifying the stalled sends and resuming them.
- 01:45 Fri 6th Jan - We had successfully resumed all stalled sends.
Next Steps:
To prevent a recurrence of this issue, we’ll:
- Improve alerting for when our sending services stop or restart, to help us detect issues sooner.
- Determine why the initial change caused our sending services to stop/restart.
- Roll out Application Performance Management (APM) software with anomaly detection. We’re in the final stages of a project to integrate our application with an APM tool that uses machine learning to detect anomalies in real time and alert us.
- Improve the risk-mitigation guidelines we follow to reduce the impact of outages.
- Refine the logic of our stalled-send monitoring tool to improve detection of potentially stalled sends.
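For illustration only, the following is a minimal Python sketch of the kind of checks that restart alerting and stalled-send detection involve. All names and thresholds here (`Send`, `find_stalled_sends`, `restarts_exceed_limit`, the 30-minute and 15-minute windows) are assumptions made for this example and do not describe the actual monitoring tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Send:
    """Hypothetical record of one email send."""
    send_id: str
    started_at: datetime
    completed_at: Optional[datetime] = None  # None while the send is still in progress


def find_stalled_sends(
    sends: List[Send],
    now: datetime,
    threshold: timedelta = timedelta(minutes=30),  # assumed threshold
) -> List[Send]:
    """Return sends that have not completed and started more than `threshold` ago."""
    return [
        s for s in sends
        if s.completed_at is None and now - s.started_at > threshold
    ]


def restarts_exceed_limit(
    restart_times: List[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed look-back window
    max_restarts: int = 2,                      # assumed tolerated restarts per window
) -> bool:
    """Return True if a service restarted more than `max_restarts` times within `window`."""
    recent = [t for t in restart_times if timedelta(0) <= now - t <= window]
    return len(recent) > max_restarts
```

In a sketch like this, a scheduled job would call both functions with the current time and raise an alert when `restarts_exceed_limit` returns `True` or `find_stalled_sends` returns a non-empty list. The thresholds are the main tuning choice: lower values detect issues sooner but risk false alarms on sends that are simply large or slow.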