Delays sending emails (All regions)
Incident Report for Dotdigital
Postmortem

Summary of impact:

From approximately 17:15 UTC on Thursday 5th January 2023 until 01:45 UTC on Friday 6th January 2023, a small number of email sends temporarily stalled before they were successfully completed.

Root Cause:

We made a change in our APAC (R3) region to improve email link tracking. We monitored this change after deployment and observed it to be running as desired. It was subsequently released to our European (R1) and North American (R2) regions. Our initial observations and monitoring indicated the change was behaving as expected.

After those initial positive signs, our application logs showed that our sending services had been stopping and restarting periodically, which caused some sends to stall. We’re continuing to investigate what caused these services to restart.
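As an illustration of how periodic restarts like these could be flagged automatically, here is a minimal sketch of a sliding-window restart counter. It is not our actual tooling: the class name `RestartRateMonitor`, its parameters, and the thresholds in the usage note are all hypothetical, and it assumes restart events can be extracted from application logs with a timestamp.

```python
from collections import deque
from datetime import datetime, timedelta


class RestartRateMonitor:
    """Hypothetical helper: flags a service that restarts too often
    within a sliding time window."""

    def __init__(self, max_restarts: int, window: timedelta):
        self.max_restarts = max_restarts  # restarts tolerated per window
        self.window = window
        self._events = deque()  # restart timestamps, oldest first

    def record_restart(self, at: datetime) -> bool:
        """Record a restart; return True if the threshold is exceeded."""
        self._events.append(at)
        # Drop restarts that have fallen out of the window.
        while self._events and at - self._events[0] > self.window:
            self._events.popleft()
        return len(self._events) > self.max_restarts
```

For example, with `max_restarts=2` and a 30-minute window (illustrative values only), a third restart within 30 minutes would return `True` and could be used to raise an alert.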

Mitigation:

The timeline (in UTC) for resolving this issue was:

  • 22:00 - Our monitoring system alerted us to some email sends that hadn’t completed as expected and we raised an incident.
  • 22:33 - We spotted the root cause from our application logs as being our sending services periodically restarting.
  • 22:40 - We rolled back the earlier change in our European region (R1) and observed improvements.
  • 22:50 - We rolled back the earlier change in our North American (R2) and APAC (R3) regions, and confirmed our sending services had returned to normal.
  • From 23:00 Thu 5th Jan to 01:45 Fri 6th Jan: We worked on identifying the stalled sends and resuming them.
  • 01:45 Fri 6th Jan - We had successfully resumed all stalled sends.

Next Steps:

To prevent a recurrence of this issue, we’ll:

  • Improve alerting for when our sending services stop/restart to help us detect issues sooner.
  • Determine why the initial change caused our sending services to stop/restart.
  • Roll out Application Performance Management (APM) software with anomaly detection. We’re in the final stages of a project to integrate our application with an APM tool that uses machine learning to detect anomalies in real time and alert us.
  • Improve the risk-mitigation guidelines we follow to reduce the impact of outages.
  • Refine the logic of our stalled-send monitoring tool to improve detection of potentially stalled sends.
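As a rough illustration of the stalled-send detection mentioned in the last step, the sketch below treats a send as stalled when it has not completed and has made no progress for longer than a chosen idle limit. This is an assumption about how such a check might work, not a description of our monitoring tool: the `Send` record, its fields, the `find_stalled_sends` function, and the 15-minute default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Send:
    """Hypothetical record of an email send and its progress."""
    send_id: str
    last_progress_at: datetime           # last time any email in the send went out
    completed_at: Optional[datetime] = None


def find_stalled_sends(
    sends: List[Send],
    now: datetime,
    max_idle: timedelta = timedelta(minutes=15),  # illustrative threshold
) -> List[Send]:
    """Return sends that are unfinished and idle for longer than max_idle."""
    return [
        s for s in sends
        if s.completed_at is None and now - s.last_progress_at > max_idle
    ]
```

A shorter idle limit would surface stalls sooner at the cost of more false positives on sends that are simply slow, so the threshold would need tuning against real send behaviour.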
Posted Jan 06, 2023 - 16:03 GMT

Resolved
Following the change we made earlier, everything is back to normal now, and any delayed sends have completed successfully. We’ll write a detailed description of what caused the issue. Check back here in a day or two for the report.
We’re sorry for the interruption to your day.
Posted Jan 06, 2023 - 00:59 GMT
Identified
We've made a change and are no longer seeing any new delayed sends.
We'll continue to monitor things from here. We're working on releasing a small number of email campaigns that have remained in a delayed state.
Posted Jan 05, 2023 - 23:59 GMT
Investigating
We are currently experiencing delays in sending emails via the Dotdigital platform. Our team is investigating and we'll update this status page soon with more information.
Posted Jan 05, 2023 - 22:17 GMT
This incident affected: Asia Pacific - Dotdigital R3 (Asia Pacific - Mail Sending), North America - Dotdigital R2 (North America - Mail Sending), and Europe - Dotdigital R1 (Europe - Mail Sending).