Outbound SMS sending issues (All regions)
Incident Report for Dotdigital
Postmortem

Summary of impact:

At approximately 13:35 UTC on 11th Dec 2022, SMS sends from Dotdigital began to experience delays. We restored services at 22:10 UTC the same day.

Customers may have experienced some of the following issues:

  • SMS sent via any method in the Dotdigital platform, or via any of our APIs, were delayed, and messages may have gone out many hours later than intended
  • A small number of inbound messages may also have been delayed

Notes: 

  • Customers who send via SMPP were not impacted by these delays
  • All outbound and inbound SMS were successfully processed; no SMS were lost during this incident

Root Cause:

Our SMS platform has multiple software components that need to coordinate and communicate with each other. To scale and distribute work correctly, we use a “queuing” pattern, implemented with two technologies: message queues and database queues.

The incident stemmed from our use of database queues. In a database queue, a component writes messages to a database table, and another component retrieves the messages and deletes them from the database once they have been handled. The database tables have a column that keeps track of the event ID, but this column can only go up to a maximum value which is constrained by the database. Once it reaches this value, it cannot accept new data.
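As an illustration only, the database queue pattern described above can be sketched in a few lines. This is not Dotdigital's implementation: the SQLite backend, the table name `sms_queue`, the column names, and the function names are all assumptions chosen to keep the example self-contained. The point to notice is that the event ID keeps climbing even though handled rows are deleted, so an empty queue can still be close to its ceiling.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical queue table. AUTOINCREMENT means event_id never reuses
    # a value, even after the row that held it has been deleted.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sms_queue ("
        " event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL)"
    )


def enqueue(conn, payload):
    # Producer side: write one message and return the event ID it was given.
    with conn:
        cur = conn.execute(
            "INSERT INTO sms_queue (payload) VALUES (?)", (payload,)
        )
    return cur.lastrowid


def dequeue(conn, handler):
    # Consumer side: take the oldest message, handle it, then delete it.
    # Returns the handled event ID, or None if the queue is empty.
    row = conn.execute(
        "SELECT event_id, payload FROM sms_queue ORDER BY event_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    event_id, payload = row
    handler(payload)
    with conn:
        conn.execute("DELETE FROM sms_queue WHERE event_id = ?", (event_id,))
    return event_id
```

In this sketch, enqueueing two messages, draining the queue, and enqueueing a third gives the third message event ID 3 rather than 1: deleting rows does not give back any of the ID range.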

We have a standard operating procedure to detect when these tables are approaching their maximum value and to reset them back to zero. However, due to a code flaw, the alerting that should have told us the tables were approaching their maximum values didn’t work. As a result, when the maximum value was reached, message sending stopped until we manually intervened.
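A minimal sketch of the kind of check this procedure relies on is shown below. Everything specific in it is an assumption for illustration: the report does not state the column type, so a 32-bit signed ceiling (`INT32_MAX`) is assumed, and the 80% threshold and the names `check_queue`, `read_current_id`, and `alert` are hypothetical. The sketch also alerts when the check itself fails, since a monitor that breaks silently looks the same as a healthy queue.

```python
INT32_MAX = 2**31 - 1  # assumed ceiling; the real column type is not stated


def check_queue(read_current_id, alert, column_max=INT32_MAX, threshold=0.8):
    """Alert if the queue's event ID is close to the column maximum.

    read_current_id: hypothetical callable returning the latest event ID.
    alert: hypothetical callable that takes a message string.
    """
    try:
        current = read_current_id()
    except Exception as exc:
        # A failing check must be noisy, otherwise the gap goes unnoticed.
        alert(f"could not read queue event ID: {exc!r}")
        return
    used = current / column_max
    if used >= threshold:
        alert(f"queue event ID at {used:.0%} of maximum; reset required")
```

For example, calling `check_queue(lambda: 2_000_000_000, print)` with these assumed defaults would print a reset warning, while a value of 1,000 would print nothing.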

Mitigation:

The timeline (in UTC) for resolving this issue was:

  • 21:20 - Issue identified as widespread, with queues building on the SMS infrastructure
  • 21:27 - On-call team were alerted
  • 21:28 - Posted a status page
  • 22:09 - Cause of the outage was understood and we agreed on next steps
  • 22:12 - Table reset was performed, and message flow resumed
  • 23:48 - Status page closed

Next Steps:

  • Resolve the issue which prevented us from being alerted to a required reset
  • Review the database queue tables to find a more resilient approach that doesn’t require a reset
  • Configure additional alerts on any under-monitored queues
  • Introduce a better general error alerting system so that unanticipated system errors will be investigated immediately
Posted Dec 13, 2022 - 10:27 GMT

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved.

We’re sorry this happened, and for the interruption it caused. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Dec 11, 2022 - 23:46 GMT
Monitoring
We’ve released a fix, and outbound SMS sending resumed at 22:11 UTC. We are currently processing a backlog of messages. Please note that SMPP customers were not affected by this incident.

We’ll keep a close eye on things and will post a final confirmation shortly when it’s 100% resolved.
Posted Dec 11, 2022 - 22:47 GMT
Identified
Our teams have isolated the issue, and we’re now working on a resolution. Thanks for your patience, everyone.
Posted Dec 11, 2022 - 22:10 GMT
Investigating
We've discovered an issue with sending outbound SMS and our tech team are working to resolve the issue. Sorry if this is affecting you, but things should be back to normal very soon. Look out for more news from us shortly.
Posted Dec 11, 2022 - 21:28 GMT
This incident affected: Europe - Dotdigital R1 (Europe - SMS), North America - Dotdigital R2 (North America - SMS), and Asia Pacific - Dotdigital R3 (Asia Pacific - SMS).