Summary of impact:
At approximately 13:35 UTC on 11th Dec 2022, SMS sends from Dotdigital began to experience delays. We restored services at 22:10 UTC the same day.
Customers may have experienced some of the following issues:
Our SMS platform has multiple software components that need to coordinate and communicate with each other. To scale and distribute work correctly, we use a “queuing” pattern, implemented with two technologies: message queues and database queues.
The incident stemmed from our use of database queues. In a database queue, one component writes messages to a database table, and another component retrieves the messages and deletes them from the database once they have been handled. The database tables have a column that keeps track of the event ID, but this column can only go up to a maximum value, which is constrained by the database. Once the column reaches this value, the table cannot accept new data.
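To illustrate the pattern described above, here is a minimal sketch of a database queue. It is not our production code: it uses SQLite as a stand-in database, and the table name `sms_queue`, the column names, and the helper functions are all hypothetical. It shows why the event ID keeps climbing even though handled messages are deleted and the table itself stays small.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical queue table. AUTOINCREMENT makes SQLite hand out
    # strictly increasing IDs that are never reused after a delete.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sms_queue ("
        "event_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL)"
    )


def enqueue(conn, payload):
    # Producer side: write a message and return the event ID it received.
    cur = conn.execute("INSERT INTO sms_queue (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def dequeue(conn):
    # Consumer side: take the oldest message, then delete it once handled.
    row = conn.execute(
        "SELECT event_id, payload FROM sms_queue ORDER BY event_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # queue is empty
    conn.execute("DELETE FROM sms_queue WHERE event_id = ?", (row[0],))
    conn.commit()
    return row


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_queue(conn)
    enqueue(conn, "first message")
    dequeue(conn)  # table is now empty again
    # The next ID continues from the previous one rather than restarting.
    print(enqueue(conn, "second message"))
```

In this sketch an empty table does not mean a fresh ID range: the counter only moves forward, so a busy queue will eventually approach the column's maximum unless the counter is reset.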
We have a standard operating procedure to detect when these tables are approaching their maximum value and to reset the event ID back to zero. However, due to a code flaw, the alerting that should have told us the tables were approaching their maximum values did not work. As a result, when the maximum value was reached, message sending stopped until we manually intervened.
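The kind of check that this alerting performs can be sketched as follows. This is an illustration only, not the code that failed: the 32-bit signed maximum, the 80% threshold, and the function names are assumptions chosen for the example, since the actual column type and alert threshold are not stated here.

```python
# Assumption: the event ID column is a 32-bit signed integer.
INT32_MAX = 2**31 - 1


def id_space_used(current_id, max_id=INT32_MAX):
    # Fraction of the available ID range that has already been consumed.
    if max_id <= 0:
        raise ValueError("max_id must be positive")
    return current_id / max_id


def should_alert(current_id, max_id=INT32_MAX, threshold=0.8):
    # Hypothetical rule: alert once the consumed fraction reaches the
    # threshold, leaving time to reset the counter before it runs out.
    return id_space_used(current_id, max_id) >= threshold


if __name__ == "__main__":
    for current in (1_000_000_000, 2_000_000_000):
        print(current, should_alert(current))
```

A check like this only helps if it runs against the live counter value on a schedule and its result reaches someone who can act on it, which is the step that failed in this incident.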
The timeline (in UTC) for resolving this issue was: