Delays sending SMS via Dotdigital
Incident Report for Dotdigital
Postmortem

Summary of impact:

At approximately 14:30 UTC on 5th August 2022, we encountered an issue with one of our services which caused delays in sending SMS via Dotdigital. We restored services at 15:10 UTC on 5th August 2022. 

Customers may have experienced delays with their SMS sends. No messages were lost; however, the API used to submit messages was only intermittently available.

Root Cause:

The Dotdigital SMS system uses a messaging service to allow different processes to communicate. Prior to the incident, this messaging service was updated with the installation of a new plugin, and this action caused the messaging system to restart in an unplanned way. Due to the configuration of our messaging cluster, the restart caused message queues to become out of sync with their failover queues. This situation persisted and eventually caused a degradation of service which prevented SMS traffic from being dispatched.

Mitigation:

At 14:30 UTC, alarms were triggered after an automated system detected a large batch of application errors. At 14:40 UTC, errors occurred in our Communications API which prevented customers from submitting new messages. The decision was taken to fail over to our alternate disaster recovery site, and normal service was restored at 15:10 UTC.

Investigations continued in our primary site, and the fault with the messaging system was identified and addressed by following a standard restart procedure.

Next Steps:

We will investigate migrating our messaging system from an older high-availability method to a newer alternative. This will make the system less prone to failure from unexpected restarts.

Posted Aug 09, 2022 - 15:26 BST

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved.
We’re sorry this happened and for the interruption it caused. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Aug 05, 2022 - 17:18 BST
Monitoring
SMS delivery is now back to normal. We're monitoring the situation to confirm there is no recurrence of the issue.
Posted Aug 05, 2022 - 17:00 BST
Investigating
We are currently experiencing delays sending SMS via the Dotdigital platform. Our team are investigating and we'll update this status page soon with more information.
Posted Aug 05, 2022 - 16:16 BST
This incident affected: Global CPaaS (API), Europe - Dotdigital R1 (Europe - SMS), North America - Dotdigital R2 (North America - SMS), and Asia Pacific - Dotdigital R3 (Asia Pacific - SMS).