Summary of impact:
At approximately 14:30 UTC on 5th August 2022, we encountered an issue with one of our services which caused delays in sending SMS via Dotdigital. We restored services at 15:10 UTC on 5th August 2022.
Customers may have experienced delays with their SMS sends. No messages were lost; however, the API used to submit messages was only intermittently available.
The Dotdigital SMS system uses a messaging service to allow different processes to communicate. Prior to the incident, this messaging service was updated with the installation of a new plugin, and this action caused the messaging system to restart in an unplanned way. Due to the configuration of our messaging cluster, the restart caused message queues to fall out of sync with their failover queues. This situation persisted and eventually degraded the service, preventing SMS traffic from being dispatched.
At 14:30 UTC, alarms were triggered after an automated system detected a large batch of application errors. At 14:40 UTC, errors occurred in our Communications API which prevented customers from submitting new messages. The decision was taken to fail over to our alternate disaster recovery site, and normal service was restored at 15:10 UTC.
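As an illustration of how an automated check can flag a large batch of application errors, the sketch below counts errors inside a sliding time window and raises an alarm once a threshold is reached. It is a hypothetical example only: the class name `ErrorBurstAlarm`, the threshold and the window length are assumptions made for illustration and do not describe the monitoring system referred to above.

```python
from collections import deque


class ErrorBurstAlarm:
    """Hypothetical sliding-window alarm; not the monitoring used in this incident."""

    def __init__(self, threshold, window_seconds):
        # threshold and window_seconds are illustrative values chosen by the caller.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, timestamp):
        """Record one error (timestamp in seconds); return True if the alarm fires."""
        self._timestamps.append(timestamp)
        # Discard errors that are no longer inside the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Three errors within 60 seconds trigger the alarm; an isolated later error does not.
alarm = ErrorBurstAlarm(threshold=3, window_seconds=60)
results = [alarm.record_error(t) for t in (0, 10, 20, 200)]
print(results)
```

A window-based count of this kind reacts to a burst of errors while ignoring occasional isolated failures, which is why it suits detecting a sudden degradation rather than background noise.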
Investigations continued at our primary site, where the fault with the messaging system was identified and resolved by following a standard restart procedure.
We will investigate migrating our messaging system from an older high-availability method to a newer alternative. This should make the system less prone to failure from unexpected restarts.