Degraded performance with SMS, social and mobile messaging
Incident Report for dotdigital
Postmortem

RCA: Degraded performance with SMS, social and mobile messaging

Summary of impact:

At approximately 08:00 BST on Friday 28th June 2019, a backlog of work started to build as jobs queued to be processed by background services within our CPaaS platform. This led to Engagement Cloud customers experiencing delays in both outbound and inbound message delivery across SMS, Facebook Messenger and Twitter DM channels. In addition, some of our customers would have experienced intermittent problems accessing the CPaaS One API due to authentication failures.

Service was fully restored at 13:45 BST after engineers moved production workloads from our primary to our secondary site.

Root cause:

Engineers discovered that services were intermittently failing to connect to our message queuing cluster. This reduced system throughput and created the delays experienced by customers.

Mitigation:

Initially we suspected that one of our three message queue servers was suffering under load, and at 10:25 BST it was removed from service and scaled up. However, the additional capacity did not improve the situation, and the resulting restart of the message queue service caused connecting services to hang. To rectify this, all services required a rolling restart to enable them to reconnect to the message queue.

Our message queue cluster continued to drop connections, so at 13:45 BST we failed over to our secondary site and service was restored.

Work continued on our primary site to restore the message queue cluster to working order. At 19:00 BST on 1st July 2019, a full message queue cluster restart was performed and the cluster became healthy again.

At 09:15 BST on 2nd July 2019, service was restored to our primary region.

Next steps:

The time taken to identify the cause of this problem was increased because application errors related to the message queue were not surfaced sufficiently. We also identified that adding a health check function to each microservice would help establish the status of the system more quickly, rather than relying on log integration alone.
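
As an illustration of the kind of health check described above, the following is a minimal sketch in Python. It is not our implementation: the names `HealthStatus`, `check_queue_reachable` and `service_health` are hypothetical, and the check only confirms that a TCP connection to a message queue endpoint can be opened. A real check would likely also verify authentication and the ability to publish or consume.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthStatus:
    """Result of one dependency check (hypothetical type for this sketch)."""
    healthy: bool
    detail: str


def check_queue_reachable(host: str, port: int, timeout: float = 2.0) -> HealthStatus:
    """Report whether a TCP connection to the queue endpoint can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return HealthStatus(True, f"connected to {host}:{port}")
    except OSError as exc:
        # Surface the connection error directly instead of leaving it in logs.
        return HealthStatus(False, f"cannot reach {host}:{port}: {exc}")


def service_health(checks: Dict[str, Callable[[], HealthStatus]]) -> dict:
    """Run every named check and summarise the overall service status."""
    results = {name: check() for name, check in checks.items()}
    return {
        "healthy": all(result.healthy for result in results.values()),
        "checks": {name: result.detail for name, result in results.items()},
    }
```

A service could expose the dictionary returned by `service_health` on a status endpoint, so that a failing message queue connection is visible at a glance.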

Services should handle message queue restarts more gracefully, allowing us to carry out regular message queue maintenance and version upgrades.
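
One common way to handle a message queue restart gracefully is to retry the connection with exponential backoff and jitter, rather than hanging or failing permanently. The sketch below is illustrative only and makes several assumptions: `connect_with_backoff` and its parameters are hypothetical names, and `connect` stands in for whatever call a given client library uses to open a connection, which is assumed here to raise `ConnectionError` on failure.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=8, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    Re-raises the last ConnectionError once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Double the upper bound on the wait each attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter stops many services from reconnecting in lockstep
            # at the moment the queue comes back.
            sleep(random.uniform(0, upper))
```

With this pattern, a service that loses its connection during maintenance waits and reconnects on its own, which reduces the need for a rolling restart of every service.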

These areas will be fed back into the development & service operations teams as part of our continuous improvement plans.

Posted Jul 02, 2019 - 16:05 BST

Resolved
This incident has been resolved.
Posted Jun 28, 2019 - 14:53 BST
Monitoring
Following a fail-over to our secondary site, service has now been restored. If you continue to encounter issues accessing portal.comapi.com then please restart your web browser.
Posted Jun 28, 2019 - 14:04 BST
Investigating
We’re investigating intermittent failures from our CPaaS platform. Customers may experience delays in their SMS, social and mobile messaging.
Posted Jun 28, 2019 - 12:51 BST
This incident affected: Global CPaaS (API, Portal).