CPaaS delayed SMS and messages
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 17:26 BST on Monday 30th September 2019, we began to see delayed messages within the CPaaS platform. During this period, new messages submitted via api.comapi.com and portal.comapi.com were accepted but not delivered. This affected all message types: SMS, Facebook Messenger, Twitter and Native Push. We restored full service for new traffic at 17:56 BST, and the backlog was cleared at 18:45 BST.

Root cause:

The backlog was caused by an increase in load on our primary message queuing system of 4x the usual traffic over a 10-minute period. The increased load consumed more memory and triggered a mechanism that throttles the volume of message queue operations. This feature of our message queuing software is designed to protect the system from overuse.
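
The throttling behaviour described above can be illustrated with a minimal sketch. It assumes a memory-based watermark; the class `ThrottlingQueue`, the `high_watermark` parameter and all numbers are hypothetical and do not reflect the actual queuing software or its configuration.

```python
from collections import deque
from typing import Optional


class ThrottlingQueue:
    """Toy in-memory queue that refuses publishes above a memory watermark."""

    def __init__(self, memory_limit_bytes: int, high_watermark: float = 0.4):
        # high_watermark is a hypothetical fraction of the memory limit at
        # which publishing is throttled; the real value is not stated.
        self.threshold = memory_limit_bytes * high_watermark
        self.used = 0
        self.items = deque()

    @property
    def throttled(self) -> bool:
        # True once queued payloads occupy at least the watermark.
        return self.used >= self.threshold

    def publish(self, payload: bytes) -> bool:
        """Accept the message unless the watermark has been reached."""
        if self.throttled:
            return False
        self.items.append(payload)
        self.used += len(payload)
        return True

    def consume(self) -> Optional[bytes]:
        """Remove the oldest message, freeing the memory it occupied."""
        if not self.items:
            return None
        payload = self.items.popleft()
        self.used -= len(payload)
        return payload
```

In this sketch, a burst of publishes fills memory faster than consumers drain it, so `publish` starts returning `False` until enough messages are consumed. Raising `high_watermark` lets the queue absorb a larger burst before throttling begins.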

Mitigation:

We redirected traffic to our secondary site at 17:56 BST. This allowed any new requests to be delivered via separate infrastructure and gave our primary site time to catch up with the backlog.

At 18:45 BST the backlog of messages in our primary site had cleared. Following successful testing, we routed traffic back to our primary site.

Next steps:

We’re going to:

  • Alter the configuration of our message queue system to throttle requests at a higher threshold. This will prevent the system from taking action too early to protect itself.
  • Enhance monitoring and alerting of our message queuing system to alert us to these issues.
  • Improve our services to handle message queue connection errors better. This will reduce the time taken to restore throughput after a message queue disconnect.
  • Investigate alternative message queuing systems.
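
As an illustration of the third step above, the following is a minimal sketch of retrying a message queue operation with exponential backoff after a connection error. The function `call_with_backoff` and its parameters are hypothetical and are not taken from the platform's code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Capping the delay keeps recovery time bounded, while the doubling interval avoids hammering a queue that is already under pressure.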
Posted 18 days ago. Oct 01, 2019 - 16:04 BST

Resolved
This incident has been resolved.
Posted 19 days ago. Sep 30, 2019 - 19:07 BST
Monitoring
Following a move to our secondary site, full service has now been restored. Any queued messages have now been dispatched.
Posted 19 days ago. Sep 30, 2019 - 18:56 BST
Identified
We have identified the cause and our team are working on resolving it. Messages are being queued and will go out once the issue is resolved.
Posted 19 days ago. Sep 30, 2019 - 18:29 BST
Investigating
We are currently experiencing delays in messages being sent out from our CPaaS platform. This is affecting customers who use the Engagement Cloud CPaaS portal and the "One" API, and also includes SMS messages being sent from the Engagement Cloud. Our engineers are working on the problem and more information will be available shortly.
Posted 19 days ago. Sep 30, 2019 - 18:07 BST
This incident affected: Global CPaaS (API).