Delays and intermittent errors sending SMS
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 11:03 UTC on Tuesday 24th December 2019, we experienced delays in sending SMS. We restored normal service at 11:59 UTC the same day. Customers may have experienced delays in:

  • Dispatching SMS via the CPaaS API
  • SMS sends via Engagement Cloud.

Root Cause:

The delays happened because our API servers experienced a sudden surge in requests. As a result, high CPU on all our API servers led to intermittent errors when connecting to our database servers. These factors combined to reduce our throughput and led to messages being delayed.

Mitigation:

Our team were proactively alerted to the incident when the initial failure occurred at 11:03 UTC. After an initial assessment, we posted an incident notification to our customers through our status page. We allocated increased resources to our API cluster based on our expected load, plus additional resources for any overhead. After identifying the issue, our team deployed additional resources to the cluster to handle the load and increased the flow of SMS sends.
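
As an illustration of sizing a cluster for "expected load, plus additional resources for any overhead", the calculation might look like the minimal sketch below. The function name, the per-server throughput and the 30% headroom figure are all hypothetical and are not taken from our actual capacity model.

```python
import math


def required_servers(expected_rps: float, per_server_rps: float, headroom: float = 0.3) -> int:
    """Return how many API servers are needed for the expected load plus headroom.

    All parameters are illustrative assumptions:
      expected_rps   - forecast peak requests per second
      per_server_rps - requests per second one server handles at a safe CPU level
      headroom       - extra fraction of capacity kept in reserve (0.3 = 30%)
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    if expected_rps < 0 or headroom < 0:
        raise ValueError("expected_rps and headroom must be non-negative")
    # Scale the forecast by the headroom, then round up to whole servers.
    return math.ceil(expected_rps * (1 + headroom) / per_server_rps)


if __name__ == "__main__":
    # Hypothetical figures: 1,000 requests/s forecast, 150 requests/s per server.
    print(required_servers(1000, 150, headroom=0.3))
```

Rounding up rather than to the nearest whole number means the reserve is never smaller than the headroom asked for.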

Our team continued to monitor the issue, and at 11:59 UTC we were confident there were no delays in SMS sends and services were running as normal. Shortly after, we closed the incident on our status page.

Next Steps:

We’re really sorry this incident occurred and for any disruption it caused. We’re continually reviewing our services and making adjustments to ensure our infrastructure can handle an increased workload to prevent further incidents. With that in mind, we’ll:

  • Continue to monitor the load across the API cluster to ensure that it can process the number of requests we receive
  • Regularly review and adjust the available resources for the API cluster.
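
One simple way to monitor load of this kind is to track a rolling average of CPU utilisation and flag when it stays above a threshold. The sketch below is illustrative only: the class name, the 80% threshold and the window size are assumptions and do not describe our monitoring tooling.

```python
from collections import deque
from statistics import mean


class LoadMonitor:
    """Flags sustained high CPU using a rolling mean over the last `window` samples."""

    def __init__(self, threshold: float = 0.8, window: int = 5) -> None:
        self.threshold = threshold           # hypothetical alert level (0.8 = 80% CPU)
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, cpu_utilisation: float) -> bool:
        """Store one sample in the range 0.0-1.0 and return True if an alert should fire.

        No alert fires until the window is full, so a single spike is ignored.
        """
        self.samples.append(cpu_utilisation)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold


if __name__ == "__main__":
    monitor = LoadMonitor(threshold=0.8, window=3)
    for sample in (0.50, 0.90, 0.95, 0.95):
        print(sample, monitor.record(sample))
```

Averaging over a window trades a short detection delay for fewer false alarms from momentary spikes.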
Posted Jan 06, 2020 - 15:14 GMT

Resolved
This incident has been resolved.
Posted Dec 24, 2019 - 11:59 GMT
Monitoring
Additional capacity has been added and services are now running normally.
Posted Dec 24, 2019 - 11:52 GMT
Identified
We're investigating high CPU on our API cluster and are taking action to reduce load. Customers will notice error rates reducing as service returns to normal.
Posted Dec 24, 2019 - 11:17 GMT
Investigating
We are investigating reports of delays and intermittent errors in our CPaaS API for dispatching SMS.

This could also delay SMS broadcast sends via Engagement Cloud.
Posted Dec 24, 2019 - 11:03 GMT
This incident affected: Global CPaaS (API, SMS API).