SMS Delays - All regions
Incident Report for dotdigital
Postmortem

RCA: SMS Delays - All Regions

Summary of impact:

At approximately 19:42 UTC on Tuesday 21st July 2020, the SMS components of our platform experienced networking errors leading to a partial outage of the system. We restored services at 20:13 UTC on Tuesday 21st July 2020. 

Customers may have experienced some of the following issues:

  • Delays to incoming SMS, receipts and outgoing SMS sends
  • Intermittent 500 errors returned for API calls during the outage (this will also have impacted chat functionality)
  • Intermittent access to the Engagement Cloud CPaaS Portal (https://apps-cpaas.dotdigital.com/)

Root Cause:

  • The storage layer attempted to fail-over due to an ongoing networking error
  • The fail-over didn’t complete successfully because the networking issue was still ongoing, and the fail-over process was paused in an incomplete state. Manual intervention was required to bring services back online.

Mitigation:

The timeline for resolving this issue was:

  • 19:42 UTC: We were alerted to networking issues on the SMS components of our platform and promptly began investigating.
  • 19:49 UTC: We discovered that the networking issues had led to a partial fail-over of the storage layer. Manual intervention was required to restore services.
  • 20:05 UTC: We began taking remedial action by restarting our storage resources.
  • 20:13 UTC: We confirmed our storage resources were back up and running. Service returned to normal and we began monitoring to ensure all services remained stable.
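
For reference, the durations implied by the timeline above can be checked with a short script. This is only an illustrative sketch: the event labels and the helper names (parse_utc, minutes_between, step_durations) are made up for this example and are not part of any dotdigital tooling. The timestamps are the UTC times listed in the timeline.

```python
from datetime import datetime, timezone

# UTC timestamps taken from the timeline above; the labels are paraphrased.
TIMELINE = [
    ("19:42", "alerted to networking issues"),
    ("19:49", "partial fail-over of the storage layer discovered"),
    ("20:05", "storage resources restarted"),
    ("20:13", "storage resources confirmed up and running"),
]


def parse_utc(hhmm):
    """Parse an HH:MM string as a UTC time on 21 July 2020."""
    parsed = datetime.strptime(f"2020-07-21 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed between two HH:MM timestamps."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)


def step_durations(timeline):
    """Return (start, end, minutes) for each consecutive pair of events."""
    return [
        (a[0], b[0], minutes_between(a[0], b[0]))
        for a, b in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for start, end, minutes in step_durations(TIMELINE):
        print(f"{start} -> {end}: {minutes} min")
    total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
    print(f"Total: {total} min")
```

On these figures, the time from the first alert to confirmed restoration is 31 minutes.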

Next Steps:

  • We have raised a case with Microsoft Azure to investigate the issue and to work on preventing a recurrence.
Posted Jul 23, 2020 - 11:01 BST

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved.
We were experiencing a high level of networking errors within our third-party cloud provider, which led to a fail-over of some of our database servers.
We’re sorry this happened and for the interruption it caused. We’re going to write a full RCA report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Jul 21, 2020 - 21:35 BST
Monitoring
All functionality is now restored.
We're keeping a watchful eye on things to make sure it stays that way. We'll let you know when we’re 100% confident everything is fully back to normal.
Posted Jul 21, 2020 - 21:17 BST
Investigating
We're investigating an issue with delayed SMS messages both from Engagement Cloud and our CPaaS/API products. Sorry if you're affected, but our tech team are working as quickly as possible to resolve the issue and get things back to normal. We'll share another update shortly.
Posted Jul 21, 2020 - 21:14 BST
This incident affected: Global CPaaS (API, SMS API), Europe - Engagement Cloud r1 (Europe - SMS), Asia Pacific - Engagement Cloud r3 (Asia Pacific - SMS), and North America - Engagement Cloud r2 (North America - SMS).