Sending Issues - Engagement Cloud North America
Incident Report for dotdigital
Postmortem

Summary of Impact:

At approximately 16:30 UTC on Thursday 20th February 2020, Engagement Cloud customers in our North American region experienced delayed email and SMS campaign sends. Transactional email sends were also delayed. We fully restored services at 19:42 UTC.

Root Cause:

After an in-depth investigation, we isolated the problem to the internet connectivity in our North American sending node data center. This sending node provides much of the processing required to personalize and dispatch email and SMS sends.

In the data center we have two internet circuits from separate internet transit suppliers, which provide redundancy if one of the connections fails. These are configured as active/active for maximum throughput and resilience. In addition, we have multiple connections to Microsoft Azure for enhanced resilience.

We carried out tests and found both connections were experiencing issues trying to communicate with core internet services and our cloud infrastructure in Microsoft Azure. In addition, we discovered there were some physical cabling issues in the data center that may have been impacting the performance of one of our internet connections.

Mitigation:

Here’s a timeline for our mitigation steps (times shown in UTC):

  • 16:35 - We carried out tests and found we were experiencing intermittent connectivity issues in our North American data center. This was causing services to generate multiple errors because they could not access key resources required for personalization and dispatch
  • 16:45 - We stopped sending services
  • 16:50 - We contacted the data center and our Internet transit providers to initiate support and alert them to the issues
  • 17:20 - One of our Internet transit providers notified us that they were investigating a possible interruption which may have been impacting our service
  • 17:40 - We restarted our gateway connectivity to Microsoft Azure
  • 17:55 - We discovered there were potential cabling issues with one of our transit circuits
  • 18:00 - We carried out further tests on both internet connections and found that one was experiencing intermittent issues whilst the other was experiencing increased latency over our gateway to Microsoft Azure
  • 18:30 - We fixed the cabling issues and re-seated the connectors for our internet circuit
  • 18:41 - We restarted our gateway to Microsoft Azure a second time
  • 18:50 - We brought all circuits back online and found that both connections were now stable
  • 19:10 - We resumed sending services and continued to monitor the service
  • 19:40 - We were confident the connectivity was stable and updated our customers through our status page and support ticket

Next Steps:

We’ll:

  • Adjust our monitoring so that we receive alerts earlier and can react more quickly if we ever have a similar issue in the future
  • Engage with network carriers to look at cleaning their circuits in the data center
  • Continue our discussions with our 3rd party providers and analyze further log data. We’ll update this RCA if more information is found on a specific root cause.
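
The monitoring adjustment in the first item above could be sketched roughly as follows. This is a minimal illustration only, not the monitoring actually in use: the `CircuitMonitor` class, the `ProbeResult` type, the thresholds, and the status labels are all assumptions made for the example. It keeps a rolling window of probe results per circuit and separates the two symptoms seen in the 18:00 tests, intermittent failures and increased latency.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class ProbeResult:
    ok: bool                 # did the probe reach its target?
    latency_ms: float = 0.0  # round-trip time; ignored when ok is False


class CircuitMonitor:
    """Rolling-window health check for one circuit (hypothetical example)."""

    def __init__(self, window: int = 10, max_failure_rate: float = 0.2,
                 max_latency_ms: float = 150.0) -> None:
        # Only the most recent `window` probe results are kept.
        self.results: Deque[ProbeResult] = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate  # assumed threshold
        self.max_latency_ms = max_latency_ms      # assumed threshold

    def record(self, result: ProbeResult) -> None:
        self.results.append(result)

    def status(self) -> str:
        if not self.results:
            return "unknown"
        # Failed probes are checked first: lost connectivity outranks slowness.
        failures = sum(1 for r in self.results if not r.ok)
        if failures / len(self.results) > self.max_failure_rate:
            return "intermittent"
        # Average latency is computed over successful probes only.
        latencies = [r.latency_ms for r in self.results if r.ok]
        if latencies and sum(latencies) / len(latencies) > self.max_latency_ms:
            return "high_latency"
        return "healthy"
```

In a sketch like this, each circuit would have its own monitor fed by periodic probes, and any status other than "healthy" would raise an alert. Suitable window sizes and thresholds depend on normal traffic and would need tuning against real data.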
Posted Feb 21, 2020 - 19:09 GMT

Resolved
We’ve monitored the situation for some time now and we’re confident this issue with sends in our North America region is fully resolved.
Sorry it happened and for the interruption it caused. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Feb 20, 2020 - 19:42 GMT
Monitoring
We’re bringing sends back online now, so expect to see sends complete successfully again. We’ll continue to monitor things from here on. More news to follow in 45 minutes.
Posted Feb 20, 2020 - 19:08 GMT
Update
Our teams have isolated the issue to an internet failure at one of our US data center providers, and we’re now taking care of the problem.
Thanks for your patience, everyone. We'll provide another status update in 45 minutes.
Posted Feb 20, 2020 - 18:23 GMT
Identified
We've identified a network-related issue in our North American data center. We're working with our 3rd party providers to resolve the issue as quickly as possible. Customers in our North American region will experience some delays in their email sends completing. We'll provide further updates in due course. We're sorry for the inconvenience.
Posted Feb 20, 2020 - 17:28 GMT
Investigating
We are currently experiencing email sending issues in our North America region.
Our teams are investigating, and we will provide an update in due course.
Posted Feb 20, 2020 - 16:45 GMT
This incident affected: North America - Engagement Cloud r2 (North America - Mail Sending, North America - Transactional Email).