Summary of impact:
At approximately 16:30 UTC on Thursday 20th February 2020, Engagement Cloud customers in our North American region experienced delayed email and SMS campaign sends. This also included delays to transactional email sends. We fully restored services at 19:42 UTC.
After an in-depth investigation, we isolated the problem to internet connectivity in our North American sending node data center. This sending node provides much of the processing required to personalize and dispatch email and SMS sends.
In the data center we have two internet circuits from separate internet transit suppliers, providing redundancy should one of the connections fail. These are configured active/active for maximum throughput and resilience. In addition, we have multiple connections to Microsoft Azure for enhanced resilience.
We carried out tests and found that both connections were experiencing issues communicating with core internet services and our cloud infrastructure in Microsoft Azure. We also discovered physical cabling issues in the data center that may have been affecting the performance of one of our internet connections.
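The tests referred to above were network-level reachability and latency checks across both circuits. As a purely illustrative sketch (not our actual tooling, and the endpoint names and thresholds are assumptions), a minimal probe of this kind might look like:

```python
# Illustrative sketch of a circuit health check: measure TCP connect time
# to a set of key endpoints and classify each as ok, slow, or unreachable.
# Endpoints and the latency threshold are hypothetical examples.
import socket
import time

def tcp_connect_latency(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None on failure/timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def check_endpoints(endpoints, latency_threshold=0.5):
    """Return a dict mapping (host, port) -> 'ok' | 'slow' | 'unreachable'."""
    results = {}
    for host, port in endpoints:
        latency = tcp_connect_latency(host, port)
        if latency is None:
            results[(host, port)] = "unreachable"
        elif latency > latency_threshold:
            results[(host, port)] = "slow"
        else:
            results[(host, port)] = "ok"
    return results
```

Running a probe like this from behind each transit circuit in turn is one simple way to tell whether a problem is specific to one connection or common to both.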
Here’s a timeline for our mitigation steps (times shown in UTC):
- 16:35 - We carried out tests and found we were experiencing intermittent connectivity issues in our North American data center. This was causing services to generate multiple errors, as they were unable to access key resources required for personalization and dispatch
- 16:45 - We stopped sending services
- 16:50 - We contacted the data center and our Internet transit providers to initiate support and alert them to the issues
- 17:20 - We were notified by one of our Internet transit providers that they were investigating a possible interruption which may have been impacting our service
- 17:40 - We restarted our gateway connectivity to Microsoft Azure
- 17:55 - We discovered there were potential cabling issues with one of our transit circuits
- 18:00 - We carried out further tests on both internet connections and found that one was experiencing intermittent issues, whilst the other showed increased latency over our gateway to Microsoft Azure.
- 18:30 - We fixed the cabling issues and re-seated the connectors for our internet circuit
- 18:41 - We restarted our gateway to Microsoft Azure a second time
- 18:50 - We brought all circuits back online and found that both connections were now stable
- 19:10 - We resumed sending services and continued to monitor the service
- 19:40 - We were confident the connectivity was stable and updated our customers via our status page and support tickets.
Here's what we're doing to help prevent a recurrence:
- Make adjustments to our monitoring so we receive alerts earlier, enabling us to react more quickly if a similar issue occurs in the future
- Engage with our network carriers to look at cleaning their circuits in the data center
- Continue our discussions with our third-party providers and analyze further log data. We'll update this RCA if more information is found on a specific root cause.
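On the first point above, earlier alerting typically means raising an alert as soon as a short sliding window of probe results crosses an error-rate threshold, rather than waiting for a sustained outage. The sketch below is a hypothetical illustration of that idea; the class name, window size, and threshold are assumptions, not our production configuration:

```python
# Hypothetical sliding-window alerter: fires when the recent error rate
# among connectivity probes crosses a threshold. Window size and threshold
# are illustrative values only.
from collections import deque

class ConnectivityAlerter:
    def __init__(self, window_size=10, error_threshold=0.3):
        # Keep only the most recent `window_size` probe results.
        self.window = deque(maxlen=window_size)
        self.error_threshold = error_threshold

    def record(self, probe_ok):
        """Record one probe result; return True if an alert should fire."""
        self.window.append(bool(probe_ok))
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        error_rate = self.window.count(False) / len(self.window)
        return error_rate >= self.error_threshold
```

With a 10-probe window and a 0.3 threshold, three failed probes out of the last ten are enough to fire, so intermittent degradation surfaces within minutes instead of only after a hard outage.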