Delayed sends in APAC region
Incident Report for dotdigital
Postmortem

Summary of Impact

At approximately 14:34 UTC on Friday 9th July 2021, customers in our APAC region may have experienced delays in sending emails and SMS. We restored services to normal at 16:26 UTC.

Root Cause

This was caused by a third-party network-related issue at our colocation data center in Sydney. We were forced to pause email and SMS sends in our APAC region to prevent any data loss.

*Update Tuesday 13th July 2021: The RCA from our third-party supplier is as follows:*
At 14:35 UTC on Friday the 9th of July, the data center in which your solution is hosted suffered an upstream loss of network connectivity.
The cause was a fault in the redundant switching stack of our upstream network vendor, when a Layer 2 software process caused all network paths to stop forwarding network traffic.
Upstream network engineers arrived on site and reset the devices to rectify the issue at 15:49 UTC, and services came back online around 16:08 UTC.
The root cause of the issue is still under in-depth investigation by our upstream vendor and their equipment supplier. A likely outcome will be either a scheduled configuration change to mitigate the issue or an upgrade of the underlying device firmware to achieve the same result.
We deeply apologise for any loss that occurred as a result of this outage. We will continue to discuss the outcome of the above investigation until we are mutually comfortable that the underlying issue has been properly mitigated and will not happen again.

Mitigation

A summarized timeline of events (all times in UTC):

  • 14:36: We were alerted that our sending nodes were unavailable and began investigating
  • 14:50: We raised a connectivity issue with our networking support partner
  • 14:54: We paused sending services
  • 15:03-15:10: We contacted our regional data center support partners via email and phone
  • 15:13: Our data center acknowledged connectivity issues affecting their colocation customers
  • 16:09: We observed the first signs of the third-party network issue being resolved, as we were able to connect to our equipment at the data center
  • 16:26: We restored sending services
  • 16:34: We confirmed that email and SMS processing was back to normal and any backlog had been cleared
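
For readers who want to relate the timeline entries to one another, the elapsed times can be worked out with a short script. This is an illustrative sketch only: the times are copied from the timeline above, only a subset of entries is included, and the helper names (`at`, `minutes_between`) are hypothetical and not part of any dotdigital tooling.

```python
from datetime import datetime, timezone

# Selected timeline entries copied from the report (9 July 2021, all times UTC).
DAY = (2021, 7, 9)
EVENTS = [
    ("14:36", "alerted that sending nodes were unavailable"),
    ("14:54", "sending services paused"),
    ("16:09", "connectivity to data center equipment returned"),
    ("16:26", "sending services restored"),
    ("16:34", "processing confirmed normal, backlog cleared"),
]


def at(hhmm: str) -> datetime:
    """Turn an 'HH:MM' string into a timezone-aware UTC datetime on the incident day."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(*DAY, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two 'HH:MM' times on the incident day."""
    return int((at(end) - at(start)).total_seconds() // 60)


# Print each event with its offset from the first alert.
first_alert = EVENTS[0][0]
for time, label in EVENTS:
    print(f"+{minutes_between(first_alert, time):3d} min  {time}  {label}")

# How long sends were paused, from the pause to the restoration of sending.
paused_minutes = minutes_between("14:54", "16:26")
print(f"Sends paused for {paused_minutes} minutes")
```

On these figures, sending was paused for roughly an hour and a half, and about two hours passed between the first alert and confirmation that the backlog had cleared.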

Next Steps

  • Work is already in progress to review options for cloud-based email sending infrastructure.
Posted Jul 12, 2021 - 13:31 BST

Resolved
This incident has been resolved.
Posted Jul 09, 2021 - 17:33 BST
Monitoring
Engineers have resolved the problem and service has been fully restored. Any sends initiated during this incident will now resume automatically.
Posted Jul 09, 2021 - 17:26 BST
Identified
A networking fault has been identified at our Sydney data center and engineers from our supplier are currently working to resolve the problem.
Posted Jul 09, 2021 - 16:18 BST
Investigating
We've discovered an issue which is preventing email and SMS sends, and our tech team are working flat out to fix it. Sorry if this is affecting you, but things should be back to normal very soon. Look out for more news from us shortly.
Posted Jul 09, 2021 - 15:57 BST
This incident affected: Asia Pacific - Engagement Cloud r3 (Asia Pacific - Mail Sending, Asia Pacific - Transactional Email, Asia Pacific - SMS).