Intermittent failures with Engagement Cloud in Asia Pacific Region
Incident Report for dotdigital
Postmortem

Summary of impact:

From approximately 11:24 UTC on Thursday 19th March 2020, some Engagement Cloud customers in our Asia Pacific Region may have noticed intermittent failures across several functions and services, including:

  • Sending campaigns
  • Viewing reporting data
  • Executing programs

Most operations were retried successfully. However, as the underlying failures became more frequent, users began reporting problems from approximately 02:30 UTC on Friday 20th March 2020.

We identified a fault with the messaging service supplied by our cloud provider. The messaging service allows different components within Engagement Cloud to share data and carry out requests.

From approximately 04:15 UTC on Friday 20th March 2020, services returned to a stable state after the faulty component was removed.

Root Cause:

Engagement Cloud relies upon a messaging service supplied by our cloud provider. We use two instances of this service in a highly available configuration to help manage intermittent failures. However, in this case one of the messaging instances failed completely and wouldn’t accept any connection attempts. This had severe knock-on effects on our systems.

Mitigation:

Once the faulty messaging instance was identified, it was manually removed from service, and our service returned to normal from 04:15 UTC on Friday 20th March 2020.

Next Steps:

We’re working with our cloud provider to understand the root cause of the failure in their service. In addition, we have increased the capabilities of our messaging service load balancing routines. This adjustment will remove the impact of a fully unavailable messaging instance and is being rolled out to each of our hosting regions over the next few weeks.
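
This report does not describe how the load balancing change works. Purely as an illustration, the Python sketch below shows one common way a client can route around a messaging instance that refuses connections: it tries the instances in turn, and temporarily skips any instance that has just failed. Every name here (FailoverSender, the send callback, the cooldown value) is hypothetical and is not taken from Engagement Cloud or from any cloud provider's SDK.

```python
import time
from typing import Callable, Dict, List, Sequence


class AllEndpointsUnavailable(Exception):
    """Raised when no messaging endpoint accepted the message."""


class FailoverSender:
    """Hypothetical client-side failover across messaging endpoints.

    `send` is an assumed callback that delivers one message to one
    endpoint and raises ConnectionError if the endpoint is unreachable.
    """

    def __init__(
        self,
        endpoints: Sequence[str],
        send: Callable[[str, str], None],
        cooldown_seconds: float = 30.0,  # assumed value, tune per system
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints: List[str] = list(endpoints)
        self._send = send
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._skip_until: Dict[str, float] = {}
        self._counter = 0

    def _healthy(self) -> List[str]:
        """Endpoints whose cooldown (if any) has expired."""
        now = self._clock()
        return [e for e in self._endpoints if self._skip_until.get(e, 0.0) <= now]

    def publish(self, message: str) -> str:
        """Send to the first endpoint that accepts; return its name."""
        # If every endpoint is cooling down, try them all rather than give up.
        candidates = self._healthy() or list(self._endpoints)
        # Rotate the starting point so load is spread across healthy endpoints.
        start = self._counter % len(candidates)
        self._counter += 1
        for endpoint in candidates[start:] + candidates[:start]:
            try:
                self._send(endpoint, message)
                return endpoint
            except ConnectionError:
                # Skip this endpoint for a while instead of retrying it
                # on every subsequent message.
                self._skip_until[endpoint] = self._clock() + self._cooldown
        raise AllEndpointsUnavailable("no messaging endpoint accepted the message")
```

With a scheme like this, a completely unavailable instance costs one failed connection attempt per cooldown period rather than one per request, and the remaining instance carries the traffic in the meantime.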

Posted Mar 20, 2020 - 16:39 GMT

Resolved
Services in R3 are now stable. We will follow up with an incident report in the coming days.
Posted Mar 20, 2020 - 04:32 GMT
Identified
One of the messaging services provided by our cloud provider has become faulty. We are now in the process of failing over to a new instance, and users should see service return to normal shortly.
Posted Mar 20, 2020 - 03:59 GMT
Investigating
We are currently experiencing some issues with the Engagement Cloud Asia Pacific Region. We are working to investigate the root cause.
Posted Mar 20, 2020 - 03:25 GMT
This incident affected: Asia Pacific - Engagement Cloud r3 (Asia Pacific - Web Application, Asia Pacific - API, Asia Pacific - Mail Sending, Asia Pacific - Open and Link Tracking, Asia Pacific - Reporting, Asia Pacific - Surveys and Forms, Asia Pacific - Pages and Forms, Asia Pacific - Transactional Email).