Summary of impact:
From approximately 11:24 UTC on Thursday 19th March 2020, some Engagement Cloud customers in our Asia Pacific region may have noticed intermittent failures across many functions and services.
Most operations were retried successfully. However, as the underlying problem increased in frequency, users began reporting problems from approximately 02:30 UTC on Friday 20th March 2020.
We identified a fault with the messaging service supplied by our cloud provider. The messaging service allows different components within Engagement Cloud to share data and carry out requests.
From approximately 04:15 UTC on Friday 20th March 2020, services returned to a stable state after the faulty component was removed.
Engagement Cloud relies upon a messaging service supplied by our cloud provider. We use two instances of this service in a highly available configuration to help manage intermittent failures. However, in this case one of the messaging instances failed completely and wouldn’t accept any connection attempts. This had severe knock-on effects on our systems.
Once the faulty messaging instance was identified, it was manually removed from service, and our platform returned to normal from 04:15 UTC on Friday 20th March 2020.
We’re working with our cloud provider to understand the root cause of the failure in their service. In addition, we have improved the load balancing routines we use for the messaging service. This adjustment will remove the impact of a fully unavailable messaging service, and it is being rolled out to each of our hosting regions over the next few weeks.
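
For readers who want a concrete picture of this kind of change, the sketch below shows one common way a client can route around a messaging instance that refuses all connections. It is an illustration only, not our production code: the class name FailoverBalancer, the instance names, the cooldown value, and the transport callback are all hypothetical, and a real messaging client would differ in detail.

```python
import time
from typing import Callable, Dict, List, Optional


class FailoverBalancer:
    """Round-robin over messaging instances, skipping ones recently seen failing.

    Hypothetical sketch: names and defaults are assumptions, not a real API.
    """

    def __init__(
        self,
        instances: List[str],
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._instances = list(instances)
        self._cooldown = cooldown_seconds
        self._clock = clock
        # Maps an instance name to the time after which it may be tried again.
        self._unhealthy_until: Dict[str, float] = {}
        self._next = 0

    def _is_healthy(self, instance: str) -> bool:
        retry_at = self._unhealthy_until.get(instance)
        return retry_at is None or self._clock() >= retry_at

    def send(self, message: str, transport: Callable[[str, str], None]) -> str:
        """Try each healthy instance at most once; return the one that accepted."""
        last_error: Optional[Exception] = None
        count = len(self._instances)
        for offset in range(count):
            index = (self._next + offset) % count
            instance = self._instances[index]
            if not self._is_healthy(instance):
                continue  # Skip an instance that is still in its cooldown window.
            try:
                transport(instance, message)
            except ConnectionError as exc:
                # Take the instance out of rotation for the cooldown period.
                self._unhealthy_until[instance] = self._clock() + self._cooldown
                last_error = exc
                continue
            self._next = (index + 1) % count
            return instance
        raise ConnectionError("no messaging instance available") from last_error
```

In this sketch, an instance that rejects a connection is taken out of rotation for a cooldown period, so later requests go straight to the remaining instance instead of repeatedly hitting the failed one. After the cooldown it is tried again, which lets it rejoin automatically once it recovers.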