API services offline in Engagement Cloud - Europe
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 03:26 UTC on Thursday 7th November 2019, the European Engagement Cloud API went offline. All subsequent API requests failed until service was restored at 05:57 UTC. A second API outage occurred between 08:43 and 09:08 UTC.

While the API was offline, customers would have experienced failed data synchronizations and API calls when using:

  • eCommerce and CRM connectors
  • Custom Integrations
  • Partner connectors that use our API
  • Transactional Email using our API
  • Our core Engagement Cloud API functionality

All other global, CPaaS and regional services were unaffected.

Root Cause:

Major components of Engagement Cloud region 1 (Europe) are hosted in Azure’s West Europe region. Microsoft Azure experienced multiple failures of storage scale units while adding additional capacity (see the RCA from Microsoft Azure in the appendix below). A bug in Microsoft’s process resulted in an outage of its storage product, which in turn stopped around 15% of the European Virtual Machines used to run the Engagement Cloud. Virtual Machines powering many of our services were shut down, but the high-availability design of our clusters meant that most services continued to run with no customer impact. The only exception was the Engagement Cloud API, which lost all of the machines in its cluster.

Mitigation:

Our Service Operations Team was proactively alerted to the incident shortly after the failure occurred, at 03:30 UTC. After an initial assessment, we posted an incident notification to our customers through our status page. All services are designed with high availability in mind: they are load balanced in clusters across multiple availability sets within a region. This design provides fault tolerance if a problem occurs within a single availability set. However, because the storage issues affected multiple availability sets on a much larger scale, every API server in our European API cluster was taken offline. Our team evaluated invoking DR plans to recover to a secondary region; however, because most services remained online, we decided instead to deploy additional API resources.
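To illustrate the point (this is a simplified sketch, not our actual deployment code; every name and number below is hypothetical), the following Python model shows why a cluster spread across availability sets survives the loss of one set, but not a storage failure that touches every set at once:

    # Simplified, hypothetical model of an API cluster spread across availability sets.
    from dataclasses import dataclass

    @dataclass
    class ApiVm:
        name: str
        availability_set: str
        healthy: bool = True

    # Three VMs load balanced across three availability sets in one region.
    cluster = [
        ApiVm("api-vm-1", "avset-1"),
        ApiVm("api-vm-2", "avset-2"),
        ApiVm("api-vm-3", "avset-3"),
    ]

    def cluster_is_up(vms):
        # The load balancer keeps serving as long as at least one VM is healthy.
        return any(vm.healthy for vm in vms)

    # A failure confined to one availability set: the cluster stays up.
    for vm in cluster:
        vm.healthy = vm.availability_set != "avset-1"
    assert cluster_is_up(cluster)

    # A storage failure spanning every availability set (as in this incident)
    # takes all backends down at once, exhausting the design's fault tolerance.
    for vm in cluster:
        vm.healthy = False
    assert not cluster_is_up(cluster)

This is why services whose clusters lost only some of their machines stayed online, while the API cluster did not.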

Our team continued to work alongside Microsoft to restore the affected virtual machines, and by 05:57 UTC half of the API cluster was back in operation. As further availability sets came back online and we added additional API resources, the API cluster suffered a further disruption from 08:43 to 09:08 UTC.

Our team continued to monitor customer impact and to restore the remaining affected Virtual Machines (part of the ~15% that had been stopped) within our clusters. At 11:03 UTC, we were comfortable that our API services had improved significantly, with minimal customer impact being observed, so we closed the incident on our status page. We were back to full capacity, with all Virtual Machines restored, at 12:41 UTC.

Next Steps:

We’ll:

  • Review options for spreading Engagement Cloud resources across Azure’s availability zones (in addition to the existing availability sets) and across additional regions within Europe; a rough sketch of this idea follows this list.
  • Continue discussions with Microsoft on what went wrong with their process and how they plan to reduce the likelihood and impact of a similar event occurring again.
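As a rough, purely illustrative sketch of the first item above (hypothetical region and VM names only, not a commitment to a particular design), spreading the same cluster across availability zones and a second European region would add further independent failure domains:

    # Hypothetical placement of API VMs across availability zones and a second
    # region, extending the model above. Names are for illustration only.
    from itertools import cycle

    REGION_ZONE_PAIRS = [
        ("westeurope", "1"), ("westeurope", "2"), ("westeurope", "3"),
        ("northeurope", "1"), ("northeurope", "2"), ("northeurope", "3"),
    ]

    def place_vms(count):
        # Round-robin VMs over (region, zone) pairs so that no single zone
        # or region holds the whole cluster.
        pairs = cycle(REGION_ZONE_PAIRS)
        return [
            {"name": f"api-vm-{i + 1}", "region": region, "zone": zone}
            for i, (region, zone) in zip(range(count), pairs)
        ]

    for placement in place_vms(6):
        print(placement)

A storage fault confined to one zone or one region would then leave the other placements serving traffic.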

Appendix: RCA from Microsoft Azure as of 8th November 2019 08:58 UTC

RCA - Storage - West Europe

Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.

Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. The automated load-balancing operations occur in the background to ensure all the scale units are running at healthy utilization levels and are designed to be impactless for customer facing operations. During this incident, we had just enabled three storage scale units to balance the load between them, to keep up with changing utilization on the scale units. A bug in this process resulted in backend roles crashing across the scale units participating in the load balancing operations, causing them to become unhealthy. It also impacted services dependent on storage in the region.

Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • The fix has already been deployed to the impacted scale units in the region. We are currently deploying the fix globally to our fleet.
  • We have been performing cross-scale-unit load-balancing operations numerous times before without any adverse effect. In the wake of this incident, we are reviewing our procedures, tooling and service again for such load balancing operation. We have paused further load balancing actions in this region until this review is completed.
  • We are rigorously reviewing all our background management processes and deployments to prevent any further impact to customers in this region.
  • We are reviewing our validation procedures gaps to catch these issues in our validation environment.
Posted Nov 08, 2019 - 14:56 GMT

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved. We'll continue to keep a close eye on things.

We’re going to write a report to share the specific details of today’s issue with our cloud service provider. We’ll attach it here in a few days when it’s ready.

Sorry about the interruption this issue caused.
Posted Nov 07, 2019 - 11:03 GMT
Update
We're pleased to say our services are continuing to operate normally. We're still monitoring the situation and working with our cloud provider. We'll provide another update at 11:00 (UTC) or sooner if we have news to share.
Posted Nov 07, 2019 - 10:05 GMT
Monitoring
All services are currently online. We continue to monitor and add additional capacity in reaction to an outage with our cloud provider.
Posted Nov 07, 2019 - 08:43 GMT
Update
We are continuing to work on a fix for this issue.
Posted Nov 07, 2019 - 08:18 GMT
Update
Our Cloud Service provider is still working through mitigation steps.
All Engagement Cloud Services in Europe are operational but customers may experience some degraded performance until the issue is fully resolved.
Posted Nov 07, 2019 - 07:34 GMT
Update
The issue with our cloud service provider is still ongoing. They are actively working on mitigation steps to fix this.
Further updates will be provided when available.
Posted Nov 07, 2019 - 06:45 GMT
Update
The incident with our cloud service provider is still ongoing. We are in communication with them to establish their mitigation steps.
Further updates will be provided when available.
We apologise for any inconvenience this may cause.
Posted Nov 07, 2019 - 05:02 GMT
Identified
An issue has been identified with our Cloud Service provider. They are currently experiencing issues with disk storage associated with a subset of our infrastructure.
We are in contact with them and will provide updates in due course.
Posted Nov 07, 2019 - 04:14 GMT
Investigating
We are currently experiencing issues with multiple Engagement Cloud services in our European region. Our teams are investigating the issue and we will provide an update in due course.
Apologies for any inconvenience.
Posted Nov 07, 2019 - 04:00 GMT
This incident affected: Europe - Engagement Cloud r1 (Europe - API, Europe - Surveys and Forms, Europe - Transactional Email).