Summary of impact:
At approximately 03:26 UTC on Thursday 7th November 2019, the European Engagement Cloud API went offline. All subsequent API requests failed until service was restored at 05:57 UTC. A second API outage occurred between 08:43 and 09:08 UTC.
While the API was offline, customers would have experienced failed data synchronizations and failed API calls across services that depend on the European Engagement Cloud API.
All other global, CPaaS and regional services were unaffected.
Major components of the Engagement Cloud region 1 (Europe) are hosted in Azure's West Europe region. Microsoft Azure experienced multiple failures of storage scale units while adding additional capacity (see the RCA from Microsoft Azure in the appendix below). A bug in Microsoft's process resulted in an outage of its storage product, which in turn stopped ~15% of the European Virtual Machines used to run the Engagement Cloud. Virtual Machines powering many of our services were shut down, but the high availability design of our clusters meant that most services continued to run with no customer impact. The only exception was the Engagement Cloud API, which suffered a loss of all machines in its cluster.
Our Service Operations Team was proactively alerted to the incident shortly after the failure occurred, at 03:30 UTC. After an initial assessment, we posted an incident notification to our customers through our status page. All of our services are designed with high availability in mind: they are load balanced in clusters across multiple availability sets within a region, which provides fault tolerance if a problem occurs within a single availability set. In this case, however, the storage issues spanned multiple availability sets on a much larger scale, and affected every API server in our European API cluster. Our team evaluated invoking disaster recovery (DR) plans to fail over to a secondary region; however, because most services remained online, we decided instead to focus on deploying additional API resources.
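The availability-set design described above can be illustrated with a minimal sketch. The cluster, server names, and set labels below are hypothetical, not our actual infrastructure; the point is only that routing survives the loss of any single availability set but fails when every set is down, as happened to the API cluster in this incident.

```python
import random

class Cluster:
    """Toy model of a load-balanced cluster spread over availability sets."""

    def __init__(self, servers):
        # servers: mapping of server name -> availability set label
        self.servers = dict(servers)
        self.down_sets = set()

    def fail_set(self, av_set):
        """Simulate a storage failure taking an availability set offline."""
        self.down_sets.add(av_set)

    def healthy(self):
        return [s for s, a in self.servers.items() if a not in self.down_sets]

    def route(self):
        """Load-balance a request to any healthy server; raise if none remain."""
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("all availability sets down: total outage")
        return random.choice(candidates)

# Hypothetical three-server API cluster, one server per availability set.
api = Cluster({"api-1": "set-a", "api-2": "set-b", "api-3": "set-c"})
api.fail_set("set-a")            # single-set failure: service continues
assert api.route() in {"api-2", "api-3"}
api.fail_set("set-b")
api.fail_set("set-c")            # multi-set failure, as in this incident:
                                 # api.route() would now raise
```

The design trades hardware redundancy for resilience to single-fault-domain failures; it offers no protection when, as here, a platform-level bug takes out multiple fault domains at once.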
Our team continued to work alongside Microsoft to restore the affected virtual machines, and by 05:57 UTC half of the API cluster was back in operation. As further availability sets came back online and we added additional API resources, the API cluster suffered a second disruption between 08:43 and 09:08 UTC.
Our team continued to monitor customer impact and restore the remaining affected Virtual Machines in our clusters. At 11:03 UTC, with API service significantly improved and minimal customer impact observed, we closed the incident on our status page. We were back to full capacity, with all Virtual Machines restored, at 12:41 UTC.
Appendix: Microsoft Azure RCA - Storage - West Europe
Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.
Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. The automated load-balancing operations occur in the background to ensure all the scale units are running at healthy utilization levels and are designed to be impactless for customer facing operations. During this incident, we had just enabled three storage scale units to balance the load between them, to keep up with changing utilization on the scale units. A bug in this process resulted in backend roles crashing across the scale units participating in the load balancing operations, causing them to become unhealthy. It also impacted services dependent on storage in the region.
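The background rebalancing Microsoft describes can be sketched as follows. This is a hedged illustration only: the utilization ceiling, migration step size, and scale-unit names are invented, and the real operation is far more complex (it was a bug in this process, not the algorithm's intent, that crashed the backend roles).

```python
# Assumed parameters, expressed as integer percentages for clarity.
HEALTHY_MAX = 70   # hypothetical target utilization ceiling per scale unit
STEP = 5           # hypothetical amount of load migrated per operation

def rebalance(utilization):
    """Shift load from the hottest scale unit to the coolest until every
    unit sits at or below the healthy utilization ceiling."""
    units = dict(utilization)
    while max(units.values()) > HEALTHY_MAX:
        hot = max(units, key=units.get)
        cool = min(units, key=units.get)
        if units[hot] - units[cool] < STEP:
            break  # nothing useful left to move
        units[hot] -= STEP
        units[cool] += STEP
    return units

# Two busy scale units plus a freshly enabled, empty one:
after = rebalance({"scale-unit-1": 85, "scale-unit-2": 80, "scale-unit-3": 0})
assert max(after.values()) <= HEALTHY_MAX
```

The key property is that the operation is incremental and should be invisible to customer-facing traffic; the incident arose because a bug in the implementation of such a process crashed the participating scale units instead.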
Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.
Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):