Application slowness and delays to email campaign reporting in Europe (Region 1)
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 09:58 UTC on Thursday 3rd September 2020, we experienced slowness across our platform and delays to email campaign reporting as a result of network connectivity and latency issues with our third-party cloud service provider in Europe. Normal service resumed at 13:57 UTC the same day.

Customers may have experienced some of the following issues:

  • Delays in collecting campaign reporting data
  • Problems with split test campaigns
  • Slow page load times when using their account or interacting with sent emails and SMS
  • Delays in email and SMS sends

Root Cause:

Our third-party cloud provider suffered a partial network outage in their West Europe region. Whilst traffic was rerouted to alternate paths, these became overloaded, resulting in high network latency and timeouts. For further detail, please see an excerpt of the RCA from our cloud provider:

Summary of impact: Between 09:30 (approx.) and 17:32 UTC on 03 Sep 2020, you were identified as a customer who may have experienced intermittent latency or issues connecting to resources hosted in West Europe. Retries may have worked during this time frame. 

Preliminary root cause: We determined that a localized fiber-cut event had occurred in the vicinity of one of our West Europe Datacentres. Azure Networking monitoring automatically re-routed traffic to reduce the potential for impact to customers. However during this traffic redirect, a subset of the alternate paths became heavily utilized, resulting in latency and timeouts for some customers. 

Mitigation: We optimized traffic across the various paths in the West Europe region, and manually brought additional capacity online to help quickly reduce the congestion and restore normal levels of service. Engineers continued to monitor for an extended period to ensure all network latency in the region has been restored.

Mitigation:

The timeline for resolving this issue was:

  • 09:58: We began investigating the delays to email campaign reporting and discovered that a backlog was building. During initial triage we found network latency and timeouts between our servers and our message queue services, as well as degraded performance on some of our servers
  • 10:08: We restarted our services that process email campaign reporting and monitored
  • 10:38: Posted to status page whilst we continued our investigation
  • 11:00: We continued to see degraded performance processing message queues and decided to redeploy the affected servers as a potential fix
  • 11:14: Through further investigation we found that the issue appeared to lie with resources hosted by our cloud provider, specifically in the West Europe region. We made changes to fail over to our secondary message queue in another European region (a simplified sketch of this failover pattern follows the timeline)
  • 11:31: After monitoring following the failover, we saw improved message processing on our secondary queue
  • 11:36: We started scaling up and provisioning additional resources in preparation for processing the backlog that had accrued. We also started to move any backlogged messages still on our primary message queue to the secondary
  • 11:40: We began restarting all of our services and websites in Europe to ensure all would use our secondary message queue
  • 12:15: All services had been restarted and we began monitoring
  • 12:20: Additional workers had completed provisioning and began processing the backlog of messages
  • 12:41: We saw issues with access to cloud storage in West Europe and stopped all sending services to mitigate issues with email & SMS sending
  • 12:54: Storage issues cleared and we slowly began resuming sending services whilst monitoring for stability
  • 13:15: Sending services were brought fully back online
  • 13:57: We resolved the incident, satisfied that normal service had resumed, while the remainder of the email campaign reporting backlog continued to be processed and catch up
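
For illustration only, here is a minimal sketch of the failover and backlog-drain steps described in the 11:14 to 12:20 entries above, assuming a generic message-queue client. This is not our production code: FailoverQueue, QueueUnavailableError and the injected queue clients (with send, receive and per-message complete methods) are hypothetical placeholders, and the real behaviour depends on the message-queue SDK in use.

```python
import time


class QueueUnavailableError(Exception):
    """Raised when a queue endpoint times out or refuses connections."""


class FailoverQueue:
    """Wraps a primary and a secondary queue client and fails over between them."""

    def __init__(self, primary, secondary):
        # 'primary' and 'secondary' are hypothetical queue clients exposing
        # send(body), receive(max_messages) and a per-message complete().
        self.primary = primary
        self.secondary = secondary
        self.active = primary  # start on the primary region

    def fail_over(self):
        """Point all new traffic at the secondary queue."""
        self.active = self.secondary

    def drain_backlog(self, batch_size=100):
        """Move any messages still sitting on the primary onto the secondary."""
        while True:
            batch = self.primary.receive(max_messages=batch_size)
            if not batch:
                return
            for message in batch:
                self.secondary.send(message.body)
                message.complete()

    def send(self, body, retries=3, backoff_seconds=1.0):
        """Send via the active queue, failing over after repeated timeouts."""
        for attempt in range(retries):
            try:
                return self.active.send(body)
            except QueueUnavailableError:
                time.sleep(backoff_seconds * (attempt + 1))
        self.fail_over()
        return self.active.send(body)
```

In the incident itself the equivalent switch was made manually at 11:14, with the backlogged messages moved across from 11:36 onwards.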

Next Steps:

We are working with our cloud provider to ensure they are taking appropriate action. In addition, we have already identified a roadmap of improvements to utilise multiple sites within the European region.
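
To illustrate the direction of that roadmap item, the sketch below shows one possible shape for using multiple sites: consumers run against queues in more than one site at once, so a network problem at a single site degrades throughput rather than stopping processing. This is an assumption-laden illustration, not our actual design; the site names, endpoints and client interface are placeholders.

```python
import threading

# Hypothetical site -> queue endpoint mapping; names and endpoints are placeholders.
SITES = {
    "site-a": "queue-endpoint-for-site-a",
    "site-b": "queue-endpoint-for-site-b",
}


def run_consumer(endpoint, make_client, handle_message):
    """Consume reporting messages from one site's queue in a loop."""
    client = make_client(endpoint)  # make_client wraps whichever queue SDK is in use
    while True:
        for message in client.receive(max_messages=50):
            handle_message(message)
            message.complete()


def start_all_sites(make_client, handle_message):
    """Run one consumer thread per site so no single site is a hard dependency."""
    threads = []
    for endpoint in SITES.values():
        worker = threading.Thread(
            target=run_consumer,
            args=(endpoint, make_client, handle_message),
            daemon=True,
        )
        worker.start()
        threads.append(worker)
    return threads
```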

Posted Sep 04, 2020 - 16:52 BST

Resolved
We’ve monitored the situation for some time now.
The send backlog has already completely cleared. The email campaign reporting is still updating and will finish catching up shortly.
Sorry it happened and for the interruption it caused. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Sep 03, 2020 - 14:57 BST
Monitoring
We’ve now resumed sends and are seeing our send backlog begin to clear. In addition, our email campaign reporting is beginning to update too. We’ll continue to monitor things from here on. More news to follow soon.
Posted Sep 03, 2020 - 14:11 BST
Update
We're continuing to work on a fix for this issue. We've temporarily paused email sends (in Europe/region 1) while we work with our third-party cloud provider to resolve the issue. We'll provide another update soon. Sorry for the disruption this is causing.
Posted Sep 03, 2020 - 13:41 BST
Identified
Thanks for your patience. We’re making some real progress on resolving this issue. We’ll be back with another update very soon.
Posted Sep 03, 2020 - 12:20 BST
Investigating
We're investigating an issue with delays for email campaign reporting. Sorry if you're affected, but our tech team are working as quickly as possible to resolve the issue and get things back to normal. We'll share another update shortly.
Posted Sep 03, 2020 - 11:37 BST
This incident affected: Europe - Engagement Cloud r1 (Europe - Mail Sending, Europe - Reporting).