Intermittent login issues, paused email/SMS sends and other services offline
Incident Report for dotdigital
Postmortem

Summary of impact:

From approximately 11:00 UTC on 29 July 2021, we had reports of intermittent user login failures in our North America (R2) and Asia Pacific (R3) regions. We were able to replicate this and could confirm our Europe (R1) region was unaffected.

As a precautionary measure, we paused message sending and services in all regions until we were able to confirm no data was being lost and could pinpoint the exact issue.

Root Cause:

After investigation, we found our partner Cloudflare (CF) had begun a rollout of Edge Side Code (at approx. 08:00 UTC) across their platform. As it propagated globally, it affected the routing of our platform and white-label domain addresses throughout their network. Cloudflare later found many of their global Edge nodes were no longer able to run the Edge Side Code required. Cloudflare posted information on their status page.

Mitigation:

Our timeline for rectifying this issue was (times stated in UTC):

  • 11:00: We received reports from a small group of customers who were experiencing login errors in R2/R3 regions.
  • 12:00: We reviewed all platform audit logs, SQL and Service Bus configurations for any issues or errors
  • 13:00: We confirmed our staging environments were experiencing the same intermittent login redirection to production
  • 13:50: We paused sending services in R3 to avoid any risk of data loss
  • 14:00: We completed a rollback of our webapps in R2/R3 to rule out code-related issues
  • 14:30: We purged the Cloudflare webapp domain cache to ensure this was not a caching issue
  • 15:15: We paused all services in all regions except for web services
  • 15:30: We conducted tests locally and on our staging environments that bypassed our Cloudflare services, to confirm the issue lay with the third-party partner
  • 16:00: We contacted Cloudflare support to help identify a custom hostname issue with one of our domains
  • 16:30: We brought R1 services back online once we had confirmed this region was unaffected
  • 17:00: Cloudflare support engaged with us in tracing the root cause
  • 17:30: 

    • We made DNS changes to R2 and R3 to circumvent Cloudflare for some of our public endpoints. At this point, we saw a significant drop in error rates. 
    • We brought services back online in R2 and R3
    • We continued to monitor dropping error rates in all regions
  • 18:30: 

    • We replayed the same DNS changes to R1 endpoints
    • We continued to see error rates drop in all regions as we waited for propagation
  • 21:00: 

    • Cloudflare confirmed a rollout of updated Edge Side Code had not been fully completed across their network
    • They updated their status page
    • We could no longer see the error rates we had seen from earlier in the day
  • 00:00: Cloudflare confirmed the issue had been resolved at their end.

Next Steps:

Cloudflare allows our Branded and Custom From Addresses (CFA) to each have a different origin server based on the region they are created in. This information is stored as part of the metadata of the CFA record in Cloudflare.

Cloudflare use Edge Side Code (ESC) on their global network to intercept traffic from these CFA domains and make the necessary adjustments to the origin server routing decision. In the event that a CFA's metadata does not contain an origin server or the ESC does not execute, a fallback option is used, which in our case directs traffic to R1 origin servers.
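
The routing decision described above can be illustrated with a minimal sketch. This is not Cloudflare's code or our production configuration: the hostnames, the `origin_region` metadata key and the `resolve_origin` function are all hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional

# Hypothetical origin hostnames; the real values are not part of this report.
REGION_ORIGINS = {
    "r1": "origin.r1.example.com",
    "r2": "origin.r2.example.com",
    "r3": "origin.r3.example.com",
}

# The fallback described in this report directs traffic to R1.
FALLBACK_REGION = "r1"


def resolve_origin(metadata: Optional[Mapping[str, str]], edge_code_ran: bool) -> str:
    """Pick an origin server for a request to a CFA domain.

    `metadata` stands in for the CFA record's metadata and `edge_code_ran`
    for whether the Edge Side Code executed on the edge node; both are
    assumptions made for this sketch.
    """
    if not edge_code_ran:
        # The edge code did not execute, so the fallback is used.
        return REGION_ORIGINS[FALLBACK_REGION]
    region = (metadata or {}).get("origin_region")
    # Missing or unrecognised region metadata also falls back to R1.
    return REGION_ORIGINS.get(region, REGION_ORIGINS[FALLBACK_REGION])
```

Under these assumptions, a request for an R2 domain that reaches an edge node where the code did not run is sent to R1 rather than R2. That would be consistent with login failures appearing intermittently in R2 and R3 while R1 was unaffected.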

We understand from Cloudflare that, as part of upgrades to their services, new metadata fields are now available that are better served by their ESC and would remove our dependency on them maintaining our existing configuration. We’ve already engaged with their solution team on how best to migrate our existing CFA origin servers across to these fields in a way that’s transparent to our customers, and we will continue to deploy those changes.

We’re sorry this incident occurred and we’re grateful for your patience while we worked on resolving it.

Posted Jul 30, 2021 - 17:48 BST

Resolved
We've pinpointed the issue with our third-party provider and have now resolved the issue. We’ve monitored the situation for some time now and we’re confident this issue is resolved.
Sorry it happened and for the interruption it has caused today. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here in a few days when it’s ready.
Posted Jul 29, 2021 - 21:07 BST
Update
We've made some additional adjustments to redirect our network traffic and we've seen a significant reduction in login issues in the R2 and R3 regions. We've noticed the ongoing networking issues are also causing occasional problems with saving and/or displaying images within campaigns. We're continuing to monitor the situation and will provide another update soon.
Posted Jul 29, 2021 - 19:35 BST
Update
We're sorry to say we're seeing some intermittent login issues again in our North America (R2) and Asia Pacific (R3) regions. We're continuing with our work to fully resolve the issue.
Posted Jul 29, 2021 - 18:42 BST
Monitoring
We're pleased to say we've resolved the intermittent login issues in our North America (R2) and Asia Pacific (R3) regions. We've traced the issue to a third-party provider and we've since added mitigation. We'll continue monitoring the situation.
Posted Jul 29, 2021 - 18:12 BST
Identified
All services in Europe (R1) continue to remain stable.
All services have now been restored in our North America (R2) and Asia Pacific (R3) regions. We have a backlog of email/SMS sends to process which will complete quickly. However, some customers in R2 and R3 will still be experiencing intermittent login issues. We're continuing to address the login issues and will post another update soon.
Posted Jul 29, 2021 - 17:00 BST
Update
In the Europe (R1) region, we've restored all services. We have a backlog of email/SMS sends to process which will complete quickly.
In the North America (R2) and Asia Pacific (R3) regions, all services including all email/SMS sends, importing/exporting contacts, program enrollments and segment refreshes remain offline at present. However, we're aiming to bring services in these regions back online shortly.
We'll provide another update as soon as we have more news. Thanks for your patience while we've been working on resolving this issue.
Posted Jul 29, 2021 - 16:42 BST
Update
We've now paused some additional services while we resolve the intermittent login issue. To confirm, all sending activity (email and SMS test/campaign sends) is still paused and, additionally, importing/exporting contacts, program enrollments, segment refreshes, etc. are currently unavailable. We're really sorry about the disruption. We're working as quickly as possible to restore everything to normal.
Posted Jul 29, 2021 - 15:10 BST
Update
We’re making good progress on resolving the intermittent login issues. All sending services (in all regions) remain in a paused state, meaning test sends and campaign sends will be held in a queue until we restart services (at which point they'll be sent). Sorry about the disruption, folks. We’ll be back with another update very soon.
Posted Jul 29, 2021 - 14:17 BST
Investigating
Hey, folks. We're receiving some reports from customers experiencing intermittent login issues. We've temporarily paused services while we investigate the issues. We'll post an update as soon as we can.
Posted Jul 29, 2021 - 13:23 BST
This incident affected: Global - Login Page, Asia Pacific - Engagement Cloud r3 (Asia Pacific - Mail Sending, Asia Pacific - Contact Imports, Asia Pacific - SMS, Asia Pacific - Email to SMS), North America - Engagement Cloud r2 (North America - Mail Sending, North America - Contact Imports, North America - SMS, North America - Email to SMS), and Europe - Engagement Cloud r1 (Europe - Mail Sending, Europe - Contact Imports, Europe - SMS, Europe - Email to SMS).