Image library, campaign saving and tracked link issues - Europe
Incident Report for dotdigital
Postmortem

Summary of impact:

On Friday 31st January, we experienced two issues:

  1. At approximately 10:00 UTC, our support team began receiving reports from a small number of users experiencing TLS certificate errors when accessing our image library. This prevented them from managing their image library and amending images within EasyEditor. This impacted all Engagement Cloud regions.
  2. At approximately 13:09 UTC, some Engagement Cloud services in Europe experienced degraded performance. Most significantly, this impacted email tracking, SMS tracking, unsubscribing and landing pages. Service was fully restored at approximately 15:53 UTC.

Root Cause:

  1. The TLS errors seen when accessing our image library occurred when end-user security software incorrectly categorised our image domain as malicious. The small number of users with this security software were directed to a different IP address, which displayed an access denied page. It’s likely the security software reacted to a legitimate DNS change made the previous day: the image library domain, r1-scaler.ddglib.com, was changed to a CNAME record in response to a denial of service attack on our DNS infrastructure. Although this is a valid change, it appears to have triggered the misclassification.
  2. The unrelated performance degradation impacted all users of our European instance. The Engagement Cloud relies heavily on a messaging service provided by our cloud hosting supplier. This service allows many services within Engagement Cloud to exchange data for processing. At 12:04 UTC, we received intermittent errors from various systems indicating problems communicating with the messaging service. These problems worsened and eventually took our message tracking website offline.
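
The redirection described in item 1 can be spotted from an affected machine by comparing the addresses its resolver returns with the addresses the domain is expected to have. The following is a minimal sketch only, not the tooling used during this incident. The function names are hypothetical, and the IP addresses in the usage note are reserved documentation addresses, not real Engagement Cloud addresses.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str, lookup: Callable = socket.getaddrinfo) -> Set[str]:
    """Return the IPv4 addresses the local resolver gives for hostname.

    `lookup` is injectable so the check can be exercised without a network.
    """
    results = lookup(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr),
    # and for IPv4 sockaddr is (ip, port).
    return {sockaddr[0] for _, _, _, _, sockaddr in results}


def unexpected_addresses(
    hostname: str,
    expected: Iterable[str],
    lookup: Callable = socket.getaddrinfo,
) -> Set[str]:
    """Return resolved addresses that are not in the expected set.

    A non-empty result suggests something between the user and the
    authoritative DNS (for example, filtering software) rewrote the answer.
    """
    return resolve_ipv4(hostname, lookup) - set(expected)
```

For example, `unexpected_addresses("r1-scaler.ddglib.com", {"198.51.100.7"})` returns an empty set when the resolver agrees with the (assumed) expected address, and returns the block-page address when the answer has been rewritten.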

Mitigation:

  1. The CNAME DNS change was reverted and we contacted security software vendors to have the block removed. End users would have seen their access restored over the rest of the day.
  2. At 15:34 UTC, engineers replaced one of the messaging services with a newly deployed instance and service was partially restored. Numerous systems then required a restart to pick up this change throughout Engagement Cloud, and at approximately 15:53 UTC service was fully restored.

Next Steps:

We use multiple instances of our cloud provider’s messaging service in an active/active fashion. This is designed to mitigate the impact of service disruption should our cloud provider have an outage in a single location. However, in this specific scenario the mechanism didn't appear to perform as designed when the messaging service failed. We believe this is because the service didn't fail completely and therefore didn't appear offline. Instead, the application continued to see the faulty messaging service as online, but in an inconsistent state. We will continue our investigations to understand this further, including whether it was an issue with the secondary service bus or something else. We’re also working with our cloud provider to understand what went wrong with the faulty messaging instance and/or the communication to it.
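
The failure mode described above, where an instance looks online but does not behave correctly, is the kind of case a round-trip health probe is meant to catch. The following is a minimal sketch under stated assumptions, not a description of how Engagement Cloud or the cloud provider's service works. `InMemoryBus` is a hypothetical in-memory stand-in for one messaging instance, and the probe assumes a dedicated queue that carries no other traffic.

```python
import queue
import uuid
from typing import List, Optional


class InMemoryBus:
    """Hypothetical stand-in for one messaging instance."""

    def __init__(self, name: str, drops_messages: bool = False):
        self.name = name
        self.drops_messages = drops_messages
        self._queue: "queue.Queue[str]" = queue.Queue()

    def is_connected(self) -> bool:
        # A degraded instance can still accept connections, so this
        # check alone cannot tell a faulty instance from a healthy one.
        return True

    def send(self, body: str) -> None:
        # A faulty instance accepts the message but never delivers it.
        if not self.drops_messages:
            self._queue.put(body)

    def receive(self, timeout: float) -> Optional[str]:
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None


def round_trip_ok(bus: InMemoryBus, timeout: float = 0.1) -> bool:
    """Send a uniquely tagged probe and confirm it comes back in time."""
    probe = f"probe-{uuid.uuid4()}"
    bus.send(probe)
    return bus.receive(timeout) == probe


def healthy_instances(buses: List[InMemoryBus], timeout: float = 0.1) -> List[str]:
    """Return names of instances that are connected and deliver probes."""
    return [b.name for b in buses if b.is_connected() and round_trip_ok(b, timeout)]
```

With `InMemoryBus("primary", drops_messages=True)` and `InMemoryBus("secondary")`, both instances report as connected, but only `secondary` passes the round-trip probe, so traffic would be routed to it alone.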

Update - 10 February 2020

Our cloud supplier have now confirmed we suffered a service outage in their messaging service. They have identified an internal component which failed to work properly after an internal security token expired.

Posted Feb 04, 2020 - 10:50 GMT

Resolved
We've continued to see stability with the re-routing of traffic we applied earlier.

For customers who are experiencing issues relating to the image libraries, we're still in conversations with OpenDNS. We'll continue discussions with customers who've raised support tickets once we hear back from OpenDNS.

We’re going to create a detailed report to explain what happened today. If you look back here in a day or two, you’ll find it attached. Sorry about the interruption this caused today.
Posted Jan 31, 2020 - 17:47 GMT
Monitoring
Great news! The fix we put in place has restored platform stability and services are running normally.

We're still investigating issues from some customers where it seems OpenDNS is mis-classifying some of our domains, causing issues with displaying and/or editing of image libraries.

We’ll continue to monitor the situation from here on. More news to follow soon.
Posted Jan 31, 2020 - 16:49 GMT
Update
We appreciate your patience while we've been working on resolving this issue. We've identified the component causing the issue and we're beginning to divert traffic now. We'll review the results and post another update here very soon.
Posted Jan 31, 2020 - 15:58 GMT
Update
We are continuing to investigate this issue.
Posted Jan 31, 2020 - 15:14 GMT
Update
Thanks for your patience. We’re continuing to investigate and resolve these issues. We’ll be back with another update very soon. Sorry about the disruption.
Posted Jan 31, 2020 - 15:02 GMT
Update
We're seeing issues relating to our image library and some delays to:

- Web behaviour tracking
- Reporting data
- Saving campaigns.

Sorry if you're affected, but our tech team are working as quickly as possible to resolve the issue and get things back to normal. We'll share another update shortly.
Posted Jan 31, 2020 - 14:19 GMT
Update
We are continuing to investigate this issue.
Posted Jan 31, 2020 - 14:08 GMT
Investigating
In addition to the Image Library issues that are currently being investigated, we are investigating performance issues and delays with the European instance of Engagement Cloud.
Further updates will be provided in due course.
We apologise for any inconvenience this may cause.
Posted Jan 31, 2020 - 11:33 GMT
This incident affected: Europe - Engagement Cloud r1 (Europe - Web Application, Europe - Open and Link Tracking, Europe - Surveys and Forms, Europe - Landing Pages) and Global - Image Hosting.