Summary of impact:
On Friday 31st January, we experienced two issues:
- At approximately 10:00 UTC, our support team began receiving reports from a small number of users experiencing TLS certificate errors when accessing our image library. This prevented them from managing their library and amending images within EasyEditor, and affected all Engagement Cloud regions.
- At approximately 13:09 UTC, some Engagement Cloud services in Europe experienced degraded performance. Most significantly, this impacted email tracking, SMS tracking, unsubscribing and landing pages. At approximately 15:53 UTC, service was fully restored.
Root cause:
- The TLS errors seen when accessing our image library occurred when end-user security software incorrectly categorised our image domain as malicious. The small number of users running this security software were directed to a different IP address, which displayed an access-denied page. It’s likely the security software reacted to a legitimate DNS change made the previous day: the image library domain, r1-scaler.ddglib.com, was changed to a CNAME record in response to a denial-of-service attack on our DNS infrastructure. Although this was a valid change, it appears to have triggered the misclassification.
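As a rough illustration of why such a change is visible to third-party security software, the sketch below follows a toy in-memory record set through resolution before and after the switch. The CNAME target and IP address here are invented for the example; only the domain name comes from this report.

```python
# Toy resolver showing how switching a domain from a direct A record to a
# CNAME changes the resolution chain that security software observes.
# "scaler.cdn.example.net" and the IP are hypothetical placeholders.

RECORDS_BEFORE = {
    "r1-scaler.ddglib.com": ("A", "203.0.113.10"),
}

RECORDS_AFTER = {
    "r1-scaler.ddglib.com": ("CNAME", "scaler.cdn.example.net"),
    "scaler.cdn.example.net": ("A", "203.0.113.10"),
}

def resolve(name, records):
    """Follow CNAMEs until an A record is reached; return the chain and IP."""
    chain = [name]
    rtype, value = records[name]
    while rtype == "CNAME":
        chain.append(value)
        rtype, value = records[value]
    return chain, value

# Before the change: one direct hop. After: an extra CNAME hop to the
# same address, which reputation systems may treat as a new, unknown entry.
print(resolve("r1-scaler.ddglib.com", RECORDS_BEFORE))
print(resolve("r1-scaler.ddglib.com", RECORDS_AFTER))
```

The address at the end of the chain is unchanged; only the shape of the lookup differs, which is consistent with a reputation system reacting to the record itself rather than to the destination.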
- The unrelated performance degradation impacted all users of our European instance. Engagement Cloud relies heavily on a messaging service provided by our cloud hosting supplier, which allows the many services within Engagement Cloud to exchange data for processing. At 12:04 UTC, we received intermittent errors from various systems indicating problems communicating with the messaging service. These problems worsened and eventually took our message tracking website offline.
Remediation:
- The CNAME DNS change was reverted and we contacted the security software vendors to have the block removed. End users would have seen their access restored over the course of the day.
- At 15:34 UTC, engineers replaced one of the messaging service instances with a newly deployed instance and service was partially restored. To activate this change throughout Engagement Cloud, numerous systems required a restart; at approximately 15:53 UTC, service was fully restored.
We use multiple instances of our cloud provider’s messaging service in an active/active configuration. This is designed to mitigate the impact of service disruption should our cloud provider have an outage in a single location. However, this mechanism did not perform as designed in this specific scenario. We believe this is because the messaging service did not fail completely and therefore never appeared offline: the application continued to see the faulty instance as online, but in an inconsistent state. We will continue our investigations to understand whether this was an issue with the secondary service bus or something else, and we are working with our cloud provider to understand what went wrong with communication to and/or from the faulty messaging instance.
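The behaviour described above can be sketched with a minimal failover loop. This is a hypothetical model, not our production code: it shows that exception-based failover handles an instance that is hard down, but an instance that keeps answering with bad results is never skipped.

```python
class Bus:
    """Hypothetical stand-in for one messaging service instance."""

    def __init__(self, name, healthy=True, consistent=True):
        self.name, self.healthy, self.consistent = name, healthy, consistent

    def send(self, msg):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is offline")
        # A half-failed instance accepts the call but mishandles the message.
        return msg if self.consistent else None

def send_active_active(buses, msg):
    """Try each instance in turn; fail over only on a hard error."""
    for bus in buses:
        try:
            return bus.send(msg)  # an inconsistent instance still "succeeds"
        except ConnectionError:
            continue              # hard failure -> try the next instance
    raise RuntimeError("all messaging instances offline")

# Hard failure: eu-1 is offline, so the call fails over to eu-2.
assert send_active_active([Bus("eu-1", healthy=False), Bus("eu-2")], "event") == "event"

# Half failure: eu-1 never raises, so failover never triggers and the
# caller receives an inconsistent (None) result instead of using eu-2.
assert send_active_active([Bus("eu-1", consistent=False), Bus("eu-2")], "event") is None
```

Under these assumptions, mitigating this failure mode requires validating responses (or health-checking end to end), not just catching connection errors.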
Update - 10 February 2020
Our cloud supplier has now confirmed that we suffered a service outage in their messaging service. They have identified an internal component which failed to work properly after an internal security token expired.
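As a generic illustration of this failure class (we do not know the internals of the vendor’s component), the sketch below contrasts a client that renews a cached security token on expiry with one that silently keeps using the expired token.

```python
import time

class TokenClient:
    """Hypothetical component that caches a security token with a TTL.
    With renew_before_expiry=False it models the failure described above:
    the cached token expires and subsequent calls begin to fail."""

    def __init__(self, ttl_seconds, renew_before_expiry=True, clock=time.time):
        self.ttl = ttl_seconds
        self.renew = renew_before_expiry
        self.clock = clock  # injectable clock so expiry is testable
        self.token, self.expires_at = self._issue()

    def _issue(self):
        now = self.clock()
        return f"token-{int(now)}", now + self.ttl

    def call(self):
        if self.clock() >= self.expires_at:
            if not self.renew:
                # The failure mode: the component keeps the stale token
                # and requests fail once it has expired.
                raise PermissionError("security token expired")
            self.token, self.expires_at = self._issue()
        return "ok"
```

Injecting the clock makes the expiry path testable without waiting in real time, which is one way to catch this class of bug before it surfaces in production.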