Intermittent Access to Engagement Cloud - All Regions
Incident Report for dotdigital
Postmortem

RCA: Intermittent Access to Engagement Cloud

Summary of impact:

At approximately 13:00 BST on Monday 24th June 2019, websites for all Engagement Cloud regions became inaccessible from various regions of the Internet.

Root cause:

Our engineers soon isolated the problem to Cloudflare, which appeared to be unreachable. Cloudflare are a popular security and CDN provider who sit in front of our websites. They also noticed the disruption and made an announcement on their status page.

Cloudflare identified the cause of the outage as a BGP leak, which was triggered by a series of misconfiguration events at two US-based companies, one of which was Verizon. BGP, or Border Gateway Protocol, is the mechanism networks use to tell each other which addresses they can reach. It’s this exchange of routing information which enables individual networks to join to form the Internet. If a bad BGP configuration is released, it’s possible for traffic intended for one network to be sent to another. In this case data intended for Cloudflare’s network was being routed to a small US company who couldn’t handle the increased load. The same problem affected many other Internet services, including Amazon. Cloudflare have posted more information about this incident here.
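
As an illustration of how a leaked route can pull traffic away from its intended network, the sketch below models one common mechanism: routers prefer the most specific (longest) matching prefix, so a leaked announcement for a smaller slice of an address range wins over the legitimate, broader one. This is a simplified model, not a description of the exact routes involved in this incident. The prefixes are reserved documentation ranges, and the names "cdn-network" and "small-network" are hypothetical placeholders.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, next_hop) pair with the longest prefix matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific route wins.
    return max(matches, key=lambda m: m[0].prefixlen)


# Hypothetical legitimate announcement (documentation prefix, not a real CDN range).
legitimate = [(ipaddress.ip_network("198.51.100.0/24"), "cdn-network")]

# Hypothetical leaked announcements: two more-specific halves of the same range.
leaked = [
    (ipaddress.ip_network("198.51.100.0/25"), "small-network"),
    (ipaddress.ip_network("198.51.100.128/25"), "small-network"),
]

before = best_route(legitimate, "198.51.100.10")
after = best_route(legitimate + leaked, "198.51.100.10")

print(before[1])  # cdn-network
print(after[1])   # small-network
```

Once the leaked routes are withdrawn, the broader legitimate route is the only match again and traffic returns to its normal path.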

Mitigation:

At 13:40 BST traffic to Cloudflare and our platforms was restored after the network responsible corrected their mistake.

BGP problems are extremely difficult to defend against. BGP was designed on the assumption that networks can trust each other's announcements, which allows a single network owner to potentially cause disruption for other networks. We rely upon the larger Internet carriers to monitor for and reject these misconfigurations, and in this case that did not happen.
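
To give a sense of the kind of check a carrier could apply, the sketch below rejects announcements that contradict a list of authorised routes, loosely modelled on route origin validation. It is a minimal illustration under stated assumptions, not the filtering any particular carrier runs. The AUTHORISED table, the accept function, the documentation prefix and the private-use AS numbers are all hypothetical.

```python
import ipaddress

# Hypothetical authorisations: (covering prefix, maximum allowed prefix length, origin AS).
AUTHORISED = [
    (ipaddress.ip_network("198.51.100.0/24"), 24, 64500),
]


def accept(prefix, origin_as):
    """Return True if an announcement should be accepted under this sketch's policy."""
    net = ipaddress.ip_network(prefix)
    covering = [entry for entry in AUTHORISED if net.subnet_of(entry[0])]
    if not covering:
        # No record covers this prefix, so its status is unknown.
        # This sketch accepts it; a stricter policy could reject it.
        return True
    # A record covers the prefix: the length and origin must both agree with it.
    return any(
        net.prefixlen <= max_len and origin_as == asn
        for _, max_len, asn in covering
    )


print(accept("198.51.100.0/24", 64500))  # True: matches the authorisation
print(accept("198.51.100.0/25", 64500))  # False: more specific than allowed
print(accept("198.51.100.0/24", 64501))  # False: unexpected origin network
```

A check like this only helps if the networks in the path actually apply it, which is why the behaviour of the larger carriers matters so much.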

Posted Jun 27, 2019 - 13:02 BST

Resolved
This issue is now fully resolved.
The root cause was a network configuration error at one of the providers used by our 3rd party CDN (Content Distribution Network) provider, Cloudflare.
It affected a number of websites on the internet, not just our services.
We apologise for any delay experienced.
Posted Jun 24, 2019 - 14:05 BST
Monitoring
Our 3rd party provider has identified and fixed the problem. Our preliminary monitoring shows that service has been restored. We will continue to monitor and will close this incident once we're confident the fix is holding.
Posted Jun 24, 2019 - 13:44 BST
Identified
One of our 3rd party providers, Cloudflare, is currently experiencing intermittent issues.
We are communicating with them to establish when the issue will be fixed.
A further update will be provided in due course.
Posted Jun 24, 2019 - 13:15 BST
This incident affected: Europe - Engagement Cloud r1 (Europe - Web Application, Europe - API, Europe - Open and Link Tracking, Europe - Reporting, Europe - Surveys and Forms, Europe - Landing Pages), North America - Engagement Cloud r2 (North America - Web Application, North America - API, North America - Open and Link Tracking, North America - Reporting, North America - Surveys and Forms, North America - Landing Pages), Asia Pacific - Engagement Cloud r3 (Asia Pacific - Web Application, Asia Pacific - API, Asia Pacific - Open and Link Tracking, Asia Pacific - Reporting, Asia Pacific - Surveys and Forms, Asia Pacific - Landing Pages), Global - Website, Global - Login Page, Global - Image Hosting, and Global CPaaS (Portal, SMS Portal).