Engagement Cloud - Europe API down
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 16:21 BST on Monday 16th September 2019, the Engagement Cloud API in our European region went offline. All API requests failed until service was restored at 17:19 BST.

Root cause:

The outage was caused by a server configuration change that was rolled out to our API servers. Whilst the update made the desired change, it also removed web server settings that were thought to be redundant; these missing settings took the API offline.

Mitigation:

After a review of the current and previous configuration files, the missing settings were restored, which brought the API back online.
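
As an illustration only, a comparison of a previous and a current configuration file can be sketched as below. This is a minimal sketch, not the tooling used during the incident: it assumes a simple `key = value` file format, and the setting names (`listen_port`, `health_check_path`, `request_timeout`) are hypothetical.

```python
def parse_settings(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments.

    The file format is an assumption made for this sketch.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def removed_settings(previous_text, current_text):
    """Return the settings present in the previous config but absent from the current one."""
    previous = parse_settings(previous_text)
    current = parse_settings(current_text)
    return {key: value for key, value in previous.items() if key not in current}


# Hypothetical configuration contents, for illustration only.
previous_config = """
# previous version
listen_port = 8080
health_check_path = /status
request_timeout = 30
"""

current_config = """
# current version
listen_port = 8080
request_timeout = 60
"""

print(removed_settings(previous_config, current_config))
```

Settings whose values merely changed (such as `request_timeout` here) are not reported; only keys that disappeared entirely are listed, since those are the kind of change that caused this outage.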

Next steps:

The configuration update was successfully applied to our staging environments and our production Australian and US instances. However, our production European instance was configured uniquely to support a requirement which has since become redundant. This legacy configuration was removed via the update, but it was necessary in order to receive traffic from our load balancer. This dependency was difficult to identify and is unique to our European region.

The update was applied following our standard procedures; however, testing was invalidated because our staging environments do not replicate the configuration used in production. This weakness has already been identified, and plans have been made to use the same cloud provider throughout our production and staging instances.
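
A parity check between environments is one way a region-specific setting could be surfaced before an update is rolled out. The following is a hedged sketch under stated assumptions: each environment's configuration is represented as a plain dictionary, and the environment and setting names (`staging`, `production-eu`, `production-us`, `legacy_lb_binding`) are hypothetical, not taken from our systems.

```python
def config_drift(environments):
    """Return {setting: {environment: value}} for settings that differ between environments.

    A setting missing from an environment is reported with the value None.
    """
    all_keys = set()
    for settings in environments.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {env: settings.get(key) for env, settings in environments.items()}
        # More than one distinct value means the environments disagree on this setting.
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical per-environment settings, for illustration only.
environments = {
    "staging": {"listen_port": "8080"},
    "production-eu": {"listen_port": "8080", "legacy_lb_binding": "enabled"},
    "production-us": {"listen_port": "8080"},
}

for setting, values in config_drift(environments).items():
    print(setting, values)
```

Any setting reported by such a check exists in, or differs for, only some environments, so a test pass in staging would not cover it.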

Posted Sep 17, 2019 - 11:03 BST

Resolved
The API issue has now been resolved.
We apologise for the inconvenience and thank you for your patience.
A full RCA will be available in the coming days.
Posted Sep 16, 2019 - 17:33 BST
Monitoring
Our teams have applied the fix. The API is now fully operational.
We will continue to monitor the situation for the next few minutes to ensure stability.
Posted Sep 16, 2019 - 17:29 BST
Identified
We have identified the issue and our teams are currently working on applying the fix.
Thanks for your continued patience.
Posted Sep 16, 2019 - 17:25 BST
Update
Thanks for your patience, folks. We’re making good headway on resolving the issue with our API (Europe region). We’ll be back with another update very soon.
Posted Sep 16, 2019 - 17:18 BST
Investigating
We are currently investigating an issue with the Engagement Cloud API in Europe.
Please note that this does not affect the CPaaS API, CPaaS SMS API or CPaaS Inbox API.
Updates will follow in due course.
We apologise for the inconvenience.
Posted Sep 16, 2019 - 16:43 BST
This incident affected: Europe - Engagement Cloud r1 (Europe - API).