API in R3 region is down
Incident Report for dotdigital
Postmortem

RCA - Engagement Cloud - APAC API Down

Summary of impact:

At approximately 14:40 BST, Thursday 24th October 2019 (00:40 AEDT, Friday 25th October 2019), the APAC Engagement Cloud API service went offline. All subsequent API requests failed until service was restored at 15:10 BST (01:10 AEDT, Friday 25th October 2019). All other global and regional services were unaffected.

Root Cause:

As part of our continuous platform reviews and seasonal preparations, additional APAC API resources were being deployed ahead of forecast usage.

Whilst the update process successfully increased the available API resources, it also assigned a new IP address to the API load balancer. During the IP swap, the load balancer terminated all internet traffic to the API service.

The incorrect IP assignment was possible because our deployment process has two modes. It can either deploy additional resources to an existing application or, in a disaster recovery scenario, launch a new installation, which assigns a new IP address. Human error led to the wrong mode being applied.

Mitigation:

The fault was identified before the update completed, so we were able to revert the load balancer to its original configuration immediately.

Next Steps:

We will remove the requirement for our engineers to input parameters to this procedure. Instead, the existing infrastructure will dictate the safest mode of operation.
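
As an illustration only, the idea of letting existing infrastructure choose the deployment mode could look something like the sketch below. It is not our actual tooling: the names `DeployMode`, `LoadBalancerState` and `choose_mode`, and the rule "a load balancer that already has an IP address means scale in place", are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeployMode(Enum):
    # Add resources to an existing application and keep the current IP.
    SCALE_EXISTING = "scale_existing"
    # Disaster recovery: launch a new installation, which assigns a new IP.
    NEW_INSTALLATION = "new_installation"


@dataclass(frozen=True)
class LoadBalancerState:
    """Hypothetical snapshot of what is currently deployed."""
    ip_address: Optional[str]  # None when no IP has been assigned yet
    healthy_backends: int


def choose_mode(state: Optional[LoadBalancerState]) -> DeployMode:
    """Derive the mode from infrastructure state, not from operator input."""
    # Assumption: any load balancer that already holds an IP address is live,
    # so the only safe action is to scale it without touching the address.
    if state is not None and state.ip_address is not None:
        return DeployMode.SCALE_EXISTING
    # Nothing is deployed (or no IP is assigned), so a new installation
    # cannot interrupt existing traffic.
    return DeployMode.NEW_INSTALLATION


if __name__ == "__main__":
    live = LoadBalancerState(ip_address="203.0.113.10", healthy_backends=4)
    print(choose_mode(live).name)  # SCALE_EXISTING
    print(choose_mode(None).name)  # NEW_INSTALLATION
```

Because the engineer never supplies the mode, the new-installation path cannot be selected by mistake while a load balancer is serving traffic.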

Posted Oct 28, 2019 - 14:17 GMT

Resolved
The problem has been found and fixed. Our API is fully functional again. We apologise for the inconvenience this has caused.
Posted Oct 24, 2019 - 15:15 BST
Investigating
Our API in the R3 region is currently down, and we are investigating this as a priority. We will post further updates shortly.
Posted Oct 24, 2019 - 15:10 BST
This incident affected: Asia Pacific - Engagement Cloud r3 (Asia Pacific - API).