Issues with Engagement Cloud (in Europe)
Incident Report for dotdigital
Postmortem

Summary of Impact

At approximately 12:10 UTC on 13th October 2020, we experienced delays and failures with a number of services, including campaign sends, in our European region (Region 1). We restored normal service at 12:45 UTC the same day, but a number of services experienced delays until 15:55 UTC. Some campaign sends didn’t complete successfully. We have contacted the affected customers directly to advise that they’ll need to re-issue their sends.

Root Cause

This issue was caused by a misconfiguration in the messaging system, which was introduced during a change to normalize our message queue addresses. As a result, sent messages were either stored incorrectly or failed to deliver, which caused the corresponding services to fail.

Mitigation

The timeline for resolving the issue was (times stated in UTC):

12:10: We changed our messaging system configuration. Sent messages were stored, but were unavailable to receiving services

12:20: We changed the messaging system configuration again. Message sending stopped working

12:33: We identified that parts of our infrastructure were unavailable

12:35: We identified that the issue was related to the messaging system

12:45: We reverted our messaging system configuration to the original value

15:55: The incorrectly stored messages were re-delivered, but some customers’ sends needed to be restarted or re-issued. Our Support Team proactively reached out to affected customers.

Next Steps

In our next major platform release, we’ll include a change to prevent incorrect configuration from being applied.
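
As a rough illustration only, a guard of this kind could check a proposed configuration before it is applied and reject it if senders and receivers would no longer agree on a queue address. The sketch below is not our actual implementation: the function names, the service name and the address format are all hypothetical and chosen purely for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical sketch: none of these names or address formats come from
# the real platform. Each dict maps a service name to a queue address.


def validate_change(proposed: Dict[str, str], consumers: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means no mismatch was found.

    proposed  -- address each sending service would publish to after the change
    consumers -- address each receiving service currently listens on
    """
    problems = []
    for service, listen_address in consumers.items():
        send_address = proposed.get(service)
        if send_address is None:
            problems.append(f"{service}: no send address configured")
        elif send_address != listen_address:
            problems.append(
                f"{service}: sender would use {send_address!r} "
                f"but receiver listens on {listen_address!r}"
            )
    return problems


def apply_change(
    current: Dict[str, str],
    proposed: Dict[str, str],
    consumers: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Apply the proposed configuration only if validation finds no problems.

    Returns the configuration now in effect and any problems found.
    """
    problems = validate_change(proposed, consumers)
    if problems:
        # Reject the change and keep the existing configuration in place.
        return current, problems
    return dict(proposed), []
```

In this sketch, a change that alters the address on the sending side alone is rejected and the existing configuration stays in effect, so messages are never published to an address that no receiving service is reading from.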

Posted Oct 14, 2020 - 14:18 BST

Resolved
We experienced delays and failures with a number of services, including campaign sends in our European region (Region 1).
Posted Oct 13, 2020 - 13:00 BST