Issues with Engagement Cloud (in North America)
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 15:45 UTC on 14 June 2021, we saw intermittent errors when applications attempted to resolve hostnames via DNS. DNS lookups became reliable again at 23:39 UTC.

During this period, services remained online and the majority of workloads completed successfully. However, some customers may have noticed intermittent errors across all aspects of their account, including the API, sends, imports and Programs.

Root Cause:

After investigation, we found DNS errors were occurring in applications hosted on a particular server. We’re continuing our work to understand why this server developed a fault. As all our applications are hosted on multiple servers, all other instances continued to work as normal.

Mitigation:

Our timeline for rectifying this issue was (times stated in UTC):

  • 21:23: We discovered a series of campaign sends failing to complete successfully.
  • 21:29: We started our investigation and narrowed the problem to DNS resolution.
  • 22:09: We increased the capacity of our DNS servers, as metrics showed some unusual workload patterns. Afterwards, we performed rolling restarts of all applications to clear their DNS caches. However, the errors continued.
  • 23:30 (approximately): We correlated the applications experiencing DNS errors to a single server. By 23:39, all applications running on the faulty server had been migrated. At this point, customers would’ve stopped seeing errors and normal service resumed.
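
For illustration only, the correlation step in the timeline (tying DNS errors to a single server) can be sketched as a per-host error-rate calculation. This is not the tooling we used during the incident. The record shape, the host names (node-a, node-b, node-c) and the 5% threshold are all assumptions made for the example.

```python
from collections import Counter

def error_rate_by_host(records):
    """Return {host: DNS error rate} from (host, is_dns_error) pairs.

    The (host, is_dns_error) record shape is hypothetical; real log
    records would need to be parsed into this form first.
    """
    totals = Counter()
    errors = Counter()
    for host, is_dns_error in records:
        totals[host] += 1
        if is_dns_error:
            errors[host] += 1
    return {host: errors[host] / totals[host] for host in totals}

def suspect_hosts(rates, threshold=0.05):
    """List hosts whose error rate is at or above the (assumed) threshold."""
    return sorted(host for host, rate in rates.items() if rate >= threshold)

# Made-up sample data: two healthy hosts and one host with failing lookups.
sample = (
    [("node-a", False)] * 100
    + [("node-b", False)] * 100
    + [("node-c", False)] * 70
    + [("node-c", True)] * 30
)

rates = error_rate_by_host(sample)
print(suspect_hosts(rates))
```

Grouping by host only works when each log record carries the host it came from, which is the gap the first action item under Next Steps addresses.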

Next Steps:

Our applications are hosted on a cloud-native service which aims to provide a reliable hosting environment. However, during this incident a component in this environment failed to work as expected. In conjunction with our cloud supplier, we’ll continue to investigate the events leading to the server failure. Our current action items are:

  • Look to include the host metadata in our application logs. This would allow us to monitor the error rates per node and better understand the health of each node.
  • Ensure that all services that need to resolve external hostnames use our approved list of external DNS resolvers. Although this did not directly contribute to the incident, it helps to reduce DNS pressure on our core Active Directory DNS service.
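
As a minimal sketch of the first action item, assuming an application that uses Python's standard logging module, a filter can stamp every log record with the name of the host that produced it. The class name HostMetadataFilter, the helper build_logger and the host=... log format are hypothetical choices for this example and do not describe our actual logging setup.

```python
import io
import logging
import socket

class HostMetadataFilter(logging.Filter):
    """Attach the originating host name to every log record."""

    def __init__(self, host=None):
        super().__init__()
        # Fall back to the machine's own host name when none is supplied.
        self.host = host or socket.gethostname()

    def filter(self, record):
        record.host = self.host
        return True  # never drop the record, only annotate it

def build_logger(name, stream, host=None):
    """Create a logger whose output lines include a host=... field."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s host=%(host)s %(levelname)s %(message)s")
    )
    handler.addFilter(HostMetadataFilter(host))
    logger.addHandler(handler)
    return logger

# Usage: write to an in-memory stream so the output can be inspected.
stream = io.StringIO()
log = build_logger("sends", stream, host="node-c")
log.error("DNS lookup failed")
print(stream.getvalue().strip())
```

With a host field on every line, error counts can be grouped per node in whatever log aggregation system is in use.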
Posted Jun 15, 2021 - 16:23 BST

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved.

We’ll write a detailed report for the issue we’ve experienced today. It’ll be posted on here as soon as it’s ready (in a day or two). Apologies about today’s mishap.
Posted Jun 15, 2021 - 01:28 BST
Monitoring
We applied a fix a few minutes ago. The great news is we're now seeing immediate and sustained improvements.

Thanks for your patience while we worked on this problem.

We'll monitor the situation from here and post a final update once we're totally satisfied it's resolved.
Posted Jun 15, 2021 - 01:01 BST
Identified
Our Tech Team has identified the cause to be DNS related, and we’re now working on a resolution. We’ll post another update in a few moments.
Posted Jun 15, 2021 - 00:45 BST
Update
We are continuing to investigate ongoing intermittent errors in our North American region. We will provide more information as soon as it is available.
Posted Jun 14, 2021 - 23:48 BST
Investigating
We're receiving some reports from customers experiencing some intermittent errors with our North American region. We're investigating it now, and we'll post an update as soon as we can.
Posted Jun 14, 2021 - 23:03 BST
This incident affected: North America - Dotdigital R2 (North America - Web Application, North America - API, North America - Mail Sending, North America - Open and Link Tracking, North America - Reporting, North America - Surveys and Forms, North America - Pages and Forms, North America - Transactional Email, North America - Contact Imports, North America - SMS, North America - Integration Hub).