Missing Pages and Forms
Incident Report for Dotdigital
Postmortem

Summary of impact:

We deployed a change to components responsible for transferring messages between Dotdigital services as part of our fortnightly release cycle. This release began at 06:31 UTC on 13 October 2021 and was completed at 07:01 UTC. During this incident, customers may have experienced the following issues:

  • Campaign tracking events, web behavior tracking data and inbound mail may have failed to be processed.
  • Saving campaigns and pages may have failed.
  • Programs may have failed to run during the incident, but were automatically rescheduled to a later time.
  • Newly published pages and forms were inaccessible during the incident, but became accessible after the issues were resolved.

Root Cause:

During our fortnightly release, we introduced a change to a component responsible for communicating with our cloud provider’s messaging services. As a result, certain entities used in this process were no longer cached, which caused excessive resource usage and throttling.
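To illustrate why losing a cache can lead to throttling, here is a minimal sketch. It is not Dotdigital's code: the report does not name the cloud provider or the entities involved, so `FakeMessagingClient`, `create_sender`, `SenderCache` and the queue name `tracking-events` are all hypothetical stand-ins. The sketch only counts how many provider calls are made with and without reuse.

```python
class FakeMessagingClient:
    """Hypothetical stand-in for a cloud messaging SDK client."""

    def __init__(self):
        self.provider_calls = 0

    def create_sender(self, queue_name):
        # Simulates an expensive, rate-limited call to the provider.
        self.provider_calls += 1
        return {"queue": queue_name}


class SenderCache:
    """Creates one sender per queue and reuses it afterwards."""

    def __init__(self, client):
        self._client = client
        self._senders = {}

    def get(self, queue_name):
        if queue_name not in self._senders:
            self._senders[queue_name] = self._client.create_sender(queue_name)
        return self._senders[queue_name]


def provider_calls(messages, use_cache):
    """Return how many provider calls are needed to send `messages` messages."""
    client = FakeMessagingClient()
    cache = SenderCache(client)
    for _ in range(messages):
        if use_cache:
            cache.get("tracking-events")
        else:
            # Assumed regression: a fresh sender is created for every message.
            client.create_sender("tracking-events")
    return client.provider_calls
```

Under these assumptions, `provider_calls(1000, use_cache=True)` makes a single provider call, while `provider_calls(1000, use_cache=False)` makes one call per message. Both versions are logically correct, but only the second grows with production load.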

Mitigation:

A summarized timeline of events (all times in UTC):

  • 07:07: We were notified about the increased error rate in the services responsible for campaign tracking and reporting. We rolled back the affected services.
  • 07:26: We spotted the same issue with the services processing inbound mail and rolled them back (in Region 1).
  • 08:37: We rolled back content processing, web behavior tracking and program processing services as they experienced the same issue.
  • 10:24: Customers reported being unable to reach newly published pages and forms.
  • 11:23: We traced the issue to the earlier rollback of the campaign reporting services.
  • 14:20: We deployed a fix to the campaign reporting and content processing services, which made newly published pages and forms accessible again.
  • 14:48: We patched the remaining Dotdigital services and the issue was fully resolved.

Next Steps:

Whilst the change that introduced this problem was logically correct, it didn’t perform adequately under production load. These types of problems are difficult to identify, but we’ll update our code review process to look for similar situations in future.

In addition, our strategy for versioning data held in cache caused a problem when some components of the system were running different versions. We’ll look for ways to improve this so that components can be restored to a previous version without knock-on impacts.
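As a rough illustration of how versioned cache data can interact badly with a partial rollback, consider the sketch below. The report does not describe the actual key scheme, so the `v{version}:page:{id}` key format, the `VersionedCache` class and the `compatible` fallback are assumptions made purely for this example, not a description of the real system or of the fix that was deployed.

```python
class VersionedCache:
    """Hypothetical cache whose keys embed the writer's data version."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(version, page_id):
        return f"v{version}:page:{page_id}"

    def put(self, version, page_id, content):
        self._store[self._key(version, page_id)] = content

    def get_strict(self, version, page_id):
        # Only sees entries written under exactly the reader's own version.
        return self._store.get(self._key(version, page_id))

    def get_tolerant(self, version, page_id, compatible=()):
        # Tries the reader's own version first, then any versions it has
        # declared it can also understand.
        for candidate in (version, *compatible):
            content = self._store.get(self._key(candidate, page_id))
            if content is not None:
                return content
        return None


cache = VersionedCache()
cache.put(2, "welcome", "<h1>Hello</h1>")       # publisher on the new release
missing = cache.get_strict(1, "welcome")        # rolled-back reader: None
found = cache.get_tolerant(1, "welcome", (2,))  # reader that accepts v2 data
```

In this sketch a component rolled back to version 1 cannot see a page published under version 2 when it uses the strict lookup, whereas a reader that accepts a declared set of compatible versions still finds it. Whether such a fallback is safe depends on the data formats actually being compatible.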

Posted Oct 15, 2021 - 15:12 BST

Resolved
Everything is back on track now.
We’ll write a detailed report on the issue we experienced today. It’ll be posted here as soon as it’s ready (in a day or two).
Apologies about today’s mishap.
Posted Oct 13, 2021 - 15:45 BST
Monitoring
Good news: we’ve released a fix for the issue, and are seeing rapid improvements. Thanks for sticking with us while we sorted everything. We’ll keep a close eye on things and will post a final confirmation shortly when it’s 100% resolved.
Posted Oct 13, 2021 - 15:23 BST
Identified
Thanks for your patience.
We’re making some real progress on resolving the issue. Customers may be experiencing delays of up to 1 hour when publishing pages and forms. We’ll be back with another update very soon.
Posted Oct 13, 2021 - 12:41 BST
Update
We are continuing to investigate this issue.
Posted Oct 13, 2021 - 11:37 BST
Investigating
Hey, folks. We're receiving some reports from customers experiencing issues with Pages and Forms. We're investigating it now, and we'll post an update as soon as we can.
Posted Oct 13, 2021 - 11:36 BST
This incident affected: Europe - Dotdigital R1 (Europe - Surveys and Forms), Asia Pacific - Dotdigital R3 (Asia Pacific - Surveys and Forms), and North America - Dotdigital R2 (North America - Pages and Forms).