Delays to some features
Incident Report for dotdigital
Postmortem

Summary of Impact

At 07:52 UTC on Wednesday 29th July 2020, we were alerted to a growing backlog of work in a background service. Shortly afterwards, we diagnosed that a group of services was not running. At 10:15 UTC, we deployed a fix for the issue and most workloads caught up quickly.

Customers may have experienced delays in some features executing, including:

  • Product Recommendations
  • Web Behavior Tracking
  • Salesforce Syncs
  • Email Sends containing coupon codes.

Root Cause

We made alterations to the framework we use to run our services in our fortnightly release. These changes included a section of code we use to declare high availability pairs of our services. Services that can't be run in a distributed manner acquire a lock, so that only one instance is active at a time. The change inadvertently set the name used to acquire this lock to the same value for all services. As a result, every service attempted to acquire a lock with the same name. Only one service succeeded, and all the others entered a passive state.
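
The failure mode can be illustrated with a minimal sketch. It does not reproduce the actual framework: the `LockRegistry` class, the `DEFAULT_LOCK_NAME` value and the service names are all hypothetical, chosen only to show how a shared lock name leaves one service active and the rest passive.

```python
class LockRegistry:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self):
        self._owners = {}

    def try_acquire(self, lock_name, owner):
        # The first caller for a given name wins; later callers are refused.
        return self._owners.setdefault(lock_name, owner) == owner


DEFAULT_LOCK_NAME = "service-lock"  # hypothetical shared default


def start_services(service_names, registry, lock_name_for):
    """Return a mapping of service name to 'active' or 'passive'."""
    states = {}
    for name in service_names:
        acquired = registry.try_acquire(lock_name_for(name), owner=name)
        states[name] = "active" if acquired else "passive"
    return states


# Hypothetical service names, for illustration only.
services = ["recommendations", "web-tracking", "salesforce-sync"]

# Faulty behaviour: every service asks for the same lock name.
broken = start_services(services, LockRegistry(), lambda name: DEFAULT_LOCK_NAME)

# Intended behaviour: each service asks for a lock named after itself.
intended = start_services(services, LockRegistry(), lambda name: f"{name}-lock")
```

In this sketch, `broken` marks only the first service as active, while `intended` marks all three as active because their lock names no longer collide.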

Mitigation

Once the issue had been identified, we removed the default lock name and added code to read the correct value from a configuration file.
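
A fix of this shape could look roughly like the sketch below. It assumes an INI-style configuration with one section per service and a `lock_name` key; the file layout, key name and function names are hypothetical and not taken from the real system. The point it shows is that there is no fallback default: a missing lock name is an error, and duplicate lock names are rejected.

```python
import configparser

# Hypothetical configuration: one section per service, each with its own lock name.
EXAMPLE_CONFIG = """
[recommendations]
lock_name = recommendations-lock

[salesforce-sync]
lock_name = salesforce-sync-lock
"""


def load_lock_name(config_text, service_name):
    """Read the lock name for one service, failing loudly if it is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_option(service_name, "lock_name"):
        # No silent default: a shared fallback name is what caused the collision.
        raise KeyError(f"no lock_name configured for {service_name}")
    return parser.get(service_name, "lock_name")


def check_unique_lock_names(config_text):
    """Raise ValueError if two services are configured with the same lock name."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    seen = {}
    for section in parser.sections():
        lock_name = parser.get(section, "lock_name")
        if lock_name in seen:
            raise ValueError(
                f"{section} and {seen[lock_name]} share lock name {lock_name}"
            )
        seen[lock_name] = section
```

A check like `check_unique_lock_names` could also run in automated tests, so that a repeated lock name is caught before release.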

Next Steps

In order to minimize the risk of further issues like this, we’ve:

  • Improved our automated testing to cover this specific problem
  • Planned the removal of this process-based locking.
Posted Jul 29, 2020 - 14:48 BST

Resolved
Following the fix we released earlier, everything is back to normal now. We’ll write a detailed description of what caused the issue. Check back here in a day or two for the report. We’re sorry for the interruption to your day.
Posted Jul 29, 2020 - 11:23 BST
Identified
Our teams have isolated the issue on our side, and we’re now taking care of the problem. Thanks for your patience, everyone. Regular service will resume again very soon.
Posted Jul 29, 2020 - 10:46 BST
Investigating
We're investigating an issue with delays to some features. This includes coupon codes, chat notifications, program statistics and Salesforce v1 imports. Sorry if you're affected; our Tech Team are working as quickly as possible to resolve the issue and get things back to normal. We'll share another update shortly.
Posted Jul 29, 2020 - 10:31 BST