At 07:52 UTC on Wednesday 29th July 2020, we were alerted to a growing backlog of work in a background service. Shortly afterwards, we diagnosed that a group of services was not running. At 10:15 UTC, we deployed a fix for the issue and most workloads caught up quickly.
Customers may have experienced delays in the execution of some features, including:
As part of our fortnightly release, we made changes to the framework we use to run our services. These changes included the code we use to declare high-availability pairs of our services: services that can't run in a distributed manner acquire a lock, and only the lock holder runs as active. The change inadvertently set the name used to acquire this lock to the same default value for all services, so every service attempted to acquire the lock under the same name. Only a single service succeeded, causing all other services to enter a passive state.
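To make the failure mode concrete, here is a minimal sketch (not our actual framework; the names `LockRegistry`, `start_service`, and the service names are illustrative) of how a shared default lock name leaves only one service active:

```python
# Minimal simulation of the bug: every service contends for the same lock name.
import threading


class LockRegistry:
    """Stands in for the distributed lock store; grants each named lock once."""

    def __init__(self):
        self._held = set()
        self._mutex = threading.Lock()

    def try_acquire(self, name: str) -> bool:
        with self._mutex:
            if name in self._held:
                return False
            self._held.add(name)
            return True


DEFAULT_LOCK_NAME = "service-lock"  # the bug: one default shared by every service


def start_service(registry: LockRegistry, service_name: str,
                  lock_name: str = DEFAULT_LOCK_NAME) -> None:
    if registry.try_acquire(lock_name):
        print(f"{service_name}: acquired '{lock_name}', running as active")
    else:
        print(f"{service_name}: lock '{lock_name}' already held, entering passive state")


registry = LockRegistry()
for svc in ["billing-worker", "email-worker", "report-worker"]:
    start_service(registry, svc)  # all three use the same default lock name
# Only the first service becomes active; the rest sit passive and do no work.
```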
Once the issue had been identified, we removed the default lock name and added code to read the correct value from each service's configuration file.
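A hedged sketch of that fix, assuming a JSON configuration file with an `ha_lock_name` key (both the file layout and the key name are illustrative): the lock name must be set per service, and start-up fails loudly instead of falling back to a shared default.

```python
# Read the per-service lock name from configuration; no implicit default.
import json


def load_lock_name(config_path: str) -> str:
    with open(config_path) as f:
        config = json.load(f)
    lock_name = config.get("ha_lock_name")
    if not lock_name:
        raise ValueError(f"{config_path}: 'ha_lock_name' must be set for this service")
    return lock_name


# e.g. billing-worker/config.json contains {"ha_lock_name": "billing-worker-lock"}
# lock_name = load_lock_name("billing-worker/config.json")
# start_service(registry, "billing-worker", lock_name)
```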
In order to minimize the risk of further issues like this, we’ve: