Summary of impact:
We deployed a change to components responsible for transferring messages between Dotdigital services as part of our fortnightly release cycle. This release began at 06:31 UTC on 13 October 2021 and was completed at 07:01 UTC. During this incident, customers may have experienced the following issues:
During our fortnightly release, we introduced a change to a component responsible for communicating with our cloud provider’s messaging services. As a result, certain entities used in this process were no longer cached, which caused excessive resource usage and throttling.
A summarized timeline of events (all times in UTC):
Whilst the change that introduced this problem was logically correct, it did not perform adequately under production load. These types of problems are difficult to identify, but we’ll update our code review process to look for similar situations in future.
In addition, our strategy for versioning data held in cache caused a problem when some components of the system were running different versions. We’ll look for ways to improve this so components can be restored to a previous version without knock-on impacts.
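One possible way to keep mixed versions from interfering with each other is to include the data format version in every cache key. The sketch below is an assumption for illustration only, not a description of what Dotdigital runs or plans to adopt; `VersionedCache`, `versioned_key` and the key `queue:orders` are hypothetical names, and a plain dictionary stands in for the shared cache.

```python
from typing import Dict, Optional


def versioned_key(schema_version: int, key: str) -> str:
    """Prefix the key with the data format version of the writer."""
    return f"v{schema_version}:{key}"


class VersionedCache:
    """Reads and writes only entries that match its own format version."""

    def __init__(self, store: Dict[str, str], schema_version: int) -> None:
        self._store = store
        self._version = schema_version

    def put(self, key: str, value: str) -> None:
        self._store[versioned_key(self._version, key)] = value

    def get(self, key: str) -> Optional[str]:
        return self._store.get(versioned_key(self._version, key))


# A plain dict stands in for the cache shared by all components.
shared: Dict[str, str] = {}
old_component = VersionedCache(shared, schema_version=1)
new_component = VersionedCache(shared, schema_version=2)

new_component.put("queue:orders", "new-format")

# The older component sees a cache miss instead of data it cannot read.
old_result = old_component.get("queue:orders")
new_result = new_component.get("queue:orders")
```

The trade-off in this sketch is that each version fills its own entries, so a rollback starts with a cold cache for the restored version rather than reading entries written in a newer format.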