Intermittent errors accessing R1 API and other services
Incident Report for dotdigital
Postmortem

Summary of impact:

At approximately 14:50 UTC on Monday 27th July 2020, one of our primary database servers began experiencing severe blocking on one of our client databases, which caused application timeouts and errors. We restored services at 16:00 UTC the same day. Some customers may have experienced:

  • API timeouts.
  • Some pages hanging or being slow to load.
  • Timeout messages/errors on various Engagement Cloud pages.

Root Cause:

A maintenance process ran on one of our largest client databases. One of the tables within this database was not configured correctly, so when the maintenance process ran it attempted to sort/order half a billion rows of data. To do this, our back-end database system had to block access to the table while it carried out the maintenance. This had the knock-on effect of blocking some of our most important tables.

Mitigation:

We located the process on the server that was causing the blockage and cancelled the maintenance process at 15:10 UTC. We then had to wait for our back-end database system to roll back the changes so that the table was left in a consistent state. This took approximately 50 minutes.
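The postmortem gives its times in UTC, while the status updates further down are stamped in BST (UTC+1). As a rough cross-check, the sketch below recomputes the timeline from the figures quoted in this report. The variable names are illustrative only, and the 50-minute rollback is the approximate figure stated above.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
BST = timezone(timedelta(hours=1), "BST")  # British Summer Time is UTC+1

# Times quoted in the postmortem (UTC).
impact_start = datetime(2020, 7, 27, 14, 50, tzinfo=UTC)
maintenance_cancelled = datetime(2020, 7, 27, 15, 10, tzinfo=UTC)
rollback_duration = timedelta(minutes=50)  # approximate, per the report

# Cancellation plus rollback should land on the stated restore time.
restored = maintenance_cancelled + rollback_duration
total_impact = restored - impact_start

# The first status update is stamped 16:20 BST; convert it to UTC.
first_update_utc = datetime(2020, 7, 27, 16, 20, tzinfo=BST).astimezone(UTC)

print("Restored (UTC):", restored.strftime("%H:%M"))
print("Total impact:", total_impact)
print("First update (UTC):", first_update_utc.strftime("%H:%M"))
```

On these figures the rollback ends at 16:00 UTC, which matches the stated restore time, and the total impact window is about 70 minutes.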

Next Steps:

  • For the time being, the maintenance process has been turned off for this specific shard.
  • In the first instance, we’ll need to configure the table in question correctly and ensure the configuration is successful.
  • Secondly, we’ll align this maintenance process with our other maintenance process and add a check on how much data will be processed. The new check will identify any issues, prevent the maintenance from running when it finds one, and prompt an investigation by our team.
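
Purely as an illustration of the data-volume check described in the last step, a pre-flight guard could look roughly like the sketch below. Everything in it is hypothetical: the row threshold, the TableInfo fields, and the function names are assumptions made for the example and are not taken from the actual maintenance tooling.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical limit; the report does not state what threshold will be used.
MAX_ROWS_PER_RUN = 50_000_000


@dataclass
class TableInfo:
    """Illustrative summary of a table that maintenance is about to touch."""
    name: str
    row_count: int
    configured_correctly: bool


def precheck(table: TableInfo, max_rows: int = MAX_ROWS_PER_RUN) -> Tuple[bool, str]:
    """Decide whether maintenance may run; return (allowed, reason)."""
    if not table.configured_correctly:
        return False, f"{table.name}: configuration check failed"
    if table.row_count > max_rows:
        return False, f"{table.name}: {table.row_count} rows exceeds limit of {max_rows}"
    return True, "ok"


def run_maintenance(
    table: TableInfo,
    maintain: Callable[[TableInfo], None],
    alert: Callable[[str], None],
    max_rows: int = MAX_ROWS_PER_RUN,
) -> bool:
    """Run maintenance only if the pre-check passes; otherwise alert the team."""
    allowed, reason = precheck(table, max_rows)
    if not allowed:
        alert(reason)  # prompts investigation instead of starting the job
        return False
    maintain(table)
    return True


if __name__ == "__main__":
    # A table of the size involved in this incident would be skipped and flagged.
    big = TableInfo(name="example_table", row_count=500_000_000, configured_correctly=True)
    run_maintenance(big, maintain=lambda t: print("maintaining", t.name), alert=print)
```

The point of the design is that the check runs before any locks are taken, so an oversized or misconfigured table produces an alert rather than a long blocking operation.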
Posted Jul 28, 2020 - 10:20 BST

Resolved
Everything is back on track now. We’ll write a detailed report on the issue we experienced today and post it here as soon as it’s ready (in a day or two). Apologies for today’s mishap.
Posted Jul 27, 2020 - 17:05 BST
Identified
Our teams have isolated the issue on our side, and we’re now taking care of the problem. The issue is confined to a single database, so only a subset of customers is affected. Thanks for your patience, everyone. Regular service will resume very soon.
Posted Jul 27, 2020 - 16:27 BST
Investigating
Hey, folks. We're receiving some reports from a subset of customers experiencing intermittent errors when accessing our API or using other services such as imports and segments. We're investigating it now, and we'll post an update as soon as we can.
Posted Jul 27, 2020 - 16:20 BST