Delays in refreshing segments and running programs in our North America region (R2)
Incident Report for Dotdigital
Postmortem

Summary of Impact:

At approximately 10:14 UTC on Friday 6th May 2022, we detected issues communicating with Google BigQuery. This caused delays in refreshing segments and running programs in our North America region. The issue was resolved around 14:46 UTC on Friday 6th May.

Root Cause:

This was caused by an issue with multiple Google Cloud infrastructure components. The impact of the affected Google services can be seen here.

Mitigation:

Our timeline of events for this issue was (times stated in UTC):

10:24: We were alerted to a problem in our BigQuery data importer

10:31: We identified problems with syncing, causing delays in refreshing segments

11:00: We raised a support ticket with Google

11:04: We updated our status page to reflect delays in refreshing segments

11:44: Google acknowledged the issue affecting BigQuery and other GCP products

11:49: Google updated their status page to show the issue was also affecting Google Compute Engine and Persistent Disk

13:09: Queue backlogs started processing

13:39: We changed our status page to ‘monitoring’ the issue

13:40: All transactional data queues were empty and all backlogs had been processed

14:46: After further checks, we declared the issue resolved

Next steps:

We depend on Google’s BigQuery service for some aspects of our functionality and have configured multi-region support to increase availability. In this case, a failure in a single region caused BigQuery to fail even though other US regions were online. We’re working with Google to understand why this occurred.

Posted May 10, 2022 - 17:44 BST

Resolved
We’ve monitored the situation for some time now and we’re confident this issue is fully resolved.
Sorry it happened and for the interruption it caused. We’re going to write a report to share the specific details of today’s issue. We’ll attach it here as soon as it’s ready.
Posted May 06, 2022 - 15:46 BST
Monitoring
We're pleased to say our cloud service provider has resolved their issue, meaning we're now seeing immediate and sustained improvements.
Thanks for your patience while this issue was resolved.
We'll monitor the situation from here and post a final update once we're totally satisfied it's resolved.
Posted May 06, 2022 - 14:39 BST
Identified
We have confirmed that the problem is due to an issue with one of our third-party cloud providers.
Users in our North America region may experience some slowness with other parts of our platform, in particular eCommerce dashboards, in addition to those already mentioned.
We'll post another update shortly.
Sorry for any inconvenience this is causing to your day.
Posted May 06, 2022 - 13:07 BST
Investigating
We're investigating an issue with delays in refreshing segments and running programs in our North America region (R2).
Sorry if you're affected by this issue. Our tech team are working as quickly as possible to resolve it and get things back to normal. We'll share another update shortly.
Posted May 06, 2022 - 12:04 BST
This incident affected: North America - Dotdigital R2 (North America - Mail Sending, North America - Reporting).