Intermittent segment refresh and Product Recommendations issues in our Asia Pacific (R3) region
Incident Report for dotdigital
Postmortem

Summary of Impact:

At 04:12 UTC on Friday 30 July 2021, the following functionality started failing for customers in our Asia Pacific region (R3) of Engagement Cloud:

  • Segment refreshes: This had the knock-on effect of preventing program decision nodes from completing and cancelling campaign sends that rely on segments
  • Product recommendations and product blocks
  • Adding/importing insight data via the API
  • Ecommerce connectors: Magento 1, 2, Commerce Flow (Shopify, BigCommerce, WooCommerce and Shopware)
  • Ecommerce dashboards: Commerce Intelligence

We restored all services at 10:49 UTC and any backlogs were cleared within a few minutes.

Root Cause:

At 04:12 UTC, Site Reliability Engineers (SREs) at Google Cloud Platform (GCP) added our Engagement Cloud Asia Pacific instance (R3) to a blocklist after they noticed excessive query loads on their BigQuery product in their Australia-east region. They were initially concerned that the load could impact all of their customers. At 08:00 UTC, Google alerted our teams to the issue, and by 08:15 UTC we had identified the root cause: the excessive load was being generated by our Ecommerce Retail Dashboards updating process.

Mitigation:

We disabled the Ecommerce Retail Dashboards updating process at 09:15 UTC. We informed engineers at Google that we had stopped the problematic query loads and asked them to remove the block. As a precaution, we also paused Segment refreshes and Programs at this time whilst we waited for another update from Google.

At 10:49 UTC, Google lifted the block, so we restarted our services. Our monitoring confirmed everything was working as expected and service backlogs cleared within a few minutes. Some Programs that were scheduled to execute whilst the service was offline did not execute at their scheduled time. However, they were retried and sent successfully approx. 6 hours later.

Next Steps:

We’re really sorry for the disruption this incident caused. We have several actions to work through in order to prevent a recurrence:

  • Improved alerting: To allow us to react more quickly to excessive query loads in the future
  • Continued collaboration with Google: We’ll review what caused the blocklisting and review their recommendations on optimizing query performance
  • Defensive coding: We’ll review whether we can proactively spot excessive query loads that could potentially cause further blocklisting
  • Control measures: We’ll improve our ability to control the concurrency of requests
  • Improved Service Operations controls: To allow us to turn off excessive query loads on a per-account or per-feature basis
  • Quotas: We’ll explore the feasibility of setting Quotas on a per-account basis to minimize the impact of a single account
  • Ecommerce dashboard refreshing: We’ll fully restore automatic and manual refreshing to Ecommerce dashboards for all R3 customers (removing the current temporary restrictions we have in place).
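
To illustrate the control measures listed above, here is a minimal Python sketch of a per-account concurrency cap combined with a per-feature off switch. It is an illustration only, not our production code: the class name `QueryGate`, its methods and the feature names are hypothetical, and it makes no calls to BigQuery.

```python
import threading
from collections import defaultdict


class QueryGate:
    """Hypothetical gate that decides whether a query may start.

    It enforces two of the controls described above:
      * a cap on concurrent queries per account, and
      * a switch to turn off all queries for a given feature.
    """

    def __init__(self, max_concurrent_per_account: int) -> None:
        if max_concurrent_per_account < 1:
            raise ValueError("max_concurrent_per_account must be at least 1")
        self._limit = max_concurrent_per_account
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # account id -> running queries
        self._disabled_features = set()     # features currently switched off

    def disable(self, feature: str) -> None:
        """Switch a feature off so none of its queries can start."""
        with self._lock:
            self._disabled_features.add(feature)

    def enable(self, feature: str) -> None:
        """Switch a feature back on."""
        with self._lock:
            self._disabled_features.discard(feature)

    def try_start(self, account_id: str, feature: str) -> bool:
        """Reserve a slot for a query; return False if it must not run."""
        with self._lock:
            if feature in self._disabled_features:
                return False
            if self._in_flight[account_id] >= self._limit:
                return False
            self._in_flight[account_id] += 1
            return True

    def finish(self, account_id: str) -> None:
        """Release the slot reserved by a successful try_start call."""
        with self._lock:
            if self._in_flight[account_id] > 0:
                self._in_flight[account_id] -= 1
```

In this sketch, a caller would check `try_start(account_id, feature)` before issuing a query and call `finish(account_id)` once it completes, so a single account or feature cannot generate unbounded query load.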
Posted Jul 30, 2021 - 17:23 BST

Resolved
We're pleased to say manual refreshing is now available on our Ecommerce dashboards for the majority of our R3 customers. For now, we've needed to pause the auto-refreshing function of these dashboards for all R3 customers. A small number of customers are currently unable to refresh their Ecommerce dashboards at all (either automatically or manually). This is a temporary precaution we've needed to take to prevent a recurrence and will be removed soon.

Any campaigns that had earlier failed to send due to segments being unable to refresh will have been re-tried and will have successfully completed (approx. 6 hours after their scheduled time).

We're sorry this incident happened and for the interruption it caused. We're going to write a report to share the specific details of today's issue and finalize our work on fully rectifying our Ecommerce dashboard refreshing. We'll attach it here when it's ready.
Posted Jul 30, 2021 - 17:08 BST
Monitoring
We've implemented a fix and are pleased to say segment refreshes and programs in R3 are back up and running as normal. We're still working on fully rectifying and stabilizing the Ecommerce Dashboards. We're continuing to resolve and monitor the situation, and we'll post another update soon.
Posted Jul 30, 2021 - 12:10 BST
Update
We're continuing to work with our third-party provider to resolve the issue in R3. In addition to the affected services mentioned in our last update, Ecommerce Dashboards will not refresh until we resolve the underlying issue. More news from us shortly.
Posted Jul 30, 2021 - 11:37 BST
Identified
We've traced the issue to a third-party provider and we're working with them to resolve it. In the meantime, we've needed to pause program enrollments too. This means all sends in our R3 region that involve segment refreshes or are driven by programs are currently paused. Sends that aren't reliant on segments/programs will continue to complete successfully. We're working to resolve this as quickly as possible. We'll be back again with an update ASAP.
Posted Jul 30, 2021 - 10:22 BST
Investigating
We've discovered an intermittent issue with segment refreshes and Product Recommendations in our R3 region. We're working to fix it as quickly as possible. Sorry if this is affecting you, but things should be back to normal very soon. Look out for more news from us shortly.
Posted Jul 30, 2021 - 09:27 BST