At 04:12 UTC on Friday 30 July 2021, the following functionality started failing for customers in our Asia Pacific region (R3) of Engagement Cloud:
We restored all services at 10:49 UTC and any backlogs were cleared within a few minutes.
At 04:12 UTC, Site Reliability Engineers (SREs) at Google Cloud Platform (GCP) added our Engagement Cloud Asia Pacific instance (R3) to a blocklist after noticing excessive query loads on their BigQuery product in their Australia-east region. They were initially concerned that the load could impact all of their customers. At 08:00 UTC, Google alerted our teams to the issue, and by 08:15 UTC we had identified the root cause: the excessive load was being generated by our Ecommerce Retail Dashboards updating process.
We disabled the Ecommerce Retail Dashboards updating process at 09:15 UTC. We informed engineers at Google that we had stopped the problematic query loads and asked them to remove the block. As a precaution, we also paused Segment refreshes and Programs at this time whilst we waited for a further update from Google.
At 10:49 UTC, Google lifted the block, so we restarted our services. Our monitoring confirmed everything was working as expected, and service backlogs cleared within a few minutes. Some Programs that were scheduled to execute whilst the service was offline did not execute at their scheduled time. However, they were retried and sent successfully approximately 6 hours later.
We’re really sorry for the disruption this incident caused. We have several actions to work through in order to prevent a recurrence: