Summary of impact:
On Saturday 18th September 2021 at 00:33 UTC, we stopped processing web behaviour session data in our European region (region 1). From this time, until the incident resolution, we continued to collect and queue up most new session data, creating a backlog. However, a small amount of session data was not accepted.
Customers using web behaviour data may have experienced the following problems:
- Programs which rely on this data may not have executed as expected
- Segments which rely on this data may not have returned all expected contacts
- Abandoned Browse may not have worked as expected
- Product Recommendations may not have worked as expected.
In addition, between 16:15 UTC on 21st September and 07:42 UTC on 22nd September, data sent by clients’ browsers to the web behaviour tracking endpoint was not accepted.
We resumed processing of web behaviour data at 07:42 UTC on 22nd September 2021, and the accumulated backlog of data had cleared by 12:51 UTC on 23rd September 2021.
Root Cause:
Our investigation into this issue is ongoing. However, a recent platform upgrade is the likely cause. We rolled back the upgrade during the incident to help correct the issue.
Mitigation:
The timeline for resolving this issue was (times stated in UTC):
Friday 17th September 2021:
- 23:33: We stopped processing web behaviour session data in our European region.
Tuesday 21st September 2021:
- 16:15: The session collection in our CosmosDB became full, meaning no new data was being accepted.
Wednesday 22nd September 2021:
- 07:33: We identified the reason behind the errors and how CosmosDB had become full
- 07:42: We restarted webbehavior Pods, meaning new data could be accepted again
- 07:52: We scaled up all collections in our European region by adding capacity within our cloud provider
- 07:53: We began monitoring the data backlog
- 16:45: We created alerts to signal any subsequent processing issues.
Thursday 23rd September 2021:
- 07:33: We discovered the processing issues had reappeared
- 12:08: We reverted a recent change to the web behaviour services in case it was correlated with the outage
- 12:51: We had cleared the data backlog.
Next Steps:
We apologize for any inconvenience caused during this incident. As part of our continuous improvement program, we’ll:
- Continue to increase our understanding of why the outage occurred
- Implement specific monitoring of the affected data structures and create alerting to recognize a situation like this occurring in the future.
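As an illustration only, the kind of capacity monitoring described in the last step could be sketched as below. This is a minimal sketch under stated assumptions, not our actual implementation: the `CollectionUsage` type, the `check_capacity` function and the 80% and 95% thresholds are all hypothetical, and the usage figures would need to come from whatever metrics the storage provider exposes.

```python
from dataclasses import dataclass


@dataclass
class CollectionUsage:
    """Storage usage for one collection (hypothetical shape, not a provider API type)."""
    name: str
    used_bytes: int
    capacity_bytes: int


def check_capacity(usages, warn_at=0.80, critical_at=0.95):
    """Return (name, level, fill_ratio) for each collection at or above a threshold.

    The thresholds are assumed values chosen for illustration.
    """
    alerts = []
    for u in usages:
        if u.capacity_bytes <= 0:
            # Treat unknown or zero capacity as critical rather than dividing by zero.
            alerts.append((u.name, "critical", 1.0))
            continue
        ratio = u.used_bytes / u.capacity_bytes
        if ratio >= critical_at:
            alerts.append((u.name, "critical", ratio))
        elif ratio >= warn_at:
            alerts.append((u.name, "warning", ratio))
    return alerts
```

Run on a schedule, a check like this would raise a warning well before a collection fills completely, which is the condition that stopped new data being accepted at 16:15 on 21st September.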