Degraded performance for self-serve integrations
Incident Report for Dotdigital
Postmortem

Summary of impact:

At approximately 16:52 UTC on Thursday 14th April 2022, we detected issues with self-serve Integration Hub integrations in all regions: they were unable to complete their scheduled executions successfully. Services were restored across all regions at approximately 18:46 UTC the same day.

Customers may have experienced some of the following issues:

  • Delays in receiving data from self-serve Integration Hub integrations:

    • Address book and segment stats to Google Sheets
    • Campaign reports to Google Sheets
    • Eventbrite events synchronization
    • Eventbrite check in/out
    • Form stats and responses to Google Sheets
    • Zoom webinars
  • Possible data loss, because events could not be processed for:

    • SMS to Email
    • Typeform

Root Cause:

The system hosting the self-serve integrations experienced an issue with the data storage systems that hold their configuration and state. As a result, the integrations could not reliably retrieve their configuration, and any execution attempted during the outage period failed with errors.

Mitigation:

The timeline (in UTC) for resolving this issue was:

  • 16:52: We detected self-serve integrations terminating with errors and started investigating
  • 17:12: We established the issues were widespread and across all regions. The storage system used for storing state and configuration was identified as the root cause for the issues
  • 17:18: We raised a priority 1 ticket with our provider and continued to work with them to help identify the issue and test resolutions
  • 18:46: Our provider notified us that the storage issue was resolved. We verified this and started checking that integrations were functioning correctly
  • 20:29: After extensive checks and monitoring, we declared the incident resolved. Services had returned to normal operation at approximately 18:46 UTC

Next Steps:

  • Work with our provider to identify the root cause of the storage failure and undertake any appropriate actions to avoid a future incident (In progress)
  • Update the heartbeat check to verify that the storage is operational for the self-serve integrations. We believe this may give us faster detection of similar issues in the future (Completed 19th April 2022)
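
As an illustration only, a heartbeat check that also verifies storage could look roughly like the sketch below. It is not our production implementation: the ConfigStore interface, the InMemoryStore stand-in, the "heartbeat/probe" key and the 2-second latency threshold are all hypothetical names and values chosen for the example. The idea is to write a probe value, read it back, and report the check as unhealthy if the round trip fails, returns the wrong value, or takes too long.

```python
import time
import uuid
from typing import Dict, Protocol


class ConfigStore(Protocol):
    """Hypothetical interface for the configuration/state storage."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> str: ...


class InMemoryStore:
    """Stand-in store used only for this illustration."""

    def __init__(self, healthy: bool = True) -> None:
        self._data: Dict[str, str] = {}
        self.healthy = healthy

    def put(self, key: str, value: str) -> None:
        if not self.healthy:
            raise ConnectionError("storage unavailable")
        self._data[key] = value

    def get(self, key: str) -> str:
        if not self.healthy:
            raise ConnectionError("storage unavailable")
        return self._data[key]


def storage_heartbeat(store: ConfigStore, max_latency_s: float = 2.0) -> dict:
    """Write a probe value, read it back, and report storage health."""
    token = uuid.uuid4().hex  # unique per run, so stale reads are detected
    started = time.monotonic()
    try:
        store.put("heartbeat/probe", token)
        matches = store.get("heartbeat/probe") == token
    except Exception as exc:  # any storage failure marks the check unhealthy
        return {"healthy": False, "reason": f"storage error: {exc}"}
    elapsed = time.monotonic() - started
    if not matches:
        return {"healthy": False, "reason": "read-back mismatch"}
    if elapsed > max_latency_s:
        return {"healthy": False, "reason": "storage too slow"}
    return {"healthy": True, "reason": "ok"}
```

Calling storage_heartbeat(InMemoryStore(healthy=False)) returns an unhealthy result with a "storage error" reason, which is the kind of signal that could raise an alert before scheduled executions start failing.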
Posted Apr 19, 2022 - 17:10 BST

Resolved
We’ve monitored the situation and we’re confident this issue is fully resolved. Sorry it happened and for the interruption it caused.
Posted Apr 14, 2022 - 21:48 BST
Monitoring
All functionality is now restored. We're keeping a watchful eye on things to make sure it stays that way. We'll let you know when we’re 100% confident everything is fully back to normal.
Posted Apr 14, 2022 - 21:32 BST
Update
We are continuing to investigate the issue. Sorry if you're affected, but our tech team is working as quickly as possible to resolve the issue and get things back to normal.
Posted Apr 14, 2022 - 20:14 BST
Investigating
We're investigating an issue with accessing the Integration Hub to install any new self-service integrations. Those already installed may experience degraded performance. Sorry if you're affected, but our tech team is working as quickly as possible to resolve the issue and get things back to normal. We'll share another update shortly.
Posted Apr 14, 2022 - 19:02 BST
This incident affected: North America - Dotdigital R2 (North America - Integration Hub), Europe - Dotdigital R1 (Europe - Integration Hub), and Asia Pacific - Dotdigital R3 (Asia Pacific - Integration Hub).