Delays processing inbound replies & bounces
Incident Report for dotdigital
Postmortem

Summary of impact

After a regular application deployment on Wednesday 3rd March 2021, we encountered an issue processing bounces, replies and other inbound emails. There were two specific issues as a result:

  1. The processing of these inbound messages was delayed by up to 2 hours in some cases
  2. A small number of these inbound messages were lost.

There was no impact to outbound messages or transactional emails during this time.

A summarized timeline of events:

  • 08:16 UTC: We introduced a faulty change that delayed the processing of inbound emails and caused some emails to be lost
  • 08:46 UTC: We completed a rollback of the change, so no more email was lost, but delays in processing emails continued
  • 10:00 UTC: We confirmed email processing was back to normal and any backlog of emails was cleared.

Root cause

Our platform receives several different inbound email types:

  • Campaign bounces
  • Replies
  • Transactional email
  • Abuse reports.

They are processed by our inbound mail services and routed to the right parts of Engagement Cloud. We added a new piece of functionality to these inbound mail services in our scheduled application deployment, which had a bug impacting some of those email types. This bug caused some messages not to be saved correctly. Once the bug was fixed, we recovered the vast majority of these messages, but a small number couldn’t be recovered.

Mitigation

After we identified the issue, we rolled back to the previous version of the inbound email services to stabilize them. However, because we had accumulated a backlog of unprocessed email during the error period, we made a hotfix to make sure the backlog was processed quickly. Without this hotfix, the time to clear the backlog of unprocessed email would’ve been substantially longer.

Next Steps

We will: 

  • Rework the specific piece of functionality to make sure it has no further bugs, prior to re-releasing it
  • Add specific test cases for this scenario to avoid future recurrences
  • Change the failure mode to retain messages, rather than to discard them, in cases where an error occurs.
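
As a rough illustration of the last point, the sketch below shows one way a "retain on error" failure mode could look. It is a minimal example under stated assumptions, not a description of our inbound mail services: the `RetainingProcessor` class, its `handler` parameter and the `replay` method are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetainingProcessor:
    """Hypothetical processor that keeps any message whose handler fails."""

    handler: Callable[[str], None]
    retained: List[str] = field(default_factory=list)

    def process(self, message: str) -> bool:
        """Run the handler; on any error, retain the message instead of dropping it."""
        try:
            self.handler(message)
            return True
        except Exception:
            # Failure mode: keep the raw message so it can be replayed later.
            self.retained.append(message)
            return False

    def replay(self) -> int:
        """Re-run retained messages (for example after a fix); return how many succeeded."""
        pending, self.retained = self.retained, []
        # Messages that fail again are re-added to self.retained by process().
        return sum(self.process(m) for m in pending)
```

With this shape, a faulty handler leaves failed messages in `retained`, and calling `replay()` after the handler is fixed processes them again rather than losing them.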
Posted Mar 03, 2021 - 15:28 GMT

Resolved
Good news: we’ve released a fix for the issue, and the backlog of messages has cleared quickly. We’ll continue to keep a close eye on things, but we’re confident everything is back to normal now. We’ll write a detailed description of what caused the issue. Check back here in a day or two for the report.
Thanks for sticking with us while we sorted everything and sorry for the disruption.
Posted Mar 03, 2021 - 10:11 GMT
Identified
Thanks for your patience. We’ve identified the problem and are making good progress on resolving the issue. We’ll be back with another update very soon.
Posted Mar 03, 2021 - 09:54 GMT
Investigating
We've discovered an issue with delays to inbound replies & bounces, and our tech team are working flat out to fix it. Sorry if this is affecting you, but things should be back to normal very soon. Look out for more news from us shortly.
Posted Mar 03, 2021 - 09:08 GMT