Delayed Email Sends - Engagement Cloud - North America
Incident Report for dotdigital
Postmortem

Summary of impact:

Between approximately 11:05 UTC and 21:28 UTC on Saturday 25th April 2020, some email sends from our North American region failed.

A small number of users will have noticed that some campaign sends failed to complete and remained in their Outbox. We’ve identified the affected customers and have proactively reached out to help resolve the issue.

Root Cause:

Our email sending infrastructure was severely restricted after disks failed simultaneously in multiple servers. The disk failures were the result of a vendor firmware bug which causes disks to fail after a fixed period of time.

The failed servers could no longer accept new email, which caused application errors that affected campaign email sends.

Engagement Cloud compiles campaign email sends into batches of emails, and the batches are distributed over multiple email sending servers. During this period, some batches hit failed servers and others were sent by unaffected servers. As a result, some sends were unaffected, while others may have partially or completely failed to send.
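
As a rough illustration of why a single send could partially fail, the sketch below splits a recipient list into batches and assigns them to servers in turn. The round-robin assignment, the server names, and the batch size are assumptions made for illustration only; the report does not describe how Engagement Cloud actually distributes batches.

```python
from itertools import cycle


def split_into_batches(recipients, batch_size):
    """Split a recipient list into consecutive batches of at most batch_size."""
    return [recipients[i:i + batch_size]
            for i in range(0, len(recipients), batch_size)]


def distribute(batches, servers, failed):
    """Assign batches to servers round-robin (an assumed policy).

    Batches that land on a server in `failed` are treated as stuck;
    the rest are treated as sent.
    """
    sent, stuck = [], []
    for batch, server in zip(batches, cycle(servers)):
        if server in failed:
            stuck.append(batch)
        else:
            sent.append(batch)
    return sent, stuck


# Hypothetical data: 10 recipients, 3 servers, one of which has failed.
recipients = [f"user{i}@example.com" for i in range(10)]
batches = split_into_batches(recipients, batch_size=2)
sent, stuck = distribute(batches, ["mta-1", "mta-2", "mta-3"], failed={"mta-2"})

print(f"{len(sent)} batches sent, {len(stuck)} batches stuck")
```

In this toy run the campaign is only partially delivered: the batches routed to the healthy servers go out, while those routed to the failed server do not.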

Mitigation:

We removed the faulty servers from duty and campaign email sends continued using the remaining healthy machines.

Next Steps:

We’ve identified 3 follow-on work items:

  1. We’re working with our hardware supplier to ensure our remaining servers will not suffer from this firmware bug.
  2. We’ll review our firmware and hardware driver update policies to ensure we’re ahead of any future known issues with our supplier.
  3. Our software is able to cope with hardware failure if a server is offline. In this case the server remained online but was unable to accept new requests. We’ll therefore investigate further code changes to handle this situation.
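
One possible shape for the code change in item 3 is sketched below: rather than relying on whether a server is reachable, the sender treats a rejected request as a health signal and retries the batch on the next server. This is a hypothetical sketch, not the change dotdigital made; `ServerUnavailable`, `send_with_failover`, and the `send` callable are all names invented for this example.

```python
class ServerUnavailable(Exception):
    """Raised by the (hypothetical) send callable when a server rejects a batch."""


def send_with_failover(batch, servers, send, unhealthy=None):
    """Try each server in order until one accepts the batch.

    A server that is online but rejects the request is added to
    `unhealthy` so later batches skip it. Returns the accepting server.
    """
    unhealthy = set() if unhealthy is None else unhealthy
    for server in servers:
        if server in unhealthy:
            continue
        try:
            send(server, batch)
            return server
        except ServerUnavailable:
            unhealthy.add(server)
    raise ServerUnavailable("no healthy server accepted the batch")


# Hypothetical usage: "mta-2" is online but cannot accept new email.
def fake_send(server, batch):
    if server == "mta-2":
        raise ServerUnavailable(server)


unhealthy = set()
accepted_by = send_with_failover(["user@example.com"], ["mta-2", "mta-1"],
                                 fake_send, unhealthy)
print(accepted_by, unhealthy)
```

The set of unhealthy servers is passed in by the caller so that one rejection benefits every later batch in the same send.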
Posted Apr 27, 2020 - 13:50 BST

Resolved
This incident has been resolved.
Posted Apr 25, 2020 - 22:45 BST
Monitoring
The faulty hardware has been removed and sends have been restored.
Posted Apr 25, 2020 - 22:43 BST
Investigating
We are currently investigating a hardware issue in our North America sending node.
Some customers in this region may experience delayed or stuck sends.
We apologise for the inconvenience.
Further updates will be provided when available.
Posted Apr 25, 2020 - 19:57 BST
This incident affected: North America - Engagement Cloud r2 (North America - Mail Sending, North America - Transactional Email).