Between 06:30 and 07:50 UTC on 13th April 2022, while we deployed the latest version of Dotdigital, errors occurred in the event processing and contact import services. Some customers may have experienced issues with contact imports and open reporting for email campaign sends.
Between 08:40 and 09:15 UTC, we experienced errors in the image resizing service, meaning customers may have seen missing images in our EasyEditor. However, images in sent campaigns continued to load normally.
Contact imports
We included a bug fix in this release. Once deployed, the fix caused imports containing a suppressed contact with empty data in mapped fields to error, and those imports failed.
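As an illustration only, the sketch below shows one general way an importer can tolerate empty mapped fields and isolate row-level errors so that a single problem row does not fail the whole import. It is not the contact importer's actual code: the function names, the row format and the `email` key are all hypothetical.

```python
def normalise_field(value):
    """Treat missing or blank mapped-field values as empty instead of raising."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def import_rows(rows, suppressed_emails):
    """Process rows one at a time so one bad row cannot fail the whole import."""
    imported, skipped, failed = [], [], []
    for row in rows:
        try:
            email = row["email"].strip().lower()
            fields = {k: normalise_field(v) for k, v in row.items() if k != "email"}
            if email in suppressed_emails:
                skipped.append(email)  # suppressed contacts are skipped, not errors
            else:
                imported.append((email, fields))
        except (KeyError, AttributeError) as exc:
            failed.append((row, repr(exc)))  # record the failure and carry on
    return imported, skipped, failed
```

In this sketch a suppressed contact with empty data is reported as skipped, and a malformed row is reported as failed, while every other row in the file is still imported.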
Opens
As part of routine improvements to our services, we made a change to how we process open event messages. This change meant some messages failed to process correctly and open data was lost.
Images
Our ongoing application modernization work involved converting the image resizer to run on the latest framework. We are still identifying the root cause of this issue.
Our mitigation steps for these issues were (times stated in UTC):
Contact imports
06:31: We began our release
06:35: We saw errors in the contact importer
06:45: We rolled back the contact importer and contacts could be imported successfully.
Opens
06:31: We began our release
06:35: We saw errors in the event parser service and investigations began
07:05: Our investigations revealed opens were being lost and we reverted the service to the previous version. At this point, email open data was being captured as normal.
Images
08:04: We noticed a small number of errors on the image resizer application. This initially appeared as a brief database connection issue.
09:04: We began receiving customer reports of missing images in our EasyEditor. Our teams began an investigation.
09:18: We reverted the image resizer to the previous version and images loaded as expected in our EasyEditor.
The intended changes for the contact importer and opens were reworked and released successfully later the same day.
We are still investigating the images issue. Once we fully understand it, we will investigate running the old and new sites side by side. We will then be able to migrate traffic across, detecting any issues before customers are affected.
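The side-by-side approach can be pictured with a minimal sketch like the one below. It is an assumption-laden illustration, not our deployment tooling: the `CanaryRouter` class, the 5% starting weight, the 2% error threshold and the 50-request minimum are all hypothetical values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class CanaryRouter:
    """Splits traffic between an old and a new version of a service (hypothetical)."""
    new_weight: float = 0.05      # fraction of requests sent to the new version
    max_error_rate: float = 0.02  # assumed threshold; roll back above this
    min_samples: int = 50         # do not judge the new version on too few requests
    new_requests: int = 0
    new_errors: int = 0

    def choose(self, rng: random.Random) -> str:
        """Pick which version serves the next request."""
        return "new" if rng.random() < self.new_weight else "old"

    def record(self, version: str, ok: bool) -> None:
        """Record a request outcome; roll back if the new version is unhealthy."""
        if version != "new":
            return
        self.new_requests += 1
        if not ok:
            self.new_errors += 1
        if (self.new_requests >= self.min_samples
                and self.new_errors / self.new_requests > self.max_error_rate):
            self.new_weight = 0.0  # send all traffic back to the old version

    def step_up(self, increment: float = 0.10) -> None:
        """Shift more traffic to the new version while it remains healthy."""
        if self.new_weight > 0.0:
            self.new_weight = min(1.0, self.new_weight + increment)
```

Because only a small share of requests reaches the new version at first, a fault shows up in its error rate while most traffic is still served by the old version, and the rollback happens automatically.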
We have learned lessons from today's incidents and will adjust our future testing and deployment plans to prevent a recurrence.