Entity extraction predictions not appearing

Incident Report for Metamaze

Postmortem

Description of what went wrong

Historically, it was possible to deploy language-specific versions of a model: for example, one version of a model for NL and another version for FR. This was done through so-called “modelConnections”, which are language-dependent.
We noticed that this functionality was not actively used, and removed this ability in the front-end almost a year ago. The back-end theoretically still supported this scenario of language-specific model deployments.
While cleaning up tech debt, we refactored this part to simplify the code. That refactor also removed the associated unit tests, since the functionality was to be removed.
Before releasing this change, a migration script had to be run to ensure all modelConnections were updated to the multilingual version. However, due to a miscommunication between two colleagues, one of them believed the migration had already been performed when in fact it had not. As a result, when the updated code was released, some models only had language-specific model versions. For languages that were not officially deployed, no modelConnection was found, instead of processing falling back to the multilingual model as intended.
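
To make the intended fallback concrete, here is a minimal sketch in Python. It is not Metamaze's actual code: the dictionary layout, the "multilingual" key and both function names are assumptions made for illustration only. It shows the lookup that was supposed to happen, plus a pre-release check that would list document types whose modelConnections were not yet migrated.

```python
from typing import Optional

# Hypothetical key for the language-independent deployment.
MULTILINGUAL = "multilingual"


def resolve_model_version(connections: dict[str, str], language: str) -> Optional[str]:
    """Return the model version for `language`, falling back to the multilingual one.

    `connections` maps a language code (or MULTILINGUAL) to a model version.
    Returns None when neither a language-specific nor a multilingual
    connection exists, which is the situation that occurred in this incident.
    """
    if language in connections:
        return connections[language]
    return connections.get(MULTILINGUAL)


def find_unmigrated(document_types: dict[str, dict[str, str]]) -> list[str]:
    """List document types that still lack a multilingual connection."""
    return sorted(
        name for name, conns in document_types.items() if MULTILINGUAL not in conns
    )
```

With such a check, a release could be blocked whenever find_unmigrated returns a non-empty list, rather than relying on someone confirming that the migration was run.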

Impact

Projects with document types that had not released a new version of the model for a long time.
Uploads in languages that were not deployed for that document type.

Alerting and monitoring

Our regular pipeline uptime monitor did not fail, since the project it monitors only had a multilingual model deployed and was therefore not affected. Because we were not alerted automatically, Metamaze’s response was slower than it should have been.
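
As an illustration of how this gap could be narrowed, the sketch below probes several combinations of document type and language instead of a single multilingual setup. It is an assumption-laden example, not the monitor Metamaze runs: the function name, the probe pairs and the injected has_predictions callback are all hypothetical.

```python
from typing import Callable, Iterable


def failing_probes(
    probes: Iterable[tuple[str, str]],
    has_predictions: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return the (document_type, language) pairs whose probe yielded no predictions.

    `has_predictions` is a hypothetical callback that uploads a test document
    for the given pair and reports whether entity predictions came back.
    """
    return [
        (doc_type, language)
        for doc_type, language in probes
        if not has_predictions(doc_type, language)
    ]
```

A monitor built this way would raise an alert whenever the returned list is non-empty, so a setup with only language-specific deployments would be covered as well.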

Metamaze’s response (all times in CEST)

16:24 On-call first-line support escalated a customer support ticket, prioritising this as a potential P1 incident.
16:25 First response by a Metamaze engineer to estimate the extent of the incident.
16:34 Incident confirmed as P1 incident, escalation to the broader team and Metamaze leadership.
16:44 Opened incident on statuspage, alerting all customers of potential impact.
16:44 The release was reverted, so processing restarted for new uploads.
17:05 Faulty messages were submitted for reprocessing.
17:26 All faulty messages have finished reprocessing. Incident closed.

The following actions have been identified to prevent this from happening again

Defined ways to improve technical monitoring and alerting for production customers based on actual performance, outside of our internal monitoring projects
Re-definition of alert routing rules and prioritisation to escalate to the wider team faster
Improved automation rules w.r.t. alerting and monitoring
Improved the process for handling changes between on-call engineers and support

We apologize for the inconvenience this has caused for the impacted projects.

Posted Sep 22, 2023 - 15:01 CEST

Resolved

All lag has been processed now and all processing should be back to normal.

We will finish our internal investigations and will follow up with a detailed postmortem outlining this incident in the coming days.

Posted Sep 19, 2023 - 17:26 CEST

Monitoring

The lag is being processed at the moment. We expect it to take about 10 more minutes to work through the remaining lag.

Posted Sep 19, 2023 - 17:21 CEST

Identified

We believe we have identified the issue and have released a fix. New uploads should be processed correctly again. We are working on re-processing the backlogs of the uploads that failed and will update this incident accordingly.

From our preliminary research, it looks like the incident only affected projects where specific project languages were used.

Posted Sep 19, 2023 - 16:57 CEST

Investigating

We are currently investigating an issue where Entity Extraction predictions are not appearing for some languages.

Posted Sep 19, 2023 - 16:44 CEST

This incident affected: Processing pipeline.