Metamaze Status - Degraded performance - potentially due to Azure issues

Incident Report for Metamaze

Resolved

The entire backlog has now been processed, and everything should be back to normal. Please contact Metamaze Support (support@metamaze.eu) if you notice any remaining issues.

Posted Oct 20, 2023 - 12:53 CEST

Monitoring

The system is now scaling up successfully. For the impacted projects, processing will resume and any backlog will be processed automatically.
We will keep monitoring until the incident is fully resolved and the entire backlog has been processed.

Posted Oct 20, 2023 - 12:15 CEST

Identified

Update from Azure
> Azure Services - West Europe - Investigating
> Starting at 07:31 UTC on 20 October 2023, we've determined a power event has briefly impacted a subset of infrastructure in the West Europe region. We're currently in the process of restoring affected infrastructure. Updates will be shared as information becomes available.

The timestamps of the degraded performance on our side align with the start of the Azure incident, so our working hypothesis is that the Azure incident is preventing the models from scaling.

Impact
- Processing will be slower for models with spiky workloads, since the models cannot scale dynamically.
- Processing with models that were scaled down after not receiving traffic will be impossible, as they cannot scale back up.
- Output services that rely on Azure networking may receive 502 Bad Gateway errors.

We continue monitoring the cluster and will update this ticket as needed.

Posted Oct 20, 2023 - 10:46 CEST

Investigating

We are currently investigating a performance issue with cluster scaling. An ongoing incident in the Azure West Europe region may be related: https://azure.status.microsoft/en-gb/status

Posted Oct 20, 2023 - 10:33 CEST

This incident affected: Processing pipeline.