Delayed executions

Incident Report for Virtuoso

Postmortem

At around 0845 GMT on 6 February 2025, our engineering team identified an increased error rate when the platform attempted to launch new workers to provide Live Authoring sessions and run executions requested by customers. Shortly afterwards we confirmed that images could not be pulled from the Docker Hub registry, which meant we could not launch additional capacity. This degraded the customer experience until the upstream issue was corrected at 0913 GMT, after which capacity returned to normal as the backlog of work was cleared.

We have investigated how we can insulate ourselves from similar incidents in the future. After reviewing our infrastructure configuration, we identified that some monitoring “sidecar” containers, which we use to collect logging information, were fetching their images from the public Docker Hub registry. As a result, although our application images remained available during the incident, our containers could not launch correctly because the sidecar images could not be pulled. To prevent this from recurring, we will shortly deploy a change that switches these monitoring containers over to the same registry that the core Virtuoso services use.
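
For readers who want to check their own workloads for the same kind of hidden dependency, the following is a minimal sketch of an audit, not a description of our tooling. It assumes the standard Docker rule that an image reference resolves to Docker Hub unless its first path component looks like a registry host (it contains a "." or ":" or is "localhost"). The container names, image names, and the registry host `registry.example.internal` are all hypothetical.

```python
from typing import Dict, List

# Host names that all refer to the public Docker Hub registry.
DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io", "registry-1.docker.io"}


def registry_host(image: str) -> str:
    """Return the registry host that an image reference resolves to."""
    first, sep, _rest = image.partition("/")
    # The leading component names a registry only if it looks like a host:
    # it contains a dot or a port separator, or it is "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    # Otherwise the reference is implicitly a Docker Hub image.
    return "docker.io"


def pulls_from_docker_hub(image: str) -> bool:
    """True if launching this image requires Docker Hub to be reachable."""
    return registry_host(image) in DOCKER_HUB_HOSTS


def find_hub_dependencies(containers: Dict[str, str]) -> List[str]:
    """Return the names of containers whose images come from Docker Hub."""
    return sorted(
        name for name, image in containers.items() if pulls_from_docker_hub(image)
    )


if __name__ == "__main__":
    # Hypothetical worker definition: one application container, one sidecar.
    worker = {
        "app": "registry.example.internal/platform/worker:1.0",
        "log-sidecar": "example/log-collector:1.0",
    }
    print(find_hub_dependencies(worker))
```

In this hypothetical worker, only the sidecar is flagged: the application image names an explicit registry host, while the sidecar reference has none and therefore falls back to Docker Hub.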

Posted Feb 06, 2025 - 13:49 UTC

Resolved

This incident has been resolved.
Posted Feb 06, 2025 - 09:35 UTC

Monitoring

The upstream vendor has reported the outage has concluded. We are currently monitoring the cluster as it processes the backlog of jobs.
Posted Feb 06, 2025 - 09:30 UTC

Investigating

We are investigating a supplier incident that has caused reduced capacity of the Virtuoso job cluster. Customers will continue to be able to launch jobs, but they may take longer to start than usual.
Posted Feb 06, 2025 - 09:05 UTC
This incident affected: Jobs and bot cluster.