Increased error rate on our API
Incident Report for Virtuoso
Postmortem

On Tuesday 16 July, around 06:40 UTC, during a regular update of the platform on all environments, we had a severe service degradation on the EU production environment. Normally this would trigger an automatic rollback of the update, but in this case only a few of the services were able to return to the previous version, so manual intervention was required to restore full service availability.

The alarms in place immediately warned the team that something unexpected had occurred and that some services were unable to initialize the new versions. After analyzing the available data and the impact of potential decisions, at 07:20 UTC the team triggered a manual downgrade of the service back to the last known working version, restoring full availability at 08:10 UTC.

We are sorry for the inconvenience caused to the customers affected by this outage. This falls below the standard of operational excellence that we wish to provide to our users. We have implemented and released a product change that hardens the platform's ability to recover automatically in the scenario described.

Posted Jul 17, 2024 - 09:42 UTC

Resolved
This incident has been resolved.
Posted Jul 16, 2024 - 09:05 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 16, 2024 - 07:51 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 16, 2024 - 07:42 UTC
Investigating
We are currently investigating this issue.
Posted Jul 16, 2024 - 06:50 UTC
This incident affected: Application & API and Jobs and bot cluster.