Increased error rate on our API

Incident Report for Virtuoso

Postmortem

On Tuesday 16 July 2024, at around 06:40 UTC, a regular platform update across all environments caused a severe service degradation on the EU production environment. A degradation like this normally triggers an automatic rollback of the update, but in this case only a few of the services were able to return to the previous version, so manual intervention was required to restore full service availability.

Existing alarms immediately warned the team that something unexpected had occurred and that some services were unable to initialize their new versions. After analyzing the available data and the impact of the possible responses, the team triggered a manual downgrade of the service to the last known working version at 07:20 UTC, restoring full availability at 08:10 UTC.

We are sorry for the inconvenience this outage caused to affected customers. It falls below the standard of operational excellence that we wish to provide to our users. We have implemented and released a product change that hardens the platform's automated recovery for the scenario described.

Posted Jul 17, 2024 - 09:42 UTC

Resolved

This incident has been resolved.

Posted Jul 16, 2024 - 09:05 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 16, 2024 - 07:51 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jul 16, 2024 - 07:42 UTC

Investigating

We are currently investigating this issue.

Posted Jul 16, 2024 - 06:50 UTC

This incident affected: Application & API and Jobs and bot cluster.