On Tuesday 16 July, at around 06:40 UTC, during a regular update of the platform on all environments, we experienced a severe service degradation on the EU production environment. Normally this would trigger an automatic rollback of the update, but in this case only a few of the services were able to return to the previous version, so manual intervention was required to restore full service availability.
Alarms in place immediately warned the team that something unexpected had occurred and that some services were unable to initialize the new versions. After analyzing the available data and the impact of the possible courses of action, the team triggered a manual downgrade of the service to the last known working version at 07:20 UTC, and full availability was restored at 08:10 UTC.
We are sorry for the inconvenience caused to the customers affected by this outage. It falls below the standard of operational excellence that we wish to provide to our users. We have implemented and released a product change that hardens the platform's automated ability to recover in the scenario described.