Executions not starting or completing correctly
Incident Report for Virtuoso
Postmortem

On Wednesday, November 27th, between 07:15 and 09:15 UTC, we had an outage of the job submission and processing systems in the EU production environment. Alarms at 07:28 notified the team of an increased error rate connected to this component, and the team immediately started investigating. At 08:05, after finishing the initial triage, we took actions to contain the problem. Of these, we highlight that at 08:15 we temporarily disabled job creation to allow the system to process the pending requests, and re-enabled it near the end of the outage window once the system had stabilized.

Between 07:15 and 08:15, some submitted jobs were not correctly populated with the journeys to execute, and some API requests connected to execution details took significantly longer to run. Jobs affected by this will automatically time out and will not execute. We advise customers to retry any jobs that are in this state.

Live authoring executions remained available during the outage, although they were slightly impacted by the API slowdown.

Posted Nov 29, 2024 - 11:56 UTC

Resolved
Executions are now launching and continuing normally, and the API is fully operational.
Posted Nov 27, 2024 - 11:29 UTC
Monitoring
The fix has reached our production environments, and we are continuing to monitor the situation. Executions are now unblocked.
Posted Nov 27, 2024 - 09:16 UTC
Identified
We have identified the root cause of the issue and are working to implement a fix.
Posted Nov 27, 2024 - 08:33 UTC
Update
We are continuing to investigate this issue.
Posted Nov 27, 2024 - 08:08 UTC
Investigating
We are observing difficulties launching and completing jobs on the Virtuoso cluster. We are investigating the issue and will provide an update when we have identified the cause.
Posted Nov 27, 2024 - 08:07 UTC
This incident affected: Application & API and Jobs and bot cluster.