Our team was alerted that the bridge service was unavailable for a large number of our customers. We immediately began investigating and traced the problem to a failure to discover some of our bridge services.
Prior to this, our team had made a number of improvements and refactorings that changed part of the discovery logic. These changes inadvertently prevented a number of older bridge servers from being discovered by our bot, which led to execution failures.
Although we have extensive tests for both our bridge service and the discovery logic (including full end-to-end validation of creating and using a bridge), this issue escaped our testing because it affected only a subset of the services to be discovered.
Upon identifying the root cause, we immediately shipped a fix to the discovery logic that resolves the issue. We apologize for any trouble this caused for the subset of customers using our bridge service, and we are putting mitigations in place to ensure this issue does not occur again.