Backstage experienced issues related to video, audio, and screen-sharing

Incident Report for Swapcard

Postmortem

This is a post-incident analysis of a service disruption that affected our Backstage product. During the incident, Backstage users encountered problems with video, audio, and screen-sharing, which caused content to be displayed incorrectly on the main broadcast stage.

The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service.

Incident Overview

During the week of October 17th, a technical issue came to our attention. It occurred whenever speakers or moderators were added to or removed from the main stage. These changes adversely affected the encoder, which is responsible for monitoring them and generating the final broadcast output for our end users. As soon as we identified this critical problem, we reported it to our service provider.

On October 18th at 10:00 AM CET, our external service provider confirmed that they had recently released a significant backend update known as Mesh SFU, designed to handle large-scale sessions. Unfortunately, this update introduced a bug specific to a rare scenario involving role changes.

Mitigation and Resolution

As soon as the incident was reported, our team informed our partner about the unusual behavior observed in the video, audio, and screen-sharing features, which were not functioning as expected on the main stage. We assured our partner that we would address and resolve these issues within the agreed-upon Service Level Agreement (SLA).

To deliver an exceptional experience for Swapcard users, we use our provider's API to build a distinctive workflow for our partners. To prevent similar issues in the future, we are in discussions to ensure this workflow is thoroughly covered by the provider's test scenarios. We are also exploring options to add this scenario to our own automated tests. The ultimate goal is to establish effective testing procedures and detect such bugs early.

Future Planning

This incident has underscored the potential for enhancements in our processes and controls. While we already have established procedures in place, we acknowledge the opportunity for improvement. This proactive approach ensures that we continually strengthen the resilience of our systems and minimize the potential for disruptions.

Posted Oct 25, 2023 - 16:00 UTC

Resolved

The incident was resolved on October 18th at 10:00 AM CET.

Posted Oct 17, 2023 - 09:00 UTC