This post-incident analysis covers a service disruption that affected our Event App product. During the incident, Event App users encountered significant latency and a degraded user experience, leading to “Stay tuned” unavailability pages and potentially to “Oops, something went wrong” errors during the login phase.
The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service.
In the first week of November, we identified a technical issue that had initially gone unnoticed by our internal monitoring system. Incidents of this nature are typically detected promptly by our automated monitoring, but that was not the case this time, which delayed reporting on our status page. We became aware of the incident after receiving reports of service unavailability during a specific timeframe. The Swapcard Response Team promptly addressed these reports by conducting a thorough investigation into the reported timeline.
On October 25th, around 2:00 PM CET, Swapcard experienced a series of both related and unrelated events that exerted significant pressure on our various systems. These events can be summarised as a substantial increase in inbound traffic coinciding with a service redeployment and congestion on our infrastructure nodes. It's worth highlighting that individually, none of these events typically disrupt our services. Swapcard is accustomed to efficiently managing large trade shows with numerous attendees and a continuous influx of inbound requests without causing any disruptions.
In order to avoid singling out a particular event, we regard these occurrences as a single disruption. Our investigation revealed that it was the combination of these unexpected events that led to the disruption of the user experience on our Event App.
The system recovered automatically about 5 minutes after the disruption began, at 2:05 PM CET. This automatic recovery is the result of the safeguards implemented by our Site Reliability Engineering team to ensure fast recovery and mitigation of such incidents for our customers and end-users.
As soon as the incident was reported, our team immediately initiated an investigation to ensure a thorough understanding and accurate reporting of the situation. Throughout the investigative phase, which took place from November 1st to November 2nd, the team identified several areas for improvement to enhance the management of both related and unrelated events, which are seldom encountered together.
As part of our commitment to providing an outstanding experience for Swapcard users, we have implemented some of these changes and planned others to improve how we handle unforeseen events.
These improvements encompassed:
This incident has underscored the potential for enhancements in our processes and controls. While we already have established procedures in place, we acknowledge the opportunity for improvement. This proactive approach ensures that we continually strengthen the resilience of our systems and minimize the potential for disruptions.