Significant latency and impaired user experience impacting Event App

Incident Report for Swapcard

Postmortem

This postmortem provides a post-incident analysis of a service disruption that affected our Event App product. During the incident, Event App users experienced significant latency and a degraded user experience, which led to “Stay tuned” unavailability pages and, in some cases, “Oops, something went wrong” errors during login.

The purpose of this post-mortem is to share insights into our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore normal service.

Incident Overview

In the first week of November, we identified a technical issue that had gone unnoticed by our internal monitoring system. Incidents of this nature are normally detected promptly by our automated monitoring, but that was not the case this time, which delayed reporting on our status page. We became aware of the incident after receiving reports of service unavailability during a specific timeframe. The Swapcard Response Team then conducted a thorough investigation of the reported timeline.

On October 25th, around 2:00 PM CEST, Swapcard experienced a series of both related and unrelated events that put significant pressure on several of our systems. In summary, a substantial increase in inbound traffic coincided with a service redeployment and congestion on our infrastructure nodes. Individually, none of these events typically disrupts our services. Swapcard routinely handles large trade shows with numerous attendees and a continuous influx of inbound requests without disruption.

Rather than single out one particular event, we treat these occurrences as a single disruption. Our investigation showed that it was the combination of several unexpected events that led to the degraded user experience on our Event App.

The system recovered automatically at 2:05 PM CEST, about 5 minutes after the disruption began. This automatic recovery reflects the safeguards implemented by our Site Reliability Engineering team to ensure fast recovery and mitigation of such incidents for our customers and end users.

Mitigation and Resolution

As soon as the incident was reported, our team began an investigation to ensure a thorough understanding and accurate reporting of the situation. During the investigation, which took place from November 1st to November 2nd, the team identified several areas for improvement in how we handle combinations of related and unrelated events that seldom occur together.

As part of our commitment to providing an outstanding experience for Swapcard users, we have implemented or planned changes to improve how we handle unforeseen events.

These improvements include:

Enhanced monitoring to more effectively detect unforeseen events, resulting in quicker incident assessment.
Additional safeguards, to be implemented by the Site Reliability Engineering team, to alleviate infrastructure node congestion and reduce the impact of potential disruptions.
A review of the cache mechanism during this timeframe to further reduce the potential impact of disruptions under high load.
An evaluation and update of the automatic redeployment procedure to enable a more incremental rollout, thereby reducing strain on infrastructure nodes.
Reduction of the memory footprint of one of our logging systems, which was exerting pressure during high loads.

Future Planning

This incident has highlighted room for improvement in our processes and controls. While we already have established procedures in place, we will continue to refine them to strengthen the resilience of our systems and minimize the potential for disruptions.

Posted Nov 02, 2023 - 19:35 UTC

Resolved

The incident began around 2:00 PM (CEST) and the system autonomously returned to normal around 2:05 PM (CEST). We will now perform a comprehensive post-mortem analysis to uncover the underlying issue. During our initial investigation, we found a combination of both correlated and uncorrelated events, such as fluctuations in request volumes, redeployments, server activity, and cache congestion, which collectively exerted substantial strain on Swapcard systems, resulting in elevated latency and a compromised user experience.

Posted Nov 25, 2023 - 13:00 UTC