We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Wednesday, the 17th of May, 2023, from 17:01 UTC to 17:20 UTC.
The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.
We encountered a significant outage on Wednesday, the 17th of May, at 17:01 UTC, when our production pods running on Kubernetes were abruptly terminated in a cascading manner. This disruption resulted in service delivery issues for Swapcard across all regions.
On Wednesday, the 17th of May at 17:01 UTC, our Kubernetes cluster experienced a significant issue in which a large number of production pods were unexpectedly terminated. The issue arose from a recent upgrade of the pod scheduler version, combined with a specific parameter that caused problems in a particular scenario. It is important to note that this configuration had been running without any problems for multiple days and had been tested successfully on our non-production clusters. Unfortunately, a combination of manual actions taken to upgrade our integration services, together with this specific configuration, triggered an unintended cascading termination of our production pods. These changes and manual actions were made in accordance with Swapcard's standard infrastructure and security change practices.
Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers (SREs), identified the root cause and implemented a mitigation strategy to prevent any future incidents related to the affected component, the pod scheduler mechanism. We have a high level of confidence that this component will not lead to similar mass terminations of our production pods in the future.
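The report does not name the exact scheduler parameter or mitigation involved. As an illustrative sketch only, not Swapcard's actual configuration, a Kubernetes PodDisruptionBudget is one standard way to cap how many pods of a workload can be evicted at once, limiting the blast radius of a cascading termination; all names and labels below are hypothetical.

```yaml
# Illustrative only: a PodDisruptionBudget that allows at most one
# voluntary eviction of matching pods at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                # hypothetical name
spec:
  maxUnavailable: 1            # at most one pod disrupted concurrently
  selector:
    matchLabels:
      app: example-api         # hypothetical workload label
```

A budget like this constrains voluntary disruptions (such as scheduler- or descheduler-initiated evictions) so that a misbehaving component cannot drain an entire deployment in one sweep.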
We want to emphasize that we handled this incident in compliance with our Disaster Recovery Plan (DRP). The utilization of our GitOps methodology and Infrastructure as Code (IaC) approach proved invaluable in minimizing the impact on our customers and reducing the resolution time. The situation we encountered on Wednesday, May 17th, can be classified as a worst-case scenario from an infrastructure standpoint.
Our monitoring systems detected a disruption in traffic at 17:01 UTC, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritize and restore services to minimize the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams.
The service disruption ended once the 80 affected pods were restarted and redeployed. This incident brought to light improvements we can implement to achieve even faster recovery times in a worst-case scenario. Following the restoration of the Kubernetes pods, Swapcard engineering diligently monitored all services to ensure a complete and proper recovery, which was achieved by approximately 17:20 UTC.
From that point, customers would have observed Swapcard's services as available again. At 17:24 UTC, Swapcard officially confirmed that services had been restored to pre-incident levels, with traffic having returned to its rate prior to the incident.
Time alerted to the outage: 1 minute
Time to identify the source of disruption: 1 minute
Time to initiate recovery: 2 minutes
Time to monitor and restore pre-incident capacity: 14 minutes
Please note that the durations above are presented in chronological order for clarity and coherence.
This incident has brought attention to potential enhancements we can implement to further improve our recovery time in worst-case scenarios, in line with our Disaster Recovery Plan (DRP). Although our existing procedures and controls served us well, we recognize the opportunity for improvement.
We assess the probability of a similar issue recurring as extremely low. However, we remain committed to minimizing any residual risk through future enhancements to our infrastructure and procedures. This proactive approach ensures that we continue to strengthen the resilience of our systems and mitigate potential disruptions.