We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Thursday, the 15th of June, 2023, from 12:55 UTC to 13:35 UTC.
The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.
On Thursday, June 15th, at 12:55 UTC, we experienced latency issues and encountered an unexpected error page saying "Stay tuned." This occurred due to a memory leak leading to intensive CPU usage on one of our main APIs, which runs on Node.js. The memory leak occurs under certain circumstances during heavy load, causing latency across the product, mostly because this service acts as the bridge between the interface and the databases.
At 12:57 UTC on Thursday, June 15th, one of our critical services encountered a problem with excessive memory and CPU usage. This issue arose due to an undetected memory leak, which resulted in intensive garbage collection tasks. As a consequence, memory was not freed up properly, and the CPU usage became significantly high. The combined effect of these symptoms caused various services to restart, affecting the replication of those services as well.
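To illustrate the failure mode described above, the sketch below shows a common Node.js leak pattern: a module-level cache that is never evicted retains every entry, so the heap grows under load and garbage collection consumes more and more CPU. This is a hypothetical example for illustration only, not the actual code path involved in the incident.

```javascript
// Hypothetical leak pattern: an unbounded module-level cache.
const cache = new Map();

function handleRequest(key, compute) {
  if (!cache.has(key)) {
    cache.set(key, compute(key)); // entries accumulate forever
  }
  return cache.get(key);
}

// Under heavy load with many distinct keys, the cache grows without
// bound: memory is never freed, and GC pauses drive CPU usage up.
for (let i = 0; i < 50000; i++) {
  handleRequest(`request-${i}`, (k) => ({ payload: k }));
}
console.log(cache.size); // 50000 retained entries
```

Because each key is distinct, no entry is ever reused or released; the garbage collector cannot reclaim anything reachable from the cache, which matches the symptoms above: memory not freed and intensive collection cycles.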
Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers (SREs), identified the root cause and implemented a mitigation strategy to prevent incidents related to the affected components, given the criticality of these components within our architecture.
Our monitoring systems detected a disruption in traffic at 12:55 UTC, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritise and restore the quality of services to minimise the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams.
To resolve the issue, our initial course of action involved implementing a mitigation strategy that gradually scaled up the various services. This approach aimed to distribute the memory load among a larger number of pods, preventing any individual pod from going offline due to excessive memory consumption. As a result, the services were able to recover more quickly, granting us additional time to investigate the underlying cause and implement appropriate long-term solutions.
The second strategy involved conducting in-depth analysis to identify the source of the abnormal memory consumption. We performed multiple analyses in the production environment, but the issue could not be reproduced in our staging and development environments, even under substantial simulated load. To troubleshoot this memory usage, we utilised the --heapsnapshot-signal command-line flag to delve into the memory allocation of our services. We conducted two analyses, one at the start of the service and another under heavy load. Because the investigation was performed in the production environment, it took longer than usual to pinpoint the issue while prioritising the live environment's performance.
After conducting further investigation, we successfully identified the cause of the memory retention and addressed it to ensure stable memory usage in such circumstances. This approach also prevented excessive CPU usage caused by garbage collectors attempting to reclaim the used memory.
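The report does not detail the exact code change, but a common remediation for this class of retention problem is to bound what is kept in memory so old entries are evicted instead of accumulating. The sketch below shows one such pattern, a small least-recently-used cache; it is an illustrative assumption, not the actual fix applied to our service.

```javascript
// Hypothetical remediation sketch: bound retention with LRU eviction.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map(); // Map preserves insertion order
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark this entry as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first in iteration order).
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

const cache = new BoundedCache(1000);
for (let i = 0; i < 100000; i++) cache.set(`key-${i}`, i);
console.log(cache.map.size); // 1000: memory stays bounded under load
```

With retention capped, the heap stabilises under heavy load and the garbage collector no longer spends excessive CPU trying to reclaim memory that is still reachable.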
Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.
In accordance with our high standards of deliverability, Swapcard has made several improvements to the memory consumption of two of our services to prevent further incidents of the same type. We have also improved our capacity to detect such issues earlier, preventing impact on our customers. Procedures and controls were already in place, but this incident highlights the need for improvement.