We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Thursday, the 15th of June, 2023, from 12:55 UTC to 13:35 UTC.
The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.
On Thursday, June 15th, at 12:55 UTC, we experienced latency issues and encountered an unexpected error page saying "Stay tuned." This occurred due to a memory leak leading to intensive CPU usage on one of our main APIs, which runs on Node.js. The memory leak occurs under certain circumstances during heavy load, causing latency across the product, mostly because this service acts as the bridge between the interface and the databases.
At 12:57 UTC on Thursday, June 15th, one of our critical services encountered a problem with excessive memory and CPU usage. This issue arose due to an undetected memory leak, which resulted in intensive garbage collection tasks. As a consequence, memory was not freed up properly, and the CPU usage became significantly high. The combined effect of these symptoms caused various services to restart, affecting the replication of those services as well.
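To illustrate the failure mode described above, the sketch below shows a common Node.js leak pattern: a module-level cache that is never evicted retains every entry, so the heap grows under load and garbage collection consumes more and more CPU. This is a hypothetical example for illustration only, not the actual code path involved in the incident.

```javascript
// Hypothetical leak pattern: an unbounded module-level cache.
const cache = new Map();

function handleRequest(key, compute) {
  if (!cache.has(key)) {
    cache.set(key, compute(key)); // entries accumulate forever
  }
  return cache.get(key);
}

// Under heavy load with many distinct keys, the cache grows without
// bound: memory is never freed, and GC pauses drive CPU usage up.
for (let i = 0; i < 50000; i++) {
  handleRequest(`request-${i}`, (k) => ({ payload: k }));
}
console.log(cache.size); // 50000 retained entries
```

Because each key is distinct, no entry is ever reused or released; the garbage collector cannot reclaim anything reachable from the cache, which matches the symptoms above: memory not freed and intensive collection cycles.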
Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers (SREs), identified the root cause and implemented a mitigation strategy to prevent incidents related to the affected components, given the criticality of these components within our architecture.
Our monitoring systems detected a disruption in traffic at 12:55 UTC, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritise and restore the quality of services to minimise the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams.
To resolve the issue, our initial course of action involved implementing a mitigation strategy that gradually scaled up the various services. This approach aimed to distribute the memory load among a larger number of pods, preventing any individual pod from going offline due to excessive memory consumption. As a result, the services were able to recover more quickly, granting us additional time to investigate the underlying cause and implement appropriate long-term solutions.
The second strategy involved conducting in-depth analysis to identify the source of the abnormal memory consumption. We performed multiple analyses in the production environment, but the issue could not be reproduced in our staging and development environments, even under substantial simulated load. To troubleshoot this memory usage, we utilised the --heapsnapshot-signal command-line flag to delve into the memory allocation of our services. We conducted two analyses, one at the start of the service and another under heavy load. Because the investigation was performed in the production environment, it took longer than usual to pinpoint the issue while prioritising the live environment's performance.
After conducting further investigation, we successfully identified the cause of the memory retention and addressed it to ensure stable memory usage in such circumstances. This approach also prevented excessive CPU usage caused by garbage collectors attempting to reclaim the used memory.
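The report does not detail the exact code change, but a common remediation for this class of retention problem is to bound what is kept in memory so old entries are evicted instead of accumulating. The sketch below shows one such pattern, a small least-recently-used cache; it is an illustrative assumption, not the actual fix applied to our service.

```javascript
// Hypothetical remediation sketch: bound retention with LRU eviction.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map(); // Map preserves insertion order
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark this entry as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first in iteration order).
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

const cache = new BoundedCache(1000);
for (let i = 0; i < 100000; i++) cache.set(`key-${i}`, i);
console.log(cache.map.size); // 1000: memory stays bounded under load
```

With retention capped, the heap stabilises under heavy load and the garbage collector no longer spends excessive CPU trying to reclaim memory that is still reachable.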
Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.
In accordance with our high standards of deliverability, Swapcard has made several improvements to the memory consumption of two of our services to prevent further incidents of the same type. We have also improved our capacity to detect such issues earlier, preventing impact on our customers. Procedures and controls were already in place, but this incident highlights the need for improvement.