Login issues and unexpected logouts causing various internal server errors
Incident Report for Swapcard
Postmortem

We would like to present you with a retrospective report regarding a service disruption that affected Swapcard customers on the 18th of September, 2023, from 4:55 AM UTC to approximately 8:00 AM UTC.

The purpose of this retrospective is to provide insights into our initial assessment of the incident, as communicated on the Swapcard status page, and to outline the corrective actions we've taken to restore service. This incident involved login problems and unexpected logouts, together with several internal server errors.

Incident summary

On Monday, September 18th, at 04:55 AM UTC, we encountered an issue with user logins and logouts, along with various internal errors in the delivery of the service. The problem was intermittent: not all requests were failing, and users could still access the product. This made it harder to pinpoint and led to a slightly longer resolution time than usual.

The root cause of this problem was the automatic activation of our security protection system when it detected potentially malicious or system-flooding traffic. Specifically, one particular request triggered our Web Application Firewall (WAF) to flag certain IPs as a security precaution. Subsequently, our corporate rules blocked actions originating from these flagged IPs. Unfortunately, one of Swapcard's own IP addresses was among those incorrectly blocked, resulting in a range of issues.

Immediately following the incident, the Infrastructure Team at Swapcard, in collaboration with our Site Reliability Engineers (SREs) and Security Team, identified the root cause and implemented a mitigation strategy to prevent further incidents related to the affected components, given how critical these components are within our architecture.

Our systems and team detected a disruption in traffic, and as a result, the Swapcard Incident Response team was promptly activated. Our team worked diligently to prioritize and restore the quality of services to minimize the impact on our customers. Concurrently, we conducted a thorough investigation into the cause of the issue and implemented mitigations with the assistance of dedicated members from the Infrastructure and SRE teams.

Mitigation deployment

As part of our mitigation plan, we are enhancing our Web Application Firewall (WAF) system for flagged IPs by implementing a range of rules. This is aimed at preventing recurring issues where certain IPs are mistakenly flagged due to the actions of an individual who has been flooding our system.
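
To illustrate the kind of rule described above, here is a minimal sketch in Python of exempting trusted internal addresses from automatic blocking. It is an assumption-laden illustration, not Swapcard's actual WAF configuration: the names `TRUSTED_NETWORKS`, `is_trusted`, and `ips_to_block` are hypothetical, and the addresses come from reserved documentation ranges.

```python
import ipaddress

# Hypothetical allowlist of the platform's own address ranges.
# 192.0.2.0/24 is a reserved documentation range, used here as a placeholder.
TRUSTED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_trusted(ip: str) -> bool:
    """Return True if the address belongs to one of the trusted networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED_NETWORKS)


def ips_to_block(flagged_ips):
    """Keep only flagged addresses that are not on the allowlist."""
    return [ip for ip in flagged_ips if not is_trusted(ip)]
```

With a rule of this shape, an address flagged by flood detection would still be blocked unless it falls inside a trusted range, so a platform-owned server would not be cut off by the actions of an external client.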

Event Outline

Events of September 18th, 2023 (UTC):

  • 4:55 AM UTC: The initial onset of service disruption was observed.
  • 7:30 AM UTC: Swapcard Engineering identified the affected components as the root cause.
  • 7:55 AM UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity.
  • 8:00 AM UTC: The status was confirmed as resolved post-incident.

Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.
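
For readers who want the elapsed times between these events, the short Python sketch below derives them from the timestamps listed above. The dictionary keys and the helper name `minutes_since_onset` are illustrative only.

```python
from datetime import datetime

# Timestamps (UTC) taken from the event outline above; labels are illustrative.
EVENTS = {
    "onset": "04:55",
    "root_cause_identified": "07:30",
    "mitigated": "07:55",
    "resolved": "08:00",
}


def minutes_since_onset(label: str) -> int:
    """Minutes elapsed between the onset and the given event (same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[label], fmt) - datetime.strptime(EVENTS["onset"], fmt)
    return int(delta.total_seconds() // 60)


for name in ("root_cause_identified", "mitigated", "resolved"):
    print(name, minutes_since_onset(name), "minutes after onset")
```

This gives 155 minutes to root-cause identification, 180 minutes to mitigation, and 185 minutes to confirmed resolution.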

Forward Planning

In accordance with our high standards of service delivery, Swapcard has carried out and planned several improvements to the relevant components to prevent further incidents of the same type. We are also improving our capacity to detect such issues earlier, to prevent impact on our customers. Procedures and controls are already in place, but this incident highlights the need for improvement.

Posted Sep 18, 2023 - 12:38 UTC

Resolved
The incident has been successfully resolved, and we will conduct a more detailed post-mortem to identify the root cause. In the initial investigation, we discovered a problem with the firewall that resulted in our datacenter IPs being mistakenly identified as flooding our server. As a result, our system blocked the IPs of one of our servers, leading to various issues.
Posted Sep 18, 2023 - 08:08 UTC