Intermittent latency on the event app causing "Stay tuned" errors
Incident Report for Swapcard
Postmortem

This post-mortem report covers a service delivery disruption that affected Swapcard customers on Wednesday, June 14th, 2023. The incident reduced performance for certain aspects of the service, particularly during peak hours on Wednesday afternoon, and at certain moments caused the generic "Stay tuned" page to be displayed.

The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.

Incident summary

On Wednesday, June 14th, at 14:00 UTC, the event app experienced latency issues and some users were shown an unexpected error page saying "Stay tuned." This occurred due to a sudden surge of traffic caused by a slow Distributed Denial of Service (DDoS) attack, which resulted in delays in processing requests and frequent fluctuations in the performance of our clusters.

On the afternoon of Wednesday, June 14th, our public APIs were hit by a major Distributed Denial of Service (DDoS) attack. This produced an influx of approximately 3 million requests within a short span of time, causing a sudden increase in latency within our system. Our service scaled up once it detected the high incoming traffic associated with the slow DDoS attack. It is worth noting that each source in this attack sent requests at a rate just below our typical rate-limiting thresholds, and that the traffic was distributed across many users and IPs, which made it difficult for our security tools and rate limiter to detect it immediately.
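To illustrate why this kind of traffic is hard to catch, the following is a minimal sketch of a per-client fixed-window rate limiter. It is not Swapcard's actual rate limiter; the class name, the `simulate` helper, the limit of 100 requests per window, and the client counts are all hypothetical values chosen for illustration. It shows that when many clients each stay just below the per-client limit, nothing is blocked even though the aggregate volume is large.

```python
from collections import defaultdict
from typing import Tuple


class FixedWindowLimiter:
    """Per-client fixed-window rate limiter (illustrative only)."""

    def __init__(self, limit_per_window: int, window_seconds: int = 60):
        self.limit = limit_per_window
        self.window = window_seconds
        # Maps (client_id, window_index) to the number of requests seen.
        self.counts = defaultdict(int)

    def allow(self, client_id: str, timestamp: float) -> bool:
        key = (client_id, int(timestamp // self.window))
        if self.counts[key] >= self.limit:
            return False  # this client exceeded its own limit
        self.counts[key] += 1
        return True


def simulate(num_clients: int, requests_per_client: int, limit: int) -> Tuple[int, int]:
    """Send one window of traffic and return (allowed, blocked)."""
    limiter = FixedWindowLimiter(limit_per_window=limit)
    allowed = blocked = 0
    for c in range(num_clients):
        for i in range(requests_per_client):
            # All timestamps fall inside the same 60-second window.
            if limiter.allow(f"client-{c}", timestamp=float(i % 60)):
                allowed += 1
            else:
                blocked += 1
    return allowed, blocked


# Hypothetical numbers: 1,000 clients at 90 requests each, limit of 100.
print(simulate(num_clients=1000, requests_per_client=90, limit=100))
# The same volume from a single client is almost entirely blocked.
print(simulate(num_clients=1, requests_per_client=90000, limit=100))
```

In this sketch the distributed case passes every request, while the single-source case is cut off at the limit. A per-client limiter alone therefore cannot see the combined load, which is why aggregate signals matter for slow, distributed attacks.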

Right after the incident, the Infrastructure Team and Security Team at Swapcard joined forces with our Site Reliability Engineers (SREs) to determine the underlying cause. They took prompt measures to address the involved IPs and users, and devised a mitigation plan to prevent future incidents related to the affected components. As part of this plan, we have tuned our rate limiters to be more aggressive in order to safeguard the platform's stability and performance against similar slow DDoS attacks.

Earlier in the afternoon, our monitoring systems identified a disturbance in traffic, which led to the activation of the Swapcard Incident Response team. The team investigated the root cause of the incident, bearing in mind that slow Distributed Denial of Service (DDoS) attacks are typically challenging to detect because they blend in with legitimate traffic. Throughout, the team took care not to disrupt legitimate users, adjusting the rate limiter and WAF system gradually to strike an appropriate balance.

Note that affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Mitigation deployment

The interruption of service ceased as the rate limiters were fine-tuned and the Swapcard Incident Response team swiftly intervened to mitigate the effects on our customers. This incident highlighted areas where we can make enhancements to achieve even quicker scalability and absorption of traffic, especially considering the exceptionally high volume we experienced. Additionally, it emphasised the need to enhance our detection capabilities for slow Distributed Denial of Service (DDoS) attacks.
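As one possible way to think about detecting slow, distributed attacks, the sketch below flags a surge by comparing total request volume per time window against a rolling baseline, regardless of which client sent the requests. This is an assumption-laden illustration and does not describe Swapcard's monitoring: the class name `AggregateSurgeDetector`, the baseline length, and the surge factor of 3 are hypothetical.

```python
from collections import deque


class AggregateSurgeDetector:
    """Flags windows whose total traffic far exceeds a rolling baseline."""

    def __init__(self, baseline_windows: int = 10, surge_factor: float = 3.0):
        # Totals from recent non-surge windows form the baseline.
        self.history = deque(maxlen=baseline_windows)
        self.surge_factor = surge_factor

    def observe(self, total_requests: int) -> bool:
        """Record one window's total; return True if it looks like a surge."""
        is_surge = False
        # Only judge once enough history has been collected.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_surge = total_requests > baseline * self.surge_factor
        # Keep surge windows out of the baseline so it is not inflated.
        if not is_surge:
            self.history.append(total_requests)
        return is_surge


detector = AggregateSurgeDetector(baseline_windows=3, surge_factor=3.0)
for total in [1000, 1100, 900, 1050, 5000, 5200, 1000]:
    print(total, detector.observe(total))
```

A fixed multiplier over a short rolling average is deliberately simple. In practice, legitimate peaks such as a large live event starting can also multiply traffic, so a signal like this would likely need to be combined with other indicators before any blocking action is taken.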

Forward Planning

This incident has drawn attention to enhancements we can implement. Although we already have procedures and controls in place, we recognise the opportunity for improvement. Acting on these findings will help us continue to enhance the resilience of our systems and mitigate potential disruptions.

Posted Jun 15, 2023 - 15:42 UTC

Resolved
This incident has been resolved.
Posted Jun 14, 2023 - 14:23 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 14, 2023 - 14:12 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 14, 2023 - 14:07 UTC
Investigating
We are currently investigating this issue.
Posted Jun 14, 2023 - 14:00 UTC
This incident affected: Event App.