Please see our post-mortem below regarding a service disruption that affected Swapcard customers from March 29th, 2023 at 12:58 UTC through to 13:08 UTC.
It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.
On March 29th, 2023 at 12:58 UTC, we experienced increased latency on the Event App due to an excessive number of requests hitting our API gateway.
On March 29th, 2023 at 13:02 UTC, the infrastructure team triggered a capacity upgrade on our main API to improve latency and user experience for current and upcoming events. This change did not reduce latency as we had hoped.
The team continued to investigate using Swapcard’s internal profiling tools and found that the APIs were not receiving the spike of traffic observed on our gateway. We determined that our main gateway had not scaled in line with network traffic, creating a network bottleneck and increasing latency.
Swapcard monitoring detected the traffic disruption at 12:59 UTC and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.
Latency dropped as soon as the capacity upgrade on our main gateway was triggered. Swapcard Engineering then monitored all services to ensure full and proper recovery by 13:11 UTC.
At 13:11 UTC, Swapcard confirmed that the scaling was complete, latency had returned to pre-incident levels, and traffic was back to its pre-incident rate.
Time alerted to the outage: 1 minute
Time to identify the source of disruption: 5 minutes
Time to initiate recovery: 1 minute
Time to monitor and restore pre-incident capacity: 7 minutes
(12:58 UTC) | Event App latency increased
(12:59 UTC) | Swapcard automated monitoring alerted the infrastructure team
(13:02 UTC) | Swapcard Engineering pre-emptively scaled main API to lower latency
(13:05 UTC) | Swapcard Engineering found a network bottleneck on gateway using internal profiling tools
(13:06 UTC) | Swapcard Engineering triggered a capacity upgrade on main gateway
(13:08 UTC) | Event App latency returned to pre-incident levels while Swapcard Engineering monitored internal systems
(13:10 UTC) | No internal systems were affected, incident mitigated
(13:11 UTC) | Status post resolved
Affected customers may have been impacted to varying degrees, and possibly for a shorter duration than described above.
In line with our high standards for service reliability, Swapcard will conduct an internal audit of the autoscaling methods used by the main gateway. Automatic capacity upgrades were already in place, but today’s incident highlights the need for improvement.
We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.
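To illustrate the kind of request-rate-based autoscaling policy under review, here is a minimal sketch of the standard proportional scaling formula (the same shape used by the Kubernetes HorizontalPodAutoscaler). All names, thresholds, and limits are hypothetical and do not reflect Swapcard’s actual configuration:

```python
import math

def desired_replicas(current_replicas: int, current_rps: float,
                     target_rps_per_replica: float,
                     max_replicas: int = 20) -> int:
    """Compute the replica count needed to keep per-replica request
    rate at or below the target."""
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    # Ratio of observed load to what the current fleet can absorb.
    ratio = current_rps / (current_replicas * target_rps_per_replica)
    needed = math.ceil(current_replicas * ratio)
    # Clamp between a single replica and the configured ceiling.
    return max(1, min(needed, max_replicas))

# A spike from 600 to 1800 req/s triples the gateway fleet.
print(desired_replicas(current_replicas=3, current_rps=1800,
                       target_rps_per_replica=200))  # -> 9
```

Scaling on request rate rather than CPU alone is one way to avoid the failure mode seen here, where the gateway saturated on network traffic before its compute-based autoscaler reacted.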