Event App increased response time

Incident Report for Swapcard

Postmortem

Please see our post-mortem below regarding a service disruption that affected Swapcard customers from March 29th, 2023 at 12:58 UTC through to 13:08 UTC.

It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

Incident summary

On March 29th, 2023 at 12:58 UTC, we experienced an increase in latency on the Event App due to an excessive number of requests on our API gateway.

On March 29th, 2023 at 13:02 UTC, the infrastructure team triggered a capacity upgrade on our main API to improve latency and user experience for current and upcoming events. This change did not reduce latency as we had hoped.
The team continued to investigate using Swapcard’s internal profiling tools and found that the APIs were not receiving the traffic spike observed on our gateway. We determined that our main gateway had not scaled in line with network traffic, causing a network bottleneck and increased latency.
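
For illustration only, the failure mode described above can be sketched as a scaling rule that looks at more than one signal. This is not Swapcard's actual autoscaling configuration: the `GatewayMetrics` type, the `desired_replicas` function, the 60% targets and the replica cap are all hypothetical. The sketch applies a simple proportional rule to each metric and scales to whichever one asks for the most capacity, so a gateway that is saturated on network but idle on CPU still scales out.

```python
import math
from dataclasses import dataclass


@dataclass
class GatewayMetrics:
    """Hypothetical per-replica averages, expressed as whole percentages."""
    cpu_percent: int      # CPU utilization, 0-100+
    network_percent: int  # share of per-replica bandwidth in use, 0-100+


def desired_replicas(current: int,
                     metrics: GatewayMetrics,
                     cpu_target: int = 60,      # assumed target, not from the report
                     network_target: int = 60,  # assumed target, not from the report
                     max_replicas: int = 20) -> int:
    """Return the replica count needed to bring every metric back to its target.

    Each metric is evaluated with a proportional rule:
        wanted = ceil(current * observed / target)
    and the largest result wins, so the tightest resource drives scaling.
    """
    wanted_for_cpu = math.ceil(current * metrics.cpu_percent / cpu_target)
    wanted_for_network = math.ceil(current * metrics.network_percent / network_target)
    wanted = max(wanted_for_cpu, wanted_for_network)
    # Clamp to a sane range: never below one replica, never above the cap.
    return max(1, min(max_replicas, wanted))


if __name__ == "__main__":
    # Low CPU but saturated network: a CPU-only rule would scale in to 2 replicas,
    # while the combined rule scales out to 6.
    saturated = GatewayMetrics(cpu_percent=30, network_percent=90)
    print(desired_replicas(4, saturated))
```

With these assumed numbers, four replicas at 30% CPU and 90% network utilization yield a target of six replicas, whereas a rule that watched CPU alone would have asked for two.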

Swapcard monitoring detected the traffic disruption at 12:59 UTC and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

Mitigation deployment

Latency dropped at the same time as the capacity upgrade on our main gateway was triggered. Swapcard Engineering then monitored all services to ensure full and proper recovery by 13:11 UTC.

At 13:11 UTC, Swapcard confirmed that the scaling was complete, that latency had been restored to pre-incident levels, and that traffic was back to the pre-incident rate.

Event Outline

Duration Summary

Time alerted to the outage: 1 minute

Time to identify the source of disruption: 5 minutes

Time to initiate recovery: 1 minute

Time to monitor and restore pre-incident capacity: 7 minutes

Events of 2023 Mar 29 (UTC)

(12:58 UTC) | Event App latency increased

(12:59 UTC) | Swapcard automated monitoring alerted the infrastructure team

(13:02 UTC) | Swapcard Engineering pre-emptively scaled main API to lower latency

(13:05 UTC) | Swapcard Engineering found a network bottleneck on gateway using internal profiling tools

(13:06 UTC) | Swapcard Engineering triggered a capacity upgrade on main gateway

(13:08 UTC) | Event App latency returned to pre-incident levels while Swapcard Engineering monitored internal systems

(13:10 UTC) | No internal systems were affected, incident mitigated

(13:11 UTC) | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.
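
As a reading aid, the elapsed time of each timeline entry relative to the start of the incident can be derived with a few lines of Python. This is a minimal sketch: the timestamps are copied from the timeline above, the labels are shortened paraphrases, and the names `TIMELINE` and `minutes_since_start` are hypothetical.

```python
from datetime import datetime

# Timestamps (UTC, 2023-03-29) taken from the timeline; labels are abbreviated.
TIMELINE = [
    ("12:58", "Event App latency increased"),
    ("12:59", "Monitoring alerted the infrastructure team"),
    ("13:02", "Main API scaled"),
    ("13:05", "Network bottleneck found on gateway"),
    ("13:06", "Capacity upgrade triggered on main gateway"),
    ("13:08", "Event App latency back to pre-incident levels"),
    ("13:10", "Incident mitigated"),
    ("13:11", "Status post resolved"),
]


def minutes_since_start(timeline):
    """Return (minutes elapsed since the first entry, label) for every entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for stamp, label in timeline:
        elapsed = datetime.strptime(stamp, "%H:%M") - start
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


if __name__ == "__main__":
    for minutes, label in minutes_since_start(TIMELINE):
        print(f"+{minutes:>2} min  {label}")
```

All entries fall on the same day, so parsing the time of day alone is sufficient here; a timeline that crossed midnight would need full dates.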

Forward Planning

In keeping with our high standards for deliverability, Swapcard will conduct an internal audit of the autoscaling methods used by the main gateway. Automatic capacity upgrades were already in place, but today’s incident highlights the need for improvement.

We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.

Posted Mar 29, 2023 - 14:02 UTC

Resolved

This incident has been resolved.

Posted Mar 29, 2023 - 13:11 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 29, 2023 - 13:08 UTC

Investigating

On Wednesday, March 29th at 12:58 UTC, we experienced an increased response time on our event app.

Posted Mar 29, 2023 - 13:05 UTC

This incident affected: Event App.