Event app is unreachable in some regions

Incident Report for Swapcard

Postmortem

Please see our post-mortem below regarding a service delivery disruption that affected Swapcard customers on Thursday, February 9th, 2023, from 16:20 UTC to 16:35 UTC.

It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

Incident summary

On Thursday, February 9th at 16:20 UTC, we experienced an outage due to a capacity upgrade on our cache system, which caused issues in the delivery of the event app across several regions.

On Thursday, February 9th at 16:20 UTC, the infrastructure team triggered a capacity upgrade on our caching system to improve latency and user experience for current and upcoming events. This action is commonly executed and is part of a recurring task. The same change had been run against our production account by our infrastructure-as-code tool in dry-run mode in the preceding days to ensure it would not impact the production environment. When applied, however, the upgrade caused a significant disruption of traffic and an outage of the Event app. The change was applied in accordance with Swapcard's standard infrastructure and security change and enhancement practices.

We are currently investigating our cache system configuration to find the difference between the previous dry run and today's failed upgrade.

Swapcard monitoring detected the traffic disruption at 16:20 UTC and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and restore services to alleviate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.

Mitigation deployment

The traffic disruption stopped as soon as the cache capacity upgrade was reverted. Swapcard engineering then monitored all services to ensure full and proper recovery by 16:35 UTC.

Once the cache system was restored, customers would have seen a reduction in errors.

At 16:35 UTC, Swapcard confirmed that the revert was complete, that capacity was restored to pre-incident levels, and that traffic was back to the pre-incident rate.

Event Outline

Duration Summary

Time alerted to the outage: 1 minute

Time to identify the source of disruption: 1 minute

Time to initiate recovery: 5 minutes

Time to monitor and restore pre-incident capacity: 8 minutes

Events of 2023 Feb 9 (UTC)

(16:20 UTC) | Initial onset of cache disruption

(16:20 UTC) | Global cache disruption identified by Swapcard monitoring

(16:21 UTC) | Swapcard Engineering identified the capacity upgrade as the cause

(16:32 UTC) | Impacted services began to recover

(16:33 UTC) | Majority of services recovered, additional mitigation measures taken

(16:35 UTC) | Incident mitigated, pre-incident capacity restored

(16:44 UTC) | Status page post marked as resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Forward Planning

In accordance with our high standards for deliverability, Swapcard will conduct an internal audit of the cache upgrade timeline and procedure. Procedures and controls are already in place, but today's incident highlights the need for improvement.

We consider the likelihood of a recurrence of this issue to be extremely low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.

Posted Feb 09, 2023 - 17:05 UTC

Resolved

A fix has been implemented and we are monitoring the results.

This incident has been resolved.

Posted Feb 09, 2023 - 16:44 UTC