Event app & Studio are unreachable

Incident Report for Swapcard

Postmortem

Please see our post-mortem below regarding a service outage that affected Swapcard customers on Feb 2nd, 2023, from 11:32 UTC to 12:21 UTC.

Impacted services:

Event App
Studio App
Exhibitor Center
Developer API

It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

Incident summary

On Thursday, Feb 2 at ~11:32 UTC, we experienced a major outage on our event app & studio, exhibitor center and developer API due to a high number of unfinished, stacking database sessions on our main core databases (master & replicas).

On Thursday, Feb 2 at ~11:33 UTC, our infrastructure team was automatically alerted to a high number of database sessions on our main core databases (master & replicas). The number of sessions increased over time on one database replica, then started to propagate to the other replicas that are part of our Multi-AZ deployment for high availability.

Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. The team worked to triage and mitigate the incident according to our internal documentation by redirecting database requests to the other replicas, which are in place in case of a major disruption on one of the database nodes. Unfortunately, this action did not restore service as the Swapcard Incident Response team had initially expected. As explained earlier in this post-mortem, the issue was also propagating to our Multi-AZ nodes, which were in place to mitigate such incidents.

In parallel, the cause of the issue was investigated so that short-term plans could be put in place.

The second mitigation attempt led to a global database restart, which lengthened the resolution time. This plan was not initially considered because we knew that it could extend the resolution time.

Mitigation deployment

At 11:50 UTC, Swapcard’s Engineering team, which was investigating the initial onset of the incident, identified its initial cause: a difference in minor version between our master and replica databases, combined with a specific database operation, caused a table lock that led database sessions to increase and stack. The lock occurred on a database table with a high volume of requests per second; this table is used for rendering a major part of the content.

The specific operation was an internal database operation that was not initially available for manual termination, forcing a restart to ensure the lock was released.
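As an illustration of the kind of check that could surface a minor-version difference between database nodes, here is a minimal Python sketch. It is not Swapcard's tooling: the node names and version strings are made up for the example, and how the versions are collected from each node is left out because the post-mortem does not name the database engine.

```python
from collections import defaultdict


def parse_version(version):
    """Turn a dotted version string such as '13.7' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def group_by_major_minor(node_versions):
    """Group node names by their (major, minor) version pair."""
    groups = defaultdict(list)
    for node, version in node_versions.items():
        parsed = parse_version(version)
        major = parsed[0]
        minor = parsed[1] if len(parsed) > 1 else 0
        groups[(major, minor)].append(node)
    return dict(groups)


def has_version_drift(node_versions):
    """Return True when the nodes do not all share the same major.minor version."""
    return len(group_by_major_minor(node_versions)) > 1


# Hypothetical inventory: names and versions are assumptions for illustration only.
example_nodes = {
    "master": "13.7",
    "replica-1": "13.7",
    "replica-2": "13.8",
}

if has_version_drift(example_nodes):
    for version, nodes in sorted(group_by_major_minor(example_nodes).items()):
        print(".".join(str(n) for n in version), "->", ", ".join(nodes))
```

A check like this could run on a schedule or before maintenance operations, raising an alert whenever more than one version group is reported.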

By 12:19 UTC, the service outage ended as the update propagated through the databases. Swapcard Engineering then monitored all services to confirm full and proper recovery by 12:21 UTC.

At 12:21 UTC, Swapcard confirmed that the update was completed and that capacity was restored to pre-incident levels, with traffic back to its pre-incident rate.

Event Outline

Duration Summary

Time alerted to the outage: 1 minute

Time to identify the source of disruption: 1 minute

Time to initiate recovery (1st attempt): 7 minutes

Time to initiate recovery (2nd attempt): 35 minutes

Time to monitor and restore pre-crash capacity: 5 minutes

Events of 2023 Feb 2 (UTC)

(11:32 UTC) | Initial onset of core database outage

(11:32 UTC) | Service outage identified by Swapcard monitoring

(11:32 UTC) | Swapcard status post is activated

(11:33 UTC) | Swapcard Engineering identified a high number of database sessions

(11:39 UTC) | 1st mitigation attempt

(11:50 UTC) | Swapcard Engineering identified the initial cause of the issue

(11:50 UTC) | 2nd mitigation attempt

(12:19 UTC) | Majority of services recovered, additional mitigation measures taken

(12:21 UTC) | Incident mitigated, pre-incident capacity restored

(12:21 UTC) | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.
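For readers who want to work with the timeline programmatically, the following Python sketch computes the minutes elapsed since the initial onset for the main timeline entries. The timestamps are taken from the event list above and the labels are shortened. The function names are illustrative, and the elapsed values are measured from the onset, which is not necessarily how the figures in the Duration Summary were derived.

```python
from datetime import datetime

# (time in UTC, shortened label) pairs copied from the events of 2023 Feb 2.
TIMELINE = [
    ("11:32", "Initial onset of core database outage"),
    ("11:33", "High number of database sessions identified"),
    ("11:39", "1st mitigation attempt"),
    ("11:50", "Initial cause identified, 2nd mitigation attempt"),
    ("12:19", "Majority of services recovered"),
    ("12:21", "Incident mitigated, pre-incident capacity restored"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day 'HH:MM' timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_onset(timeline):
    """Return (minutes since the first entry, label) for every entry."""
    onset = timeline[0][0]
    return [(minutes_between(onset, time), label) for time, label in timeline]


for minutes, label in elapsed_since_onset(TIMELINE):
    print(f"+{minutes:>2} min | {label}")

# Total outage window, 11:32 UTC to 12:21 UTC: 49 minutes.
print("Total:", minutes_between(TIMELINE[0][0], TIMELINE[-1][0]), "minutes")
```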

Forward Planning

Swapcard has deployed a permanent fix for this incident and will implement technical measures to ensure that internal database operations of this kind are identified earlier, in addition to adding a procedure to prevent propagation to the Multi-AZ nodes.

In accordance with our high standards for deliverability, Swapcard will conduct an internal audit of the database configuration, versions and internal procedures to prevent similar incidents. Procedures and controls are already in place, but today’s incident highlights the need for improvement.

We consider the likelihood of a recurrence of this issue to be extremely low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.

Posted Feb 02, 2023 - 14:39 UTC

Resolved

This incident has been resolved.

Posted Feb 02, 2023 - 12:21 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 02, 2023 - 12:20 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 02, 2023 - 11:50 UTC

Investigating

We are currently investigating this issue.

Posted Feb 02, 2023 - 11:32 UTC

This incident affected: Event App.