Increased error rate on Event App, Studio & Exhibitor Center

Incident Report for Swapcard

Postmortem

We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Monday, the 23rd of October, 2023, from 09:23 UTC to 09:28 UTC.

The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.

Incident Overview

On Monday, October 23rd, at 09:23 UTC, a service disruption impacted the Swapcard apps. It was triggered by scheduled, routine maintenance on one of our caching clusters, which unexpectedly caused queries to fail. The failures were tied to a recently introduced caching script: once the script was in place, the caching client no longer timed out on commands as it had been configured to do during system disruptions. As a result, requests failed even though the service is designed to remain fault tolerant when the caching system is disrupted.

Incident Timeline

Events of October 23rd, 2023 (UTC):

09:23 UTC: The service disruption began, with queries failing or timing out across the Swapcard apps.
09:23 UTC: Swapcard monitoring detected a service disruption.
09:24 UTC: Swapcard Engineering identified the zero-downtime maintenance of the caching cluster as the root cause.
09:27 UTC: The incident was successfully mitigated, and the system regained its pre-incident capacity.
09:30 UTC: The status was confirmed as resolved post-incident.

Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.

Root Cause

The disruption was traced back to a maintenance operation on a caching system cluster. This operation is routine and has been performed many times without issue. This time, however, a new caching script introduced in a previous release prevented the system from failing over properly, which caused an unexpected interruption. This resulted in service delivery issues and a disruption of service for Swapcard customers.

Mitigation Deployment

Upon identifying the root cause, we implemented a mitigation to prevent further service disruptions. The caching client has been modified so that queries no longer fail during routine maintenance: when the caching mechanism is unavailable, requests are properly ejected from the cache path and served by the main system.
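
For readers who want a concrete picture of this kind of behavior, the following is a minimal illustrative sketch and not Swapcard's actual implementation. It assumes an asynchronous cache client and shows a cache lookup bounded by a per-command timeout, with a fallback to the main data source when the cache is slow or unreachable. The names get_with_fallback, cache_get, load_from_primary, and CACHE_TIMEOUT_SECONDS are hypothetical and chosen for illustration only.

```python
import asyncio

# Hypothetical per-command budget for cache calls (an assumption, not a real setting).
CACHE_TIMEOUT_SECONDS = 0.05


async def get_with_fallback(key, cache_get, load_from_primary,
                            timeout=CACHE_TIMEOUT_SECONDS):
    """Return the value for `key`, bypassing the cache if it is slow or failing."""
    try:
        # Bound every cache command so a stalled node cannot stall the request.
        cached = await asyncio.wait_for(cache_get(key), timeout)
        if cached is not None:
            return cached
    except (asyncio.TimeoutError, OSError):
        # Cache timed out or the connection failed (ConnectionError is an OSError):
        # fall through and serve the request from the main system.
        pass
    return await load_from_primary(key)


async def hanging_cache_get(key):
    # Simulates a cache node that never answers, e.g. during maintenance.
    await asyncio.sleep(3600)


async def healthy_cache_get(key):
    # Simulates a cache hit.
    return f"cached:{key}"


async def primary_load(key):
    # Stands in for the main data source.
    return f"primary:{key}"


def demo(cache_get=hanging_cache_get):
    return asyncio.run(get_with_fallback("event:42", cache_get, primary_load))


if __name__ == "__main__":
    print(demo())                   # cache hangs, value comes from the main system
    print(demo(healthy_cache_get))  # cache answers, value comes from the cache
```

The point of the sketch is that the timeout wraps every cache command, whatever its type. If one command path, such as a script call, is not covered by the timeout, it can block requests even when ordinary reads fail over correctly.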

Forward Planning

In accordance with our commitment to maintaining high standards in service deliverability, Swapcard has taken several measures to prevent similar incidents in the future. These include a comprehensive review of how our caching failover mechanism behaves during zero-downtime maintenance. Procedures and controls are already in place, and this incident has underscored the importance of continuous improvement in our service delivery processes.

We apologize for any inconvenience this disruption may have caused and thank you for your understanding and continued support.

Posted Oct 23, 2023 - 10:25 UTC

Resolved

This incident has been resolved.

Posted Oct 23, 2023 - 09:30 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 23, 2023 - 09:28 UTC

This incident affected: Event App, Studio, and Exhibitor Center.