We would like to provide you with a post-mortem report concerning a service delivery disruption that impacted Swapcard customers on Monday, the 23rd of October, 2023, from 09:23 UTC to 09:28 UTC.
The purpose of this post-mortem is to offer insights into our initial evaluation of the incident as communicated on the Swapcard status page and to outline the remedial measures we have implemented to restore service.
On Monday, October 23rd, at 09:23 UTC, a service disruption impacted the Swapcard apps. The disruption resulted from scheduled, routine maintenance on one of our caching clusters, which unexpectedly caused queries to fail. These query failures were associated with a recently introduced caching script. Although the service is designed to remain fault tolerant when the caching system is disrupted, after the script was introduced the caching client no longer timed out on commands as it had been configured to do.
Events of October 23rd, 2023 (UTC):
Please note that the events are presented in chronological order with their respective timestamps for clarity and coherence.
The disruption was traced back to a maintenance operation on a caching system cluster. This operation is common and has been performed multiple times without issues, but on this occasion it caused an unusual interruption. The cause was a new caching script, introduced in a previous release, that prevented proper failover of the system. This resulted in service delivery issues and a disruption of service for Swapcard customers.
Upon identifying the root cause, we implemented a mitigation strategy to prevent further service disruptions. The caching client has been modified to ensure that queries do not fail during routine maintenance and that requests are properly ejected and served by the main system if the caching mechanism is unavailable.
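For readers who want a concrete picture of this failover behaviour, the following is a minimal illustrative sketch and not Swapcard's actual implementation. It assumes an asynchronous cache lookup that can hang or fail during maintenance; the names `get_with_fallback`, `cache_get`, `source_get` and the timeout value are hypothetical and chosen only for illustration.

```python
import asyncio

# Hypothetical timeout; a real value would depend on the system's latency budget.
CACHE_TIMEOUT_SECONDS = 0.05


async def get_with_fallback(key, cache_get, source_get, timeout=CACHE_TIMEOUT_SECONDS):
    """Read from the cache, but never let a cache problem fail the request.

    cache_get and source_get are hypothetical async callables taking a key.
    """
    try:
        # Bound the cache call so a hung command cannot block the query.
        cached = await asyncio.wait_for(cache_get(key), timeout=timeout)
        if cached is not None:
            return cached
    except (asyncio.TimeoutError, OSError):
        # Cache timed out or is unreachable (ConnectionError is an OSError):
        # fall through and serve the request from the main system.
        pass
    return await source_get(key)
```

The point of the sketch is that the timeout is enforced around every cache command, so a command that never returns is treated the same way as an unreachable cache.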
In accordance with our commitment to maintaining high standards in service deliverability, Swapcard has taken several measures to prevent similar incidents in the future. These include a comprehensive review of our caching failover mechanism during zero-downtime maintenance. Procedures and controls are already in place, and this incident has underscored the importance of continuous improvement in our service delivery processes.
We apologize for any inconvenience this disruption may have caused and thank you for your understanding and continued support.