This post-mortem covers the service disruption that affected Swapcard Login on June 19th, 2026. The disruption was caused by a misconfiguration that surfaced during an upgrade of our cache cluster and led to temporary login failures. Our goal is to share the findings of our assessment and the steps taken to resolve the issue, in the interest of transparency toward our customers.
Summary
As part of a planned upgrade to increase the capacity of our cache cluster, we carried out a zero-downtime rollout, that is, an upgrade designed to require no scheduled maintenance window. During the rollout, a misconfiguration that had not been identified in our pre-production environment affected the login service in production. Sign-in failed for approximately 10 minutes, from 16:38 to 16:48 UTC.
Timeline (UTC)
[16:35] — Migration of production services to the upgraded cache cluster proceeds as part of the zero-downtime rollout.
[16:38] — The login service begins failing to connect to the cache; sign-in errors start.
[16:42] — Incident detected and investigation begins.
[16:45] — Root cause identified; corrected configuration deployed to the login service.
[16:48] — Login fully recovers; error rates return to normal.
Mitigation
We updated the login service to connect to the upgraded cache cluster using the correct configuration, restoring sign-in functionality. As a precaution, we reviewed all other production services and corrected one additional service that carried the same latent misconfiguration before it could cause further impact.
Root cause
During a planned upgrade to increase the capacity of our cache cluster, a misconfiguration was introduced that left the login service unable to connect to the upgraded cluster. The issue was not identified in our pre-production environment and surfaced in production during the rollout, causing sign-in to fail until the configuration was corrected.
Next steps
We are adding automated checks to (1) detect any reference to a cache cluster scheduled for decommission and (2) verify configuration parity between our pre-production and production environments, so this class of issue is caught before reaching production in future upgrades.
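For illustration, the two checks described above could take roughly the following shape. This is a minimal sketch in Python, not our actual tooling. It assumes each service's configuration can be loaded as a flat key/value mapping, and the function names, the CACHE_ENDPOINT key, and the hostnames are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a service's configuration is a flat mapping of string keys to string values.
Config = Dict[str, str]


def find_decommissioned_references(
    service_configs: Dict[str, Config],
    decommissioned_hosts: Set[str],
) -> List[Tuple[str, str]]:
    """Check (1): return (service, key) pairs whose value mentions a host
    that belongs to a cache cluster scheduled for decommission."""
    hits = []
    for service, config in sorted(service_configs.items()):
        for key, value in sorted(config.items()):
            if any(host in value for host in decommissioned_hosts):
                hits.append((service, key))
    return hits


def find_parity_gaps(
    preprod: Config,
    prod: Config,
    env_specific_keys: Set[str],
) -> List[str]:
    """Check (2): return keys that exist in only one environment, or whose
    values differ without being explicitly allowed to differ."""
    gaps = []
    for key in sorted(set(preprod) | set(prod)):
        if key not in preprod or key not in prod:
            gaps.append(key)
        elif preprod[key] != prod[key] and key not in env_specific_keys:
            gaps.append(key)
    return gaps


if __name__ == "__main__":
    # Hypothetical production configuration for two services.
    production = {
        "login": {"CACHE_ENDPOINT": "cache-old.example.internal:6379"},
        "search": {"CACHE_ENDPOINT": "cache-new.example.internal:6379"},
    }
    stale = find_decommissioned_references(
        production, {"cache-old.example.internal"}
    )
    for service, key in stale:
        print(f"{service}: {key} still points at a decommissioned cluster")
```

Run as a gate in the deployment pipeline, a non-empty result from either function would block the rollout. The explicit list of environment-specific keys keeps the parity check from flagging values that are expected to differ between pre-production and production, such as endpoints.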