504 gateway timeout error on the Developer API
Incident Report for Swapcard
Postmortem

This post-mortem covers a service disruption that impacted Swapcard customers on Wednesday, October 18th, 2023. During the incident, the Developer API returned intermittent 504 gateway timeout errors.

The purpose of this post-mortem is to share our initial assessment of the situation, as communicated on the Swapcard status page, and to outline the corrective actions we took to restore normal service.

Incident Overview

On Wednesday, October 18th, at approximately 4 PM UTC, we observed a surge in 504 gateway timeout errors on the Developer API. The issue affected external system integrations other than those provided by the Studio. The duration and severity of the impact may have varied between affected customers.

After a thorough investigation, we determined that the problem stemmed from a connectivity issue within our primary developer gateway. The resulting routing problems meant that only one-third of the HTTP requests made during that period reached the appropriate backend Developer APIs. Our Swapcard Response Team, in collaboration with other departments, identified and resolved the connectivity issue within approximately one hour of the initial report.

Mitigation and Resolution

The service interruption ended once network connectivity between the developer gateway and related backends was restored. Our Swapcard Incident Response team acted swiftly to mitigate the impact on our customers. The incident also highlighted areas where we can improve so that connectivity issues, network congestion, and related problems are diagnosed faster.

Future Planning

This incident has shown where our processes and controls can be improved. We already have established procedures in place, and we acknowledge the opportunity to improve them. Acting on these findings helps us continually strengthen the resilience of our systems and minimize the potential for disruptions.

Posted Oct 23, 2023 - 09:53 UTC

Resolved
This incident has been resolved.
Posted Oct 18, 2023 - 16:00 UTC