Please see our post-mortem below regarding the increased response time on our chat endpoint on September 2nd, 2021, from 15:38 UTC to 15:41 UTC.
Our goal in this post-mortem is to provide details on our initial assessment of the incident, as communicated on the Swapcard status page, and to describe the remediation actions we took to restore service.
On Thursday, September 2nd at 15:38 UTC, we experienced increased response times on our event app because one of our database replica servers was under pressure from a specific query.
On Thursday, September 2nd at 15:37 UTC, our infrastructure team was automatically alerted to increased response times on our event app: requests were taking, on average, twice as long to respond as usual.
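To illustrate the kind of check that can raise such an alert, here is a minimal sketch that compares a recent average response time against a baseline average. The function name `is_degraded`, the `factor` parameter, and the sample values are hypothetical and are not taken from Swapcard's monitoring system.

```python
from statistics import mean


def is_degraded(recent_ms, baseline_ms, factor=2.0):
    """Return True when the recent average response time is at least
    `factor` times the baseline average (hypothetical alert rule)."""
    if not recent_ms or not baseline_ms:
        # Without samples on both sides there is nothing to compare.
        return False
    return mean(recent_ms) >= factor * mean(baseline_ms)


# Hypothetical samples in milliseconds, for illustration only.
baseline = [100, 110, 90, 100]
incident = [210, 190, 220, 200]

print(is_degraded(incident, baseline))  # True: 205 ms vs. a 100 ms baseline
print(is_degraded(baseline, baseline))  # False
```

A real alerting rule would typically also require the condition to hold for a minimum duration to avoid firing on short spikes.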
Swapcard monitoring detected the start of the disruption at ~15:37 UTC and activated the Swapcard Incident Response team. The team worked to triage and mitigate the responsible database queries to alleviate customer impact. In parallel, the cause of the issue was investigated, and short-, mid-, and long-term plans were put in place.
At 15:39 UTC, our monitoring system automatically turned off the database replica that was under pressure, to mitigate the impact on our services and restore normal event app response times. The turn-off process took ~15s. The increase in event app response time stopped as the updated database configuration propagated through our endpoints. The Swapcard engineering team then monitored the event app endpoints to ensure a full and proper recovery.
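The automatic mitigation described above can be sketched as a pool of read replicas from which an overloaded replica is removed, then re-enabled later. This is only an illustration under assumptions: the `ReplicaPool` class, the `max_load` threshold, and the replica names are hypothetical and do not describe Swapcard's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaPool:
    """Hypothetical pool of read replicas, keyed by name with a load in [0, 1]."""
    replicas: dict
    max_load: float = 0.9  # assumed threshold, not from the post-mortem
    disabled: set = field(default_factory=set)

    def active(self):
        """Names of the replicas currently serving traffic."""
        return [name for name in self.replicas if name not in self.disabled]

    def report_load(self, name, load):
        """Record a load sample and take the replica out of rotation when it
        is overloaded, as long as at least one other replica stays active."""
        self.replicas[name] = load
        if load > self.max_load and len(self.active()) > 1:
            self.disabled.add(name)

    def enable(self, name):
        """Put a replica back into rotation once it is healthy again."""
        self.disabled.discard(name)


pool = ReplicaPool({"replica-a": 0.3, "replica-b": 0.4})
pool.report_load("replica-b", 0.97)  # replica-b is under pressure
print(pool.active())                 # ['replica-a']
pool.enable("replica-b")             # re-enabled after the fix
print(pool.active())                 # ['replica-a', 'replica-b']
```

Keeping at least one replica active is a deliberate choice in this sketch, so that an automated rule can never remove all read capacity at once.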
As a result of the deployment of that change, customers would then have seen shorter chat loading times and fewer error messages.
At 15:43 UTC, Swapcard confirmed that the update was complete, that response times had been restored to pre-incident levels, and that traffic was back to its pre-incident rate.
Swapcard’s Engineering team identified the cause and, by ~15:50 UTC, had re-enabled the impacted database replication. The root cause has also been resolved.
Time to be alerted to the outage: 1 minute
Time to identify the source of disruption: ~2 minutes
Time to initiate recovery: ~15 seconds
Time to monitor and restore pre-incident response time: ~2 minutes
(15:37 UTC) | Initial onset of the response time increase
(15:38 UTC) | Disruption identified by Swapcard monitoring
(15:38 UTC) | Swapcard status post activated
(15:39 UTC) | Automatic monitoring system mitigated the issue
(15:40 UTC) | Impacted event app began to recover
(15:41 UTC) | Incident mitigated, pre-incident response time restored
(15:43 UTC) | Status post resolved
(15:50 UTC) | Swapcard Engineering identified the issue
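As a small worked example, the elapsed time between entries in the timeline above can be computed as follows. The timestamps are copied from the timeline; the dictionary keys and the `minutes_between` helper are hypothetical names chosen for this sketch.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
TIMELINE = {
    "onset": "15:37",
    "identified": "15:38",
    "auto_mitigation": "15:39",
    "restored": "15:41",
    "status_resolved": "15:43",
}


def minutes_between(start, end):
    """Whole minutes elapsed between two named timeline entries."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("onset", "identified"))          # 1
print(minutes_between("auto_mitigation", "restored"))  # 2
print(minutes_between("onset", "restored"))            # 4
```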
Affected customers may have been impacted to varying degrees and for a shorter duration than described above.
Swapcard has deployed a permanent fix for this incident, in line with our high standards for deliverability.