Latency increase

Incident Report for Swapcard

Postmortem

Please see our post-mortem below regarding the increased response time on our chat endpoint on September 2nd, 2021, from 15:38 UTC to 15:41 UTC.

It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

Incident summary

On Thursday, September 2nd at 15:38 UTC, we experienced increased response times on our event app because one of our database replica servers was under pressure from a specific query.

On Thursday, September 2nd at 15:37 UTC, our infrastructure team was automatically alerted to increased response times on our event app: requests were taking, on average, twice as long to respond as usual.
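The alert condition described above (average response time reaching twice the usual average) can be illustrated with a minimal sketch. This is not Swapcard's actual monitoring code; the function name `should_alert`, the `factor` parameter, and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def should_alert(recent_ms, baseline_ms, factor=2.0):
    """Return True when the recent average response time is at least
    `factor` times the usual (baseline) average.

    All names and thresholds here are illustrative assumptions.
    """
    if not recent_ms:
        # No samples in the window: nothing to compare against.
        return False
    return mean(recent_ms) >= factor * baseline_ms


# Hypothetical samples, in milliseconds, against a 200 ms baseline.
print(should_alert([400, 420, 380], baseline_ms=200))
print(should_alert([210, 190], baseline_ms=200))
```

A real alerting rule would typically also require the condition to hold for a minimum duration to avoid firing on a single slow request.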

Swapcard monitoring detected the start of disruption at ~15:37 UTC and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and mitigate the responsible database queries to alleviate customer impact. In parallel, the cause of the issue was investigated and short-, mid-, and long-term plans were put in place.

Mitigation deployment

At 15:39 UTC, our monitoring system automatically turned off the database replica that was under pressure, to mitigate the impact on our services and restore a normal event app response time. The turn-off process took ~15 seconds. The increase in event app response time stopped as the configuration change propagated through our endpoints. The Swapcard engineering team then monitored the event app endpoints to ensure full and proper recovery.

As a result of that change, customers would then have seen reduced chat loading times and fewer error messages.

At 15:43 UTC, Swapcard confirmed that the update was complete, that response time had been restored to pre-incident levels, and that traffic was back to its pre-incident rate.

Swapcard’s Engineering team identified and resolved the root cause, and by ~15:50 UTC had re-enabled the impacted database replica.
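The mitigation described in this section (taking an overloaded replica out of rotation, then re-enabling it once the cause is fixed) can be sketched as follows. This is a simplified illustration under stated assumptions, not the system Swapcard runs: the `ReplicaPool` class, the `max_load` threshold, and the replica names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaPool:
    """Hypothetical pool of read replicas that can be taken out of rotation."""

    max_load: float = 0.9  # assumed load threshold, for illustration only
    active: set = field(default_factory=set)
    disabled: set = field(default_factory=set)

    def report_load(self, name, load):
        """Disable a replica whose load exceeds the threshold.

        The last active replica is never disabled, so reads always
        have somewhere to go.
        """
        overloaded = load > self.max_load
        if name in self.active and overloaded and len(self.active) > 1:
            self.active.remove(name)
            self.disabled.add(name)

    def re_enable(self, name):
        """Put a previously disabled replica back into rotation."""
        if name in self.disabled:
            self.disabled.remove(name)
            self.active.add(name)


pool = ReplicaPool(active={"replica-1", "replica-2"})
pool.report_load("replica-2", 0.97)  # overloaded: removed from rotation
pool.re_enable("replica-2")          # root cause fixed: back in rotation
```

Keeping at least one replica active is a design choice in this sketch: shedding every replica at once would move all read traffic elsewhere and could worsen the incident.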

Event Outline

Duration Summary

Time alerted to the outage: 1 minute

Time to identify the source of disruption: ~2 minutes

Time to initiate recovery: ~15 seconds

Time to monitor and restore pre-incident response time: ~2 minutes

Events of September 2nd, 2021 (UTC)

(15:37 UTC) | Initial onset of the response time increase

(15:38 UTC) | Disruption identified by Swapcard monitoring

(15:38 UTC) | Swapcard status post activated

(15:39 UTC) | Automatic monitoring system mitigated the issue

(15:40 UTC) | Impacted event app began to recover

(15:41 UTC) | Incident mitigated, pre-incident response time restored

(15:43 UTC) | Status post resolved

(15:50 UTC) | Swapcard Engineering identified the issue

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Forward Planning

Swapcard has deployed a permanent fix for this incident, in accordance with our high standards of deliverability.

Posted Sep 07, 2021 - 15:42 UTC

Resolved

This incident has been resolved.

Posted Sep 02, 2021 - 15:43 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 02, 2021 - 15:40 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 02, 2021 - 15:38 UTC

This incident affected: Event App.