Please see our post-mortem below regarding the sporadic “Header Timeout” errors that occurred on Sep 19, 2022 from ~19:48 UTC to ~20:07 UTC.
Our goal in this post-mortem is to provide details on our initial assessment of the incident, as communicated on the Swapcard status page, and to describe the remediation actions we took to restore service.
On Monday, Sep 19 at ~19:48 UTC, we experienced sporadic “Header Timeout” errors on our event app & studio due to a memory leak and abrupt periodic restarts of one of our core internal services.
On Monday, Sep 19 at ~19:50 UTC, our infrastructure team was automatically alerted to an unusual number of “Header Timeout” errors in our logs and to reports from some users of errors being displayed.
Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. The team worked to triage and mitigate the incident by scaling the core internal service, which reduced the memory pressure on the running instances and spread the load across a larger number of instances than usual, lowering the probability of restarts. In parallel, the cause of the issue was investigated and short- and mid-term plans were put in place.
At ~19:55 UTC, our infrastructure team manually scaled the core internal service to reduce the memory pressure. The scaling process took ~7 minutes. Error reports stopped as the scaling propagated through our infrastructure. The Swapcard engineering team then monitored application endpoint logs to ensure a full and proper recovery.
As that change was deployed, customers saw a reduction in the sporadic error messages.
At ~20:02 UTC, Swapcard confirmed that the update was complete and that no further errors were detected or reported.
Swapcard’s Engineering team identified the root cause and worked on a proper short- and mid-term mitigation plan while the Swapcard Incident Response team was mitigating the incident.
Time to be alerted to the issue: ~2 minutes
Time to identify the source of disruption: ~5 minutes
Time to initiate recovery: ~5 minutes
Time to monitor and restore service to pre-crash levels: ~5 minutes
(19:48 UTC) | Initial onset of the header timeout error rate increase
(19:50 UTC) | Disruption identified by Swapcard monitoring
(19:50 UTC) | Swapcard status post activated
(20:02 UTC) | Incident mitigated
(20:07 UTC) | Status post resolved
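The elapsed time between the timeline entries above can be checked with a short script. This is only a minimal sketch: the timestamps are copied from the timeline, while the names `TIMELINE`, `minutes_between` and `elapsed_since_onset` are hypothetical helpers introduced for illustration, and all times are approximate as reported.

```python
from datetime import datetime

# Timestamps (UTC, Sep 19, 2022) copied from the timeline above.
# All values are approximate, as stated in the post-mortem.
TIMELINE = [
    ("19:48", "Initial onset of the header timeout error rate increase"),
    ("19:50", "Disruption identified by Swapcard monitoring"),
    ("20:02", "Incident mitigated"),
    ("20:07", "Status post resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_onset(timeline):
    """Pair each timeline label with the minutes elapsed since the first entry."""
    onset = timeline[0][0]
    return [(label, minutes_between(onset, ts)) for ts, label in timeline]


if __name__ == "__main__":
    for label, minutes in elapsed_since_onset(TIMELINE):
        print(f"+{minutes:>2} min  {label}")
```

Measured this way, detection comes ~2 minutes after onset, mitigation ~14 minutes after onset, and the status post is resolved ~19 minutes after onset.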
Affected customers may have been impacted to varying degrees and for a shorter duration than described above.
Swapcard has deployed a permanent mitigation for this incident, in accordance with our high standards of deliverability.