Investigating Headers Timeout Error

Resolved · Degraded performance

Affected components

Event App (Sep 19, 2022, 07:50 PM to 08:07 PM)

Updates

Write-up published

Resolved

Please see our post-mortem below regarding the sporadic “Header Timeout” errors that occurred on Sep 19, 2022 from ~19:48 UTC to ~20:07 UTC.

Our goal in this post-mortem is to detail our initial assessment of the incident, as communicated on the Swapcard status page, and to describe the remediation actions we took to restore service.

Incident summary

On Monday, Sep 19 at ~19:48 UTC, we experienced sporadic “Header Timeout” errors on our event app and studio, caused by a memory leak and abrupt periodic restarts in one of our core internal services.

On Monday, Sep 19 at ~19:50 UTC, our infrastructure team was automatically alerted to an unusual number of “Header Timeout” errors in our logs and to reports of errors displayed to some users.

Swapcard monitoring detected the start of the disruption and activated the Swapcard Incident Response team. The team triaged and mitigated the incident by scaling the internal core service, which reduced the memory pressure on the running instances and spread the load across a larger number of instances than usual, lowering the probability of restarts. In parallel, the cause of the issue was investigated and short- and mid-term plans were put in place.
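The kind of automatic alert described above can be illustrated with a sliding-window error counter. This is a minimal sketch and not Swapcard's actual monitoring: the class name ErrorRateAlert, the threshold and the window length are all hypothetical values chosen for illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when more than `threshold` errors fall inside `window_seconds`.

    Hypothetical helper for illustration only; names and values are assumptions.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one error at `timestamp` (seconds); return True if the alert fires."""
        self._timestamps.append(timestamp)
        # Discard errors that are older than the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# Example: more than 3 errors within 60 seconds triggers the alert.
alert = ErrorRateAlert(threshold=3, window_seconds=60)
fired = [alert.record(t) for t in (0, 10, 20, 30, 100)]
print(fired)
```

In this example the fourth error (at 30 s) pushes the count in the window above the threshold, and the isolated error at 100 s does not fire because the earlier errors have aged out of the window.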

Mitigation deployment

At ~19:55 UTC, our infrastructure team manually scaled the internal core service to reduce the memory pressure. The scaling process took ~7 minutes. Error reports stopped as the scaling propagated through our infrastructure. The Swapcard engineering team then monitored application endpoint logs to ensure a full and proper recovery.

As this change was deployed, customers saw a reduction in the sporadic error message.
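The reasoning behind this mitigation can be sketched with a simple model: if memory leaks roughly in proportion to the requests an instance handles, spreading the same load across more instances slows the leak on each one and delays its restart. The function name and every number below (memory limit, baseline, leak per request, request rate, instance counts) are hypothetical and are not taken from the incident.

```python
def minutes_until_restart(
    memory_limit_mb: float,
    baseline_mb: float,
    leak_mb_per_request: float,
    total_requests_per_minute: float,
    replicas: int,
) -> float:
    """Estimate how long one instance runs before reaching its memory limit.

    Assumes load is spread evenly and memory leaks linearly per request.
    All inputs are illustrative assumptions, not measured values.
    """
    per_replica_rate = total_requests_per_minute / replicas
    leak_per_minute = leak_mb_per_request * per_replica_rate
    headroom_mb = memory_limit_mb - baseline_mb
    return headroom_mb / leak_per_minute


# Same total load, before and after doubling the number of instances.
before = minutes_until_restart(1024, 256, 0.25, 1200, replicas=4)
after = minutes_until_restart(1024, 256, 0.25, 1200, replicas=8)
print(f"before scaling: {before:.2f} min, after scaling: {after:.2f} min")
```

Under these assumptions, doubling the number of instances doubles the estimated time before any single instance restarts. This buys time but does not remove the leak, which is why a permanent fix is still needed.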

At ~20:02 UTC, Swapcard confirmed that the update was complete and that no further errors were detected or reported.

While the Swapcard Incident Response team mitigated the incident, Swapcard’s Engineering team identified the root cause and worked on a proper short- and mid-term mitigation plan.

Event Outline

Duration Summary

Time to be alerted to the issue: ~2 minutes

Time to identify the source of disruption: ~5 minutes

Time to initiate recovery: ~5 minutes

Time to monitor and restore service to its pre-crash state: ~5 minutes

Events of 2022 Sep 19 (UTC)

(19:48 UTC) | Initial onset of the header timeout error rate increase

(19:50 UTC) | Disruption identified by Swapcard monitoring

(19:50 UTC) | Swapcard status post activated

(20:02 UTC) | Incident mitigated

(20:07 UTC) | Status post resolved

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.
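The intervals implied by the timeline can be checked with a few lines of arithmetic. This is a minimal sketch: the timestamps are the approximate values reported in this post-mortem (the 19:55 mitigation start comes from the Mitigation deployment section), and the dictionary keys and function name are labels chosen here for illustration.

```python
from datetime import datetime

# Approximate timestamps from this post-mortem (all UTC, 2022 Sep 19).
TIMELINE = {
    "onset": "19:48",
    "detected": "19:50",
    "mitigation_started": "19:55",  # from the Mitigation deployment section
    "mitigated": "20:02",
    "resolved": "20:07",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["onset"], TIMELINE["detected"])
time_to_start_mitigation = minutes_between(TIMELINE["detected"], TIMELINE["mitigation_started"])
scaling_duration = minutes_between(TIMELINE["mitigation_started"], TIMELINE["mitigated"])
total_duration = minutes_between(TIMELINE["onset"], TIMELINE["resolved"])

print(time_to_detect, time_to_start_mitigation, scaling_duration, total_duration)
```

These values give ~2 minutes to detection, ~5 minutes from detection to the start of mitigation, ~7 minutes of scaling, and ~19 minutes from onset to the resolved status post.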

Forward Planning

Swapcard has deployed a permanent mitigation for this incident, in line with our high standards of deliverability.

Mon, Sep 26, 2022, 01:58 PM

Resolved

This incident has been resolved.

Mon, Sep 19, 2022, 08:07 PM (6 days earlier)

Monitoring

A fix has been implemented and we are monitoring the results.

Mon, Sep 19, 2022, 08:02 PM

Investigating

We are currently investigating this issue.

Mon, Sep 19, 2022, 07:50 PM (12 minutes earlier)