Delay on messaging, notifications and simulated live stream processing
Incident Report for Swapcard
Postmortem

Please see our post-mortem below regarding a service delivery disruption that affected Swapcard customers from Tuesday, June 28th 2023 at 20:39 UTC through to 23:28 UTC.

It is our goal in this post-mortem to provide details on our initial assessment of the incident and to describe the remediation actions that we have taken to restore service.

Incident summary

On Tuesday, June 28th at 20:39 UTC, we experienced an outage caused by an internal queue messaging system that is used by several Swapcard core services, such as messaging, notifications and live stream processing.

On Tuesday, June 28th at 20:45 UTC, the automated monitoring system triggered an on-call response from the Incident Response team.

On Tuesday, June 28th at 20:47 UTC, the alarm was acknowledged by the Incident Response team.

On Tuesday, June 28th at 21:06 UTC, the Incident Response team identified the issue in the internal messaging system: a message queue was full and was degrading performance for the other queues in the same system.

On Tuesday, June 28th at 21:32 UTC, the Incident Response team triggered a capacity upgrade of the affected service, which is responsible for consuming the queued messages, in an attempt to clear the buildup and restore service. This change was applied in accordance with Swapcard standard infrastructure & security change and enhancement practices.

On Tuesday, June 28th at 23:21 UTC, the Incident Response team monitored the results and confirmed that the internal messaging system had returned to nominal levels.

On Tuesday, June 28th at 23:34 UTC, the Incident Response team resolved the incident.

We are still investigating the root cause that led to the buildup of messages inside the internal queue messaging system.

Swapcard monitoring detected the disruption at 20:45 UTC and activated the Swapcard Incident Response team. Swapcard’s team worked to triage and restore services to alleviate customer impact. In parallel, mitigations were put in place while the cause of the issue continues to be investigated.

Mitigation deployment

To ensure proper processing of all messages, the service responsible for handling these messages was scaled up to process more messages than at nominal levels, compensating for the buildup.
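As a rough illustration of why the scale-up has to exceed nominal capacity, the sketch below estimates how long a backlog takes to drain. The backlog only shrinks when consumers process messages faster than new ones arrive. The function name and all figures are hypothetical and are not measurements from this incident.

```python
import math


def drain_time_seconds(backlog: int, arrival_rate: float, consume_rate: float) -> float:
    """Estimate the seconds needed to clear `backlog` queued messages.

    Rates are in messages per second. Returns infinity when consumers
    are not faster than producers, because the backlog never shrinks.
    """
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


# Hypothetical figures, for illustration only.
nominal = drain_time_seconds(backlog=600_000, arrival_rate=400.0, consume_rate=400.0)
scaled = drain_time_seconds(backlog=600_000, arrival_rate=400.0, consume_rate=500.0)
print(nominal)  # inf: at nominal capacity the backlog never clears
print(scaled)   # 6000.0 seconds, i.e. 100 minutes
```

Under these assumed numbers, a service running at exactly the arrival rate keeps up with new traffic but never catches up, which is why extra capacity is needed until the queue is empty.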

As the system recovered, customers would have seen delays in the processing of messages, notifications and live streams decrease.

At 23:34 UTC, Swapcard confirmed that the incident was resolved and that processing delays and processing speed had returned to pre-incident levels.

Event Outline

Events of 2023 June 28 (UTC)

(20:39 UTC) | Initial delays began in messaging, notifications and live stream processing

(20:45 UTC) | Disruption identified by Swapcard automated monitoring systems

(20:47 UTC) | Swapcard Engineering acknowledged the issue

(21:06 UTC) | Swapcard Engineering identified the cause of the disruption

(21:32 UTC) | Swapcard Engineering triggered a scale-up of the affected service in an attempt to restore service

(23:21 UTC) | Swapcard Engineering monitored the results

(23:34 UTC) | Swapcard Engineering resolved the incident

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Forward Planning

In accordance with our high standards for deliverability, Swapcard will conduct an internal audit of the on-call procedure, which didn’t trigger a status page update during the incident. Swapcard will also take measures to improve the monitoring of the affected internal messaging system to avoid service disruption. Procedures and controls are already in place, but today’s incident highlights the need for improvement.
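One common form of such monitoring is a check on how full each queue is, so that an alert fires before a queue reaches capacity. The following is a minimal sketch of that idea only. The `QueueSample` type, the queue names and the 80% threshold are assumptions for illustration and do not describe Swapcard's actual monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSample:
    """One point-in-time reading for a queue (hypothetical structure)."""
    name: str
    depth: int      # messages currently waiting
    capacity: int   # maximum messages the queue can hold


def queues_near_capacity(samples: List[QueueSample], threshold: float = 0.8) -> List[str]:
    """Return the names of queues whose fill ratio is at or above `threshold`."""
    return [
        s.name
        for s in samples
        if s.capacity > 0 and s.depth / s.capacity >= threshold
    ]


# Hypothetical readings, for illustration only.
readings = [
    QueueSample("notifications", depth=9_000, capacity=10_000),
    QueueSample("messaging", depth=100, capacity=10_000),
]
print(queues_near_capacity(readings))  # ['notifications']
```

Alerting on the fill ratio rather than on downstream delays gives the on-call team time to add consumer capacity before one full queue affects the others.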

We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.

Posted Jun 28, 2023 - 14:09 UTC

Resolved
The internal queuing system, used by multiple systems inside the Swapcard platform, suffered an outage that delayed the processing of notifications and streaming services.
Posted Jun 27, 2023 - 23:00 UTC