High response time on the Event App
Incident Report for Swapcard
Postmortem

Title: Post-Mortem Analysis - January 8, 2024 Incident

I. Executive Summary:
This post-mortem covers a service disruption that affected Swapcard customers on Monday, January 8th, 2024, from 8:37 UTC to 9:55 UTC. The incident was caused by a specific event configuration that put high load on our databases: for events with a large number of locations and slots, the meeting feature generated a very large number of combinations, exceeding 9 million on some events.

II. Incident Overview:

  • Incident Description:
    On Monday, January 8th, 2024, at 8:37 UTC, a service disruption impacted the Swapcard platform. The meeting feature generated a very large number of combinations, which put high load on our databases. The incident was resolved at 9:55 UTC.
  • Timeline:

    • 8:37 UTC: Initial onset of service disruption observed, with high load on databases.
    • 8:37 UTC: Detection of service disruption by Swapcard monitoring.
    • 8:45 UTC: Identification of the meeting feature causing extensive combinations.
    • 9:55 UTC: Successful resolution of the incident and implementation of patches.
    • Post-incident: Confirmation of the resolved status on the Swapcard status page.

III. Root Cause Analysis:

  • Immediate Cause:
    The incident was triggered by a specific event configuration that caused high load on the databases.
  • Underlying Causes:
    For events with a large number of locations and slots, the meeting feature generated an exceptionally large number of combinations, exceeding 9 million on some events.
  • Mitigation:
    Optimization of SQL queries and indexes, and implementation of hard limits on locations to prevent similar incidents.
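
To illustrate how quickly the combination count can grow, here is a minimal sketch. It assumes one combination per (location, slot) pair, which this report does not confirm, and it uses a hypothetical limit `MAX_LOCATIONS`; neither the counting model nor the value reflects Swapcard's actual implementation.

```python
# Assumption: one combination per (location, slot) pair. The report does not
# state how combinations are actually counted.

MAX_LOCATIONS = 500  # hypothetical hard limit, not Swapcard's real value


class LocationLimitExceeded(ValueError):
    """Raised when an event configuration exceeds the assumed location limit."""


def combination_count(num_locations: int, num_slots: int) -> int:
    """Return the number of (location, slot) pairs under the assumed model."""
    return num_locations * num_slots


def validate_event_config(num_locations: int, num_slots: int) -> int:
    """Reject oversized configurations before any database query is built."""
    if num_locations > MAX_LOCATIONS:
        raise LocationLimitExceeded(
            f"{num_locations} locations exceeds the limit of {MAX_LOCATIONS}"
        )
    return combination_count(num_locations, num_slots)
```

Under this model, an event with 3,000 locations and 3,000 slots would already reach 9,000,000 combinations, which is why a limit applied at configuration time can keep the load bounded.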

IV. Impact Analysis:

  • User Impact:
    Event App, Exhibitor Center and Studio experienced downtime. Elevated latency was observed for a few minutes after the incident was resolved.
  • Service Impact:
    The master database responsible for data writes on Swapcard was unaffected, so data coming from integration services was not impacted. Data integrity was not compromised. Only the read-only replicas of the database were affected by this incident.
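
The write/read split described above can be pictured with a small routing sketch. All names here (`Router`, `primary`, `replicas`, `target_for`) are hypothetical and do not describe Swapcard's actual architecture; the sketch only shows why an overload on read replicas can leave writes untouched.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator, List


@dataclass
class Router:
    """Hypothetical router: writes go to the primary, reads to replicas."""

    primary: str
    replicas: List[str]
    _rr: Iterator[str] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        if not self.replicas:
            raise ValueError("at least one replica is required")
        # Round-robin over the replicas for read traffic.
        self._rr = cycle(self.replicas)

    def target_for(self, sql: str) -> str:
        """Pick a database for a statement based on its first keyword."""
        tokens = sql.split()
        if not tokens:
            raise ValueError("empty SQL statement")
        if tokens[0].upper() == "SELECT":
            return next(self._rr)
        # Anything that is not a plain SELECT is treated as a write.
        return self.primary
```

With a setup like this, heavy `SELECT` traffic saturates only the replicas, while `INSERT` and `UPDATE` statements keep reaching the primary.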

V. Mitigation Deployment:
Upon identifying the root cause, we immediately optimized SQL queries and implemented hard limits on meetings. These measures make the platform better suited to handling such use cases in the future.

VI. Forward Planning:
In line with our commitment to reliable service delivery, Swapcard is undertaking a comprehensive review of the technical architecture of the meeting feature, event configuration and caching mechanisms. This incident has prompted us to enhance our procedures and controls to prevent similar occurrences in the future. We appreciate your understanding and continued support.

We sincerely apologize for any inconvenience caused by this disruption. If you have further concerns or require additional information, please don't hesitate to reach out to our support team. Thank you for your patience and collaboration.

Posted Jan 08, 2024 - 14:04 UTC

Resolved
This incident has been resolved.
Posted Jan 08, 2024 - 09:55 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 08, 2024 - 09:46 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 08, 2024 - 09:44 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 08, 2024 - 08:41 UTC
Investigating
We are currently investigating this issue.
Posted Jan 08, 2024 - 08:37 UTC
This incident affected: Event App and Exhibitor Center.