History

May
No incidents reported
April
No incidents reported
March
19
Thu
App Studio Analytics Outage
12:21 PM
Summary

Approximately two weeks ago, users began experiencing disruption when accessing analytics data within App Studio. Initial investigation did not immediately identify the root cause; a deeper analysis ultimately traced the issue to a routine infrastructure consolidation that was capping database capacity below the threshold required to handle specific event analytics queries spanning long date ranges. Full service has since been restored following a targeted increase in database resources.

Timeline
  • 2026-03-11 19:32 UTC — Incident created. Investigation launched following a rise in user-reported analytics loading issues, even though performance monitoring showed no degradation at the time. This initial investigation did not surface a clear root cause.

  • 2026-03-18 10:45 UTC — Automated monitoring triggered alerts for elevated database error rates, corroborated by incoming user reports of analytics failures.

  • 2026-03-19 09:30 UTC — Deeper analysis by the Engineering team identified the root cause: high latency and query failures isolated to specific event analytics queries over long date ranges, linked to the earlier infrastructure consolidation.

  • 2026-03-19 09:40 UTC — Remediation plan executed: database resources scaled up to meet query load demands.

  • 2026-03-19 10:05 UTC — Database performance stabilized; analytics functionality verified as fully restored.

Resolution

To restore service, database resources were immediately scaled up to accommodate the throughput required by analytics queries. Once additional capacity was provisioned, query timeouts ceased and application stability was reestablished. The system remained under active monitoring until no further degradation was observed, at which point the incident was marked resolved.
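
The post does not name the database technology involved. Purely as an illustration, on an AWS RDS-style managed database (an assumption, not a confirmed detail) the emergency scale-up could look like the following; the instance identifier and class are invented:

```python
import boto3  # AWS SDK; assumes an RDS-managed database, which is not confirmed

rds = boto3.client("rds")

# Hypothetical identifiers: move the instance to a larger class
# immediately rather than waiting for the next maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier="analytics-db",   # invented name
    DBInstanceClass="db.r6g.4xlarge",      # larger compute tier
    ApplyImmediately=True,
)
```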

Root Cause Analysis

The incident was caused by a capacity mismatch introduced during a routine infrastructure consolidation. Such consolidations are a standard part of our infrastructure lifecycle and are conducted with active monitoring in place to ensure continued performance across all services. In this instance, however, the capacity allocated to the Analytics Database was capped at a level that, while adequate for standard workloads, could not sustain the computational demands of specific event analytics queries executed over long date ranges. The initial investigation did not immediately surface this connection; only deeper analysis identified the consolidation as the underlying cause. When these queries were executed, they exceeded available processing capacity, resulting in database timeouts and repeated failures of the analytics service, conditions that persisted until capacity was restored.
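
The post does not describe the query shape, but the failure mode (a single query spanning a long date range exceeding capped capacity) suggests one general mitigation worth sketching: bounding the window each query scans. This is an illustrative pattern with invented parameters, not the remediation the team applied (which was scaling capacity):

```python
from datetime import date, timedelta

def date_chunks(start: date, end: date, max_days: int = 31):
    """Split a long date range into bounded windows so each analytics
    query scans a predictable, capped amount of data."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)

# A year-long report becomes ~12 bounded queries instead of one
# unbounded scan that can exceed a capacity-capped database.
for lo, hi in date_chunks(date(2025, 3, 1), date(2026, 3, 1)):
    print(lo, hi)
```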

The investigation also surfaced a broader pattern: a gradual uptick in user-reported loading issues over the preceding two weeks, despite analytics having previously performed reliably. This indicated the capacity cap had been quietly eroding performance well ahead of the full outage.
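
One way stronger monitoring can catch this kind of slow erosion is a drift check that compares recent latency percentiles against a historical baseline. The metric and thresholds below are assumptions for illustration, not Swapcard's actual alerting rules:

```python
import statistics

def latency_drifting(baseline_p95_ms: float,
                     recent_daily_p95_ms: list[float],
                     ratio: float = 1.5) -> bool:
    """Flag gradual erosion: recent p95 latency creeping past a
    historical baseline well before hard timeouts appear."""
    return statistics.median(recent_daily_p95_ms) > baseline_p95_ms * ratio

# Two weeks of slowly rising daily p95 readings trip the check
# long before the eventual outage threshold.
print(latency_drifting(200.0, [230, 260, 310, 380, 450]))  # True
```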

Conclusion

We are strengthening our processes to prevent recurrence. Monitoring coverage during database consolidations will be expanded to detect query latency degradation earlier, and capacity planning will be refined to better account for the demands of event analytics queries spanning long date ranges.

Looking further ahead, the team is actively working to deliver a fully revamped analytics pipeline to production later this year. This is not an incremental improvement: the new pipeline is being built to deliver deeper, richer metrics that give event organizers a comprehensive view of their event performance.

13
Fri
Speaker & Attendee Profile Visibility Issue
2:02 AM
Summary

On March 11, 2026, we experienced a service disruption affecting the visibility of speaker and attendee profiles across multiple events. The incident was triggered by a timing issue during a scheduled infrastructure update — a data refresh was initiated before new field definitions had fully propagated to our Search Engine, causing the indexation system to be built from an outdated configuration.

Our Search Engine relies on an indexation system to serve profile data quickly across events. When the refresh completed against the stale configuration, new fields were rejected due to strict schema enforcement, leaving profiles absent from search results and listings.
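
To make the failure mode concrete: under strict schema enforcement, a write that references a field the index does not know about is rejected outright rather than partially applied. The sketch below mimics that behavior; the field names are invented for illustration and are not the actual fields involved:

```python
# Stale configuration: the index only knows the old fields, because
# the refresh ran before the new definitions propagated.
ALLOWED_FIELDS = {"name", "company", "title"}

def index_profile(doc: dict) -> None:
    unknown = set(doc) - ALLOWED_FIELDS
    if unknown:
        # Strict enforcement rejects the whole write, so the profile
        # never appears in search results or listings.
        raise ValueError(f"rejected, unknown fields: {unknown}")
    print("indexed:", doc["name"])

index_profile({"name": "Ada", "company": "Acme"})  # indexed
try:
    index_profile({"name": "Grace", "pronouns": "she/her"})  # new field
except ValueError as err:
    print(err)  # rejected, unknown fields: {'pronouns'}
```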

The goal of this post-mortem is to share our assessment and the steps taken to resolve the issue, while providing full transparency to our customers.

Timeline
Resolution

Once the root cause was identified, the team immediately initiated a new data refresh against the correct, fully propagated configuration. Profile visibility and search behavior were monitored throughout the process. When the refresh completed, all profiles returned to normal search and listing behavior.

Root Cause Analysis

The root cause was a race condition between the deployment of new Search Engine field definitions and the triggering of a data refresh. The refresh started before the new configuration had fully propagated, causing the indexation system to be built from a stale schema. Due to strict schema enforcement, subsequent write operations referencing the new fields were rejected, leaving profiles absent from search results and listings.

Several factors contributed to this incident:

  • Ambiguous deployment signal: A pipeline warning created uncertainty around whether the deployment had completed cleanly.

  • Insufficient spot checks: Post-refresh verification passed at a surface level, masking the underlying schema mismatch until live traffic exposed it.

Short-term improvements:

  • Add a mandatory pre-refresh validation step to confirm the Search Engine schema matches the expected configuration before any refresh is initiated (a minimal sketch follows this list).

  • Update the release process to include explicit schema validation checkpoints for sub-field definitions as a required step.
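
A minimal sketch of such a pre-refresh gate, assuming the Search Engine exposes its live field definitions over an API; every name and field here is hypothetical:

```python
# Expected configuration, including the newly deployed field.
EXPECTED_SCHEMA = {"name": "keyword", "company": "keyword",
                   "title": "text", "pronouns": "keyword"}

def fetch_live_schema() -> dict:
    # Stand-in for querying the search engine's live field mapping;
    # here it still reflects the stale, pre-deployment state.
    return {"name": "keyword", "company": "keyword", "title": "text"}

def safe_refresh(run_refresh) -> None:
    live = fetch_live_schema()
    if live != EXPECTED_SCHEMA:
        missing = EXPECTED_SCHEMA.keys() - live.keys()
        raise RuntimeError(f"schema not fully propagated; missing: {missing}")
    run_refresh()  # only reached once the new fields are visible

try:
    safe_refresh(lambda: print("refresh started"))
except RuntimeError as err:
    print(err)  # schema not fully propagated; missing: {'pronouns'}
```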

Conclusion

This incident stemmed from a timing edge case in our release process. The improvements listed above reflect our commitment to preventing similar incidents and maintaining a reliable platform for all our customers.

For any questions or concerns regarding this incident, please reach out to our support team.

February
13
Fri
SwapAccess issue in badge printing
3:51 AM

We are providing a detailed post-mortem report regarding the service disruption that affected Swapcard customers on February 12th, 2026, from 09:00 UTC to 09:30 UTC. This issue was caused by an unprecedented traffic spike that exceeded the capacity of one of our internal services, leading to temporarily degraded performance across badge printing (SwapAccess) and Studio access.

The goal of this post-mortem is to share insights from our assessment and the steps taken to resolve the issue while providing transparency to our customers.

Summary

On February 12th, 2026, Swapcard experienced a 30-minute service disruption affecting badge printing via the Check-in App (SwapAccess) and access to the Studio interface. Customers with live events during this window experienced printing delays, degraded Studio access, and intermittent error messages.

The incident was caused by an unusual and unprecedented volume of concurrent traffic hitting one of our internal services. This traffic pattern had not been observed before and exceeded the scaling thresholds configured at the time. The sudden load caused elevated response times that cascaded to downstream services, including badge generation and Studio access. The platform self-recovered as traffic levels normalized around 09:30 UTC.

Timeline

09:00 UTC | An unprecedented spike in concurrent traffic began hitting one of our core internal services, exceeding previously observed traffic patterns.

09:00–09:15 UTC | The service could not scale fast enough to absorb the sudden load. Elevated latency cascaded to dependent services, causing badge generation timeouts and degraded Studio access.

09:15–09:30 UTC | Traffic levels began to normalize. The platform progressively recovered as request queues cleared.

09:30 UTC | Full service restoration. Badge printing and Studio access returned to normal operation.

Onsite report | Our infrastructure team conducted a thorough investigation and immediately applied improvements to prevent recurrence.

Root Cause Analysis

The root cause of this incident was an unprecedented and sudden spike in concurrent traffic that exceeded the scaling capacity of one of our internal services. This traffic pattern had not been encountered before in production, and the service's auto-scaling configuration was not tuned to react quickly enough to absorb such a rapid increase.

As response times on this service climbed, the impact cascaded to dependent features — including badge generation and Studio, which rely on it for real-time operations. This cascading effect amplified the user-facing impact beyond what the initial traffic surge alone would have caused.
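
The remediation below focuses on scaling and alerting. As a general aside, one common pattern for containing this class of cascade (not something the post says Swapcard deployed) is a hard timeout with a fallback on calls to the shared service, so a slow dependency cannot stall its callers:

```python
import concurrent.futures

# Illustrative pattern only, with invented names: bound how long a
# dependent feature (e.g., badge generation) waits on the shared
# internal service, and degrade gracefully instead of queueing up.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def call_with_timeout(call, timeout_s: float = 0.5, fallback=None):
    future = _pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # slow dependency no longer stalls the caller
```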

Remediation

Following this incident, our infrastructure team immediately took action to strengthen the resilience of the affected services:

Scaling improvements: We have significantly increased the resource capacity and improved the scaling configuration of the affected internal service to handle traffic spikes well beyond the levels observed during this incident. The service can now absorb sudden surges much more effectively.

Resource optimization: We have optimized the resource usage of the service to ensure it operates more efficiently under load, reducing the likelihood of capacity issues even during unexpected traffic peaks.

Enhanced monitoring: We have deployed additional monitoring and alerting specifically targeting the failure patterns observed during this incident. This ensures that if a similar traffic surge were to occur, our team would be automatically notified within seconds and could intervene proactively before any customer-facing impact.
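
For illustration, a surge alert of the kind described can be as simple as a sliding-window request counter. The window and threshold below are invented, and a production setup would live in the metrics stack rather than in application code:

```python
import collections
import time

WINDOW_S = 1.0      # invented: look at the last second of traffic
THRESHOLD = 1000    # invented: alert past 1,000 requests per window
_arrivals: collections.deque = collections.deque()

def record_request(now: float | None = None) -> bool:
    """Record one request; return True when the surge alert should fire."""
    now = time.monotonic() if now is None else now
    _arrivals.append(now)
    while _arrivals and _arrivals[0] < now - WINDOW_S:
        _arrivals.popleft()
    return len(_arrivals) > THRESHOLD
```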

Conclusion

This incident was caused by an unpredictable traffic pattern that had not been previously observed on our platform. While the disruption was brief, we understand the impact it had on customers running live events during that window, and we take that seriously.

The scaling, optimization, and alerting improvements we have put in place significantly reduce the risk of a similar incident occurring in the future. Our infrastructure team continues to monitor the situation closely.

If you have any questions or concerns regarding this incident, please don't hesitate to reach out to our support team.