App Studio Analytics Outage

Incident Report for Swapcard

Postmortem

Summary

Approximately two weeks ago, users began experiencing disruptions when accessing analytics data within App Studio. Initial investigation did not immediately identify the root cause; deeper analysis ultimately traced the issue to a routine infrastructure consolidation that had capped database capacity below the threshold required to handle event analytics queries spanning long date ranges. Full service has since been restored following a targeted increase in database resources.

Timeline

  • 2026-03-11 19:32 UTC — Incident created. Investigation launched following a rise in user-reported analytics loading issues, even though automated monitoring had flagged no performance degradation. No clear root cause was identified at this stage.
  • 2026-03-18 10:45 UTC — Automated monitoring triggered alerts for elevated database error rates, corroborated by incoming user reports of analytics failures.
  • 2026-03-19 09:30 UTC — Deeper analysis by the Engineering team identified the root cause: high latency and query failures isolated to specific event analytics queries over long date ranges, linked to the earlier infrastructure consolidation.
  • 2026-03-19 09:40 UTC — Remediation plan executed: database resources scaled up to meet query load demands.
  • 2026-03-19 10:05 UTC — Database performance stabilized; analytics functionality verified as fully restored.

Resolution

To restore service, database resources were immediately scaled up to accommodate the throughput required by analytics queries. Once additional capacity was provisioned, query timeouts ceased and application stability was reestablished. The system remained under active monitoring until no further degradation was observed, at which point the incident was marked resolved.
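
For illustration, the sketch below shows one way such a scale-up can be applied. It assumes, purely hypothetically, that the analytics database is a managed AWS RDS instance resized through boto3; this report does not name the actual platform, instance identifiers, or sizes.

    # Hypothetical sketch only: the platform, identifier, and instance
    # class below are illustrative assumptions, not details from this report.
    import boto3

    rds = boto3.client("rds")

    # Move the analytics database to a larger instance class so that
    # long-date-range analytics queries fit within available capacity.
    rds.modify_db_instance(
        DBInstanceIdentifier="analytics-db",  # hypothetical identifier
        DBInstanceClass="db.r6g.2xlarge",     # larger class: more CPU and memory
        ApplyImmediately=True,                # apply now, not at the maintenance window
    )

    # Block until the resize completes before verifying that query
    # timeouts have ceased and marking the incident mitigated.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="analytics-db")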

Root Cause Analysis

The incident was caused by a capacity mismatch introduced during a routine infrastructure consolidation. Such consolidations are a standard part of our infrastructure lifecycle and are conducted with active monitoring in place to ensure continued performance across all services. In this instance, however, the capacity allocated to the Analytics Database was capped at a level that, while adequate for standard workloads, could not sustain the computational demands of event analytics queries executed over long date ranges. The initial investigation did not immediately surface this connection; only deeper analysis identified the consolidation as the underlying cause. When these queries ran, they exceeded the available processing capacity, resulting in database timeouts and repeated failures of the analytics service. These conditions persisted until capacity was increased.
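
As a sketch of the failure mode and one common mitigation, consider the hypothetical query below (the event_analytics table, its columns, and the PostgreSQL/psycopg2 stack are invented for illustration). A single aggregation over a long date range concentrates the entire scan in one query, which is the shape of work that exceeded the capped capacity; splitting the range into windows bounds the cost of each individual query.

    # Hypothetical sketch: the schema and storage details are invented.
    # Assumes a PostgreSQL analytics store accessed via psycopg2.
    from datetime import date, timedelta

    import psycopg2

    QUERY = """
        SELECT day, SUM(views) AS views
        FROM event_analytics          -- hypothetical table
        WHERE day >= %s AND day < %s
        GROUP BY day
    """

    def fetch_range(conn, start: date, end: date, chunk_days: int = 30):
        """Run the aggregation in bounded windows instead of one large scan."""
        rows = []
        cur = conn.cursor()
        cur.execute("SET statement_timeout = '30s'")  # fail fast rather than pile up
        lo = start
        while lo < end:
            hi = min(lo + timedelta(days=chunk_days), end)
            cur.execute(QUERY, (lo, hi))  # each window stays within capacity
            rows.extend(cur.fetchall())
            lo = hi
        return rows

Chunking trades one heavy query for several light ones, so a hard capacity cap shows up as gradual latency growth rather than outright timeouts.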

The investigation also surfaced a broader pattern: a gradual uptick in user-reported loading issues over the preceding two weeks, despite analytics having previously performed reliably. This indicated the capacity cap had been quietly eroding performance well before the full outage.

Conclusion

We are strengthening our processes to prevent recurrence. Monitoring coverage during database consolidations will be expanded to detect query-latency degradation earlier, and capacity planning will be refined to better account for the demands of event analytics queries over long date ranges.
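
As one possible shape for that earlier detection (a sketch using only the Python standard library; the thresholds, metric source, and alert hook are illustrative assumptions, not our production tooling), a periodic job could compare recent 95th-percentile query latency against a pre-consolidation baseline:

    # Illustrative latency-regression check; thresholds and the alerting
    # mechanism are assumptions for the sketch.
    import statistics

    def p95(samples: list[float]) -> float:
        """95th percentile via the standard library's quantiles()."""
        return statistics.quantiles(samples, n=100)[94]

    def latency_degraded(recent_ms: list[float], baseline_p95_ms: float,
                         tolerance: float = 1.5) -> bool:
        """True when recent p95 latency drifts past the tolerated baseline."""
        return p95(recent_ms) > tolerance * baseline_p95_ms

    # Example: with a baseline p95 of 800 ms, a consolidation that quietly
    # adds latency trips the check well before queries start timing out.
    recent = [700.0, 950.0, 1200.0, 1400.0, 1600.0, 2100.0] * 20  # ms, illustrative
    if latency_degraded(recent, baseline_p95_ms=800.0):
        print("ALERT: analytics query latency degraded vs. baseline")
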
Looking further ahead, the team is actively working on delivering a fully revamped analytics pipeline to production later this year. This is not an incremental improvement; the new pipeline is being built to deliver deeper, richer metrics that give event organizers a comprehensive view of their event performance.

Posted Mar 19, 2026 - 12:25 UTC

Resolved

This incident has been resolved.
Posted Mar 19, 2026 - 12:19 UTC