Partial Service Degradation Due to Upstream Provider Outage

Incident Report for Blitzz

Postmortem

What happened

On October 20, 2025, Amazon Web Services (AWS) experienced a major outage in its US-EAST-1 region. The incident began with “increased error rates and latencies for multiple AWS services” in that region.

Because our platform and video-services infrastructure rely on AWS regions including US-EAST-1, we observed degraded performance and partial service disruption even though we operate redundant, multi-region deployments.

Root cause

AWS attributed the outage to issues within its US-EAST-1 infrastructure, first surfacing as “significant error rates for requests made to the DynamoDB endpoint” in that region and subsequently affecting the EC2 internal network.

Because many AWS services depend on DynamoDB as a foundational component, the failure cascaded beyond a single Availability Zone and impacted connectivity across a broad set of services.

Mitigation & recovery

  • Our monitoring detected the elevated error rates and latency as soon as the incident began.
  • We continued to operate our multi-region system and failover paths, though some dependencies routed through US-EAST-1 remained affected until AWS restored normal operation. (A simplified illustration of this kind of routing decision follows this list.)
  • We informed our users of the partial degradation and provided status updates as recovery progressed.
  • We confirmed a return to full operations once AWS reported that services had normalised and our own metrics returned to expected levels.
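
As a simplified, hypothetical illustration (not our production code), the snippet below shows the kind of decision our failover paths make during an upstream regional incident: new video sessions prefer the primary region unless its error rate or latency crosses a threshold, in which case the healthiest alternative region is chosen. The region names, thresholds, and metric values are illustrative assumptions.

```python
# Hypothetical sketch of monitoring-driven region selection.
# Thresholds, regions, and the metrics snapshot are illustrative only.
from dataclasses import dataclass

@dataclass
class RegionHealth:
    region: str
    error_rate: float       # fraction of failed requests over the window
    p95_latency_ms: float   # 95th-percentile request latency

ERROR_RATE_LIMIT = 0.05      # above 5% errors, treat the region as unhealthy
P95_LATENCY_LIMIT_MS = 1500  # sustained slow responses also count

def healthy(h: RegionHealth) -> bool:
    return h.error_rate < ERROR_RATE_LIMIT and h.p95_latency_ms < P95_LATENCY_LIMIT_MS

def pick_region(candidates: list[RegionHealth], preferred: str) -> str:
    """Prefer the primary region; route new sessions elsewhere when it degrades."""
    by_name = {h.region: h for h in candidates}
    if preferred in by_name and healthy(by_name[preferred]):
        return preferred
    usable = [h for h in candidates if healthy(h)]
    if not usable:
        return preferred  # nothing healthy elsewhere: stay put rather than flap
    return min(usable, key=lambda h: (h.error_rate, h.p95_latency_ms)).region

# During an incident like this one, us-east-1 would be skipped for new sessions.
snapshot = [
    RegionHealth("us-east-1", error_rate=0.31, p95_latency_ms=4200),
    RegionHealth("us-west-2", error_rate=0.01, p95_latency_ms=180),
]
print(pick_region(snapshot, preferred="us-east-1"))  # -> us-west-2
```

As the second bullet notes, rerouting of this kind helps new sessions but cannot fully shield dependencies that exist only in the impaired region, which is why some impact persisted until AWS recovered.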

Lessons learned & next steps

  1. Enhance visibility into provider-side incident indicators (e.g., upstream region health, dependency mappings) so we can adapt routing more proactively.
  2. Investigate additional fallback mechanisms for critical dependencies that may still route through a primary region, even in multi-region setups. (A sketch of one such mechanism follows this list.)
  3. Refine our customer communications for provider-outage scenarios, including clearer timelines, impact descriptions, and reassurance of our multi-region posture.
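
As a hypothetical sketch of the fallback mechanism mentioned in item 2 (the endpoints, retry counts, and timeout are illustrative assumptions, not our actual service addresses), a call to a critical dependency could try a secondary regional endpoint after the primary region's endpoint fails repeatedly:

```python
# Hypothetical regional fallback for a critical dependency call.
# Endpoints and tuning values are placeholders for illustration.
import time
import urllib.request

PRIMARY = "https://api.us-east-1.example.internal/critical-dependency"
SECONDARY = "https://api.us-west-2.example.internal/critical-dependency"

def call_with_regional_fallback(timeout_s: float = 2.0, retries: int = 2) -> bytes:
    """Try the primary region first; on repeated failure, fall back to the secondary."""
    for endpoint in (PRIMARY, SECONDARY):
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(endpoint, timeout=timeout_s) as resp:
                    return resp.read()
            except OSError:
                time.sleep(0.2 * (attempt + 1))  # brief backoff before the next attempt
    raise RuntimeError("both regional endpoints unavailable")
```

Whether a fallback like this is safe depends on the dependency: reads can often be served from a replica in another region, while writes may need to queue until the primary recovers.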

Summary

Although Blitzz is architected for multi-region redundancy, including cross-region video routing, the AWS US-EAST-1 incident on October 20 disrupted services because of a foundational failure in AWS infrastructure. Services have since been fully restored, and we are reviewing and enhancing our resilience to minimise risk from upstream provider outages in the future.

Posted Oct 22, 2025 - 06:39 PDT

Resolved

Starting around 12:11 AM PDT on Monday, October 20, 2025, we experienced a partial degradation of several core services on the Blitzz platform. This was caused by an operational issue with our upstream provider, Amazon Web Services (AWS), specifically impacting its US-EAST-1 region.

The outage affected multiple AWS services and had a downstream impact on our video infrastructure, leading to intermittent connection failures, session drops, and elevated error rates for live video and related features.

While AWS resolved the core issue later that afternoon, some residual effects — such as increased latency and slower load times — persisted for a short period until full recovery was confirmed.

All systems have since returned to normal operation, and we continue to monitor performance across all regions.
Posted Oct 20, 2025 - 00:00 PDT