Video Stream Failover: Best Practices for Zero-Downtime Broadcasting

February 25, 2026 · Vajra Engineering · tutorials

Why Failover Matters

In live broadcasting, a dropped stream isn’t just a technical issue. It’s lost audience, lost revenue, and damaged reputation. From a sports event with 50,000 viewers to a corporate town hall with 500 employees, the expectation is the same: it must not go down.

Video stream failover is the safety net that catches your broadcast when the primary feed fails.

What is Video Failover?

Failover is the automatic switching from a primary video input to a backup when the system detects a failure. A good failover system:

Detects failure fast: milliseconds, not seconds
Switches cleanly: minimal visual disruption for viewers
Picks the right standby: either by best current quality (Best Score) or by moving down the list to the next input still receiving signal (Round-robin) — your call per route
Offers optional auto-failback: opt-in per route. When a higher-priority input recovers and holds stable, the system moves back up the list to it. Off by default so you stay put unless you’ve explicitly asked to go back
Requires no manual intervention when live: automation is the whole point

Viewer side: what they see (and don’t see)

Failover and failback are two distinct operations:

Failover = switching from the primary input to a backup when the primary drops (encoder crashed, fibre cut, Internet link dropped).
Failback = the reverse switch: automatic return to the primary input once it’s healthy and stable again.

Vajracast can do both automatically. Failover is always active on multi-input routes. Failback is opt-in per route (off by default), for operators who’d rather stay on the backup and decide for themselves when to return.

For HLS viewers, the transition is normally imperceptible: no black screen, no error message, and the stream simply continues with the new source. A 1-3 second micro-discontinuity may occur while a new HLS segment is published, but it is usually invisible because the player waits for the next segment exactly as it does during normal playback.

For direct SRT viewers, the switch is faster: the selector redirects the stream as soon as the stability window (3 s by default) is cleared on the new source, and the SRT viewer resumes playback without significant buffering.

Built-in anti-flap: the stability_window=3s and cooldown=7s parameters keep the selector from flapping when the link itself is flapping. If your primary keeps recovering and dropping in a loop, the selector stays on the backup until real stability returns.
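The anti-flap behaviour can be sketched in a few lines of Python. This is a minimal illustration, not Vajracast's implementation: the FailbackGate class, its method names, and the exact way the two timers interact are assumptions. Only the stability_window and cooldown defaults come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailbackGate:
    """Hypothetical helper: decides when a recovered primary may be promoted again."""
    stability_window: float = 3.0   # seconds the primary must stay healthy
    cooldown: float = 7.0           # seconds to wait after any switch
    healthy_since: Optional[float] = None
    last_switch: float = float("-inf")

    def record_switch(self, now: float) -> None:
        # Called whenever the selector changes input; restarts the cooldown.
        self.last_switch = now
        self.healthy_since = None

    def should_fail_back(self, primary_healthy: bool, now: float) -> bool:
        if not primary_healthy:
            self.healthy_since = None   # any drop resets the stability timer
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        stable = now - self.healthy_since >= self.stability_window
        cooled = now - self.last_switch >= self.cooldown
        return stable and cooled


gate = FailbackGate()
gate.record_switch(now=0.0)  # failover to the backup happened at t=0
# The primary flaps at t=2, so the stability timer restarts at t=3.
for t, healthy in [(1.0, True), (2.0, False), (3.0, True), (6.0, True), (7.0, True)]:
    print(t, gate.should_fail_back(healthy, t))
```

In this sketch the primary is stable from t=3 onward, so the stability window is satisfied at t=6, but the gate only opens at t=7 once the cooldown has also elapsed.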

Architecture: Redundant Inputs

The foundation of any failover setup is redundant inputs. You need at least two independent paths:

Active/Standby

The simplest model. One input is active, the other is hot standby:

Primary SRT → [Gateway] → Output
Backup RTMP → [Gateway] ↗ (on failure)

Primary carries the stream
Backup is connected and ready but not used
On primary failure, gateway switches to backup

Active/Active

Both inputs carry the stream simultaneously. The gateway selects the best one:

Input A (SRT) → [Gateway: compare] → Best signal → Output
Input B (SRT) → [Gateway: compare] ↗

Both paths are monitored in real-time
Gateway can switch based on quality, not just connectivity
More bandwidth cost, but higher reliability
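As a rough sketch of what "switch based on quality, not just connectivity" can mean, the Python below scores two inputs and only moves when the other one is clearly better. The InputStats fields, the scoring weights, and the 10-point margin are illustrative assumptions, not values from any particular gateway.

```python
from dataclasses import dataclass


@dataclass
class InputStats:
    name: str
    connected: bool
    packet_loss_pct: float   # recent loss, 0-100
    bitrate_kbps: float      # measured bitrate
    target_kbps: float       # expected bitrate


def quality_score(s: InputStats) -> float:
    """Illustrative 0-100 score; the weights are arbitrary assumptions."""
    if not s.connected or s.target_kbps <= 0:
        return 0.0
    bitrate_ratio = min(s.bitrate_kbps / s.target_kbps, 1.0)
    loss_penalty = min(s.packet_loss_pct * 10.0, 100.0)  # 10% loss removes 100 points
    return max(0.0, 100.0 * bitrate_ratio - loss_penalty)


def pick_active(current: InputStats, other: InputStats, margin: float = 10.0) -> InputStats:
    """Stay on the current input unless the other is better by at least `margin` points."""
    if quality_score(other) > quality_score(current) + margin:
        return other
    return current


a = InputStats("Input A", True, packet_loss_pct=4.0, bitrate_kbps=6000, target_kbps=6000)
b = InputStats("Input B", True, packet_loss_pct=0.5, bitrate_kbps=6000, target_kbps=6000)
print(pick_active(current=a, other=b).name)
```

The margin is a design choice: without it, two inputs of nearly equal quality would trade places on every small fluctuation.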

Detection: How Fast Can You React?

The speed of failover depends on how quickly you detect the problem. Common detection methods:

Stream Health Monitoring

Monitor the incoming stream for:

Packet loss: SRT reports this in real-time
Bitrate drops: sudden bitrate decrease often precedes a full failure
Black/frozen frames: content-aware detection (advanced)
Audio silence: loss of audio signal

Timeouts

Set aggressive but realistic timeouts:

Detection Method	Typical Timeout	Notes
SRT packet loss	<50ms	SRT statistics report instantly
TCP disconnect	1-5s	Depends on TCP timeout settings
Bitrate threshold	200-500ms	Configurable window
Content analysis	500ms-2s	Compute intensive
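The bitrate-threshold row can be illustrated with a small sliding-window detector. This is a sketch under stated assumptions: the BitrateFloorDetector class, the 50 ms sampling interval, and the 300 ms window are hypothetical choices picked to fall inside the 200-500ms range in the table.

```python
from collections import deque


class BitrateFloorDetector:
    """Flags an input as failed when its average bitrate over a short window drops below a floor."""

    def __init__(self, floor_kbps: float, window_ms: int = 300, sample_ms: int = 50):
        self.floor_kbps = floor_kbps
        # Keep only the samples that fit in the window (6 samples with the defaults).
        self.samples = deque(maxlen=max(1, window_ms // sample_ms))

    def update(self, bitrate_kbps: float) -> bool:
        """Add one sample; return True if the input should be treated as failed."""
        self.samples.append(bitrate_kbps)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to judge
        return sum(self.samples) / len(self.samples) < self.floor_kbps


detector = BitrateFloorDetector(floor_kbps=2000)
feed = [6000] * 6 + [0] * 6   # healthy stream, then the encoder stops sending
for i, kbps in enumerate(feed):
    if detector.update(kbps):
        print(f"failure flagged at sample {i}")
        break
```

Averaging over a window trades a little detection latency for immunity to a single low sample, which is the same tension the timeout table describes.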

The 50ms Target

Professional broadcast equipment targets sub-50ms failover. This means:

Failure detected within 20ms
Switch command issued within 10ms
Output buffer absorbs the transition within 20ms

At 50ms, the switch is effectively invisible to viewers: it spans about 1-2 video frames at 25-30 fps.
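The budget above is simple arithmetic, and how many frames it covers depends on the frame rate. The short Python check below uses the three figures from the list; the frame rates are common broadcast values chosen for illustration.

```python
DETECT_MS = 20   # failure detected
SWITCH_MS = 10   # switch command issued
BUFFER_MS = 20   # output buffer absorbs the transition

total_ms = DETECT_MS + SWITCH_MS + BUFFER_MS


def frames_spanned(duration_ms: float, fps: float) -> float:
    """Number of frame intervals covered by duration_ms at a given frame rate."""
    return duration_ms * fps / 1000.0


print(f"total budget: {total_ms} ms")
for fps in (25, 30, 50, 60):
    print(f"{fps} fps: {frames_spanned(total_ms, fps):.2f} frames")
```

At 25 and 30 fps the 50 ms budget covers 1.25 and 1.5 frames; at 50 and 60 fps it covers 2.5 and 3 frames.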

Implementation Patterns

Pattern 1: Gateway-Level Failover

The gateway itself handles failover logic. This is the simplest and most reliable approach.

Vajracast implements this natively:

Configure primary and backup inputs in a priority chain (up to 8 per route)
Set detection thresholds (packet loss %, bitrate floor, timeout)
Pick a selection strategy for when failover fires: Best Score (default, switches to the input with the best current quality) or Round-robin (moves down the list to the next input still receiving signal). Simple mode skips the choice and goes round-robin on connected inputs
Opt into auto-failback per route (off by default). When enabled, a higher-priority input that recovers and holds stable across the stability window (default 3s) gets promoted back — the system moves back up the list to it. A cooldown (default 7s) prevents ping-pong between candidates
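The two selection strategies can be sketched as follows. This is an illustration only: ChainInput and both function names are hypothetical, the score is assumed to be precomputed, and wrapping back to the top of the list in round-robin is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainInput:
    name: str
    receiving: bool   # is signal currently arriving?
    score: float      # current quality score, higher is better


def best_score(chain: List[ChainInput]) -> Optional[ChainInput]:
    """Pick the receiving input with the highest score; ties go to the earlier (higher-priority) input."""
    candidates = [i for i in chain if i.receiving]
    return max(candidates, key=lambda i: i.score, default=None)


def round_robin(chain: List[ChainInput], current_index: int) -> Optional[ChainInput]:
    """Move down the list from the failed input to the next one still receiving signal."""
    n = len(chain)
    for step in range(1, n):
        candidate = chain[(current_index + step) % n]  # wraps at the end (assumption)
        if candidate.receiving:
            return candidate
    return None


chain = [
    ChainInput("primary-srt", receiving=False, score=0.0),   # just failed
    ChainInput("backup-srt", receiving=True, score=70.0),
    ChainInput("backup-rtmp", receiving=True, score=90.0),
    ChainInput("bars-and-tone", receiving=True, score=50.0),
]
print(best_score(chain).name)       # backup-rtmp
print(round_robin(chain, 0).name)   # backup-srt
```

The example shows where the strategies diverge: Best Score jumps straight to the healthiest input, while Round-robin respects list order and takes the first input that still has signal.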

Pattern 2: Encoder-Level Redundancy

Run two encoders independently, each sending to the gateway:

Camera → Encoder A → SRT → Gateway
Camera → Encoder B → SRT → Gateway (backup)

This protects against encoder failure, not just network failure.

Pattern 3: Geographic Redundancy

For mission-critical broadcasts, distribute across locations:

Venue Encoder → SRT → Gateway (Region A)
Venue Encoder → SRT → Gateway (Region B) [failover]

Both gateways output to the CDN, and CDN-level origin failover provides the final layer of protection.

Monitoring and Alerts

Failover without monitoring is flying blind. Set up:

Real-time dashboards: visualize all input health metrics simultaneously
Automated alerts: get notified when failover activates (Slack, email, webhook)
Event logging: timestamp every switch event for post-mortem analysis
Recovery notifications: know when the primary is back and stable
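For event logging, one structured line per switch is usually enough to reconstruct a timeline afterwards. The sketch below builds such a line in Python; the switch_event function and every field name are illustrative assumptions, not a Vajracast log format. The same JSON string could serve as the body of a webhook or chat alert.

```python
import json
from datetime import datetime, timezone


def switch_event(route: str, from_input: str, to_input: str,
                 reason: str, kind: str = "failover") -> str:
    """Build one JSON log line for a switch; field names are illustrative."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC avoids timezone confusion in post-mortems
        "route": route,
        "kind": kind,          # "failover" or "failback"
        "from": from_input,
        "to": to_input,
        "reason": reason,
    }
    return json.dumps(event)


line = switch_event("main-stage", "primary-srt", "backup-rtmp", "bitrate below floor")
print(line)
```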

Testing Your Failover

Never trust a failover system you haven’t tested. Test regularly:

Scheduled drills: pull the primary cable during a test stream
Network simulation: inject packet loss with tools like tc to test SRT recovery vs. failover threshold
Encoder failure: kill the encoder process and measure switch time
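Recovery testing: if auto-failback is enabled, verify the system returns to primary once it has recovered and held stable; if it is disabled, verify the system stays on the backup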
Load testing: confirm failover works under peak output conditions
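For the network simulation drill, the Python helper below builds the Linux tc netem commands that add and later remove packet loss on an interface. It only returns the command strings and does not run them. The function name and the eth0 interface are assumptions for illustration, and running the commands requires root on a Linux host.

```python
import shlex
from typing import List


def netem_loss_commands(interface: str, loss_pct: float) -> List[str]:
    """Return the shell commands to add, and later remove, random packet loss on an interface."""
    if not 0 <= loss_pct <= 100:
        raise ValueError("loss_pct must be between 0 and 100")
    iface = shlex.quote(interface)  # guard against odd characters in the name
    return [
        f"tc qdisc add dev {iface} root netem loss {loss_pct:g}%",
        f"tc qdisc del dev {iface} root",
    ]


# Example drill: step the loss up and watch where SRT recovery gives way to failover.
for pct in (1, 5, 15):
    add_cmd, del_cmd = netem_loss_commands("eth0", pct)
    print(add_cmd)
    print(del_cmd)
```

Stepping the loss percentage up in stages shows the point at which SRT retransmission stops coping and the failover threshold takes over.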

Common Mistakes

Single point of failure in the switch itself: if your failover device fails, everything fails. Use a proven, hardened gateway.
Backup feed not monitored: your backup might be dead when you need it. Monitor both inputs at all times.
Too-aggressive timeouts: switching on momentary packet loss creates unnecessary disruption. Tune your thresholds.
No automatic failback: manual “switch back to primary” means someone has to be awake at 3 AM. Worse, without anti-flap protection, a naive auto-switchback loops endlessly on a flapping link. Use a stability window (3-5s) and a cooldown (5-10s) to absorb recovery jitter.
Not testing: the first time your failover fires shouldn’t be during a live event.

The Vajracast Advantage

Vajracast was designed with failover as a core feature, not an afterthought:

Multi-input failover with configurable priority chains (up to 8 inputs per route)
Selection strategies in quality mode: Best Score (default) or Round-robin
Sub-50ms switching on SRT inputs
Real-time health monitoring with per-input metrics
Optional auto-failback (opt-in per route) with anti-flap stability window and post-switch cooldown
Full event logging for every failover and failback event
Protocol-agnostic: works across SRT, RTMP, RTSP, UDP, and HLS inputs. SRTLA-bonded inputs are supported too — they are deaggregated into standard SRT before reaching the routing engine, so they slot into a failover chain like any other SRT input. SRTLA itself is not a failover mechanism; it is link aggregation within a single stream
Built-in Bars & Tone generator as a guaranteed last-position fallback (SMPTE bars, configurable audio tone, clock overlay, logo) — no external source required

