Video Stream Failover: Best Practices for Zero-Downtime Broadcasting
Why Failover Matters
In live broadcasting, a dropped stream isn’t just a technical issue. It’s lost audience, lost revenue, and damaged reputation. From a sports event with 50,000 viewers to a corporate town hall with 500 employees, the expectation is the same: it must not go down.
Video stream failover is the safety net that catches your broadcast when the primary feed fails.
What is Video Failover?
Failover is the automatic switching from a primary video input to a backup when the system detects a failure. A good failover system:
- Detects failure fast: milliseconds, not seconds
- Switches cleanly: minimal visual disruption for viewers
- Picks the right standby: either by best current quality (Best Score) or by moving down the list to the next input still receiving signal (Round-robin) — your call per route
- Offers optional auto-failback: opt-in per route. When a higher-priority input recovers and holds stable, the system moves back up the list to it. Off by default so you stay put unless you’ve explicitly asked to go back
- Requires no manual intervention when live: automation is the whole point
Viewer side: what they see (and don’t see)
Failover and failback are two distinct operations:
- Failover = switching from the primary input to a backup when the primary drops (encoder crashed, fibre cut, Internet link dropped).
- Failback = the reverse switch: automatic return to the primary input once it’s healthy and stable again.
Vajracast can do both automatically. Failover is always active on multi-input routes. Failback is opt-in per route (off by default), for operators who’d rather stay on the backup and decide for themselves when to return.
For HLS viewers, the transition is imperceptible: no black screen, no error message — the stream continues with the new source. A 1–3 second micro-discontinuity may occur while a new HLS segment is published, but it’s usually invisible (the player waits for the next segment exactly as it does during normal playback).
For direct SRT viewers, the switch is faster: the selector redirects the stream as soon as the stability window (3 s by default) is cleared on the new source, and the SRT viewer resumes playback without significant buffering.
Built-in anti-flap: the stability_window=3s and cooldown=7s parameters keep the selector from oscillating when the link itself is flapping. If your primary keeps recovering and dropping in a loop, the selector stays on the backup until the primary has held stable for the full window.
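To make the anti-flap behaviour concrete, here is a minimal sketch of how a stability window and a cooldown can gate failback. It is not Vajracast's implementation: the `FailbackGuard` class and its method names are hypothetical, and only the 3 s and 7 s defaults come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailbackGuard:
    """Hypothetical helper: decides when a recovered primary may be promoted back."""
    stability_window: float = 3.0  # seconds the primary must stay healthy
    cooldown: float = 7.0          # seconds to wait after any switch
    healthy_since: Optional[float] = None
    last_switch: float = float("-inf")

    def observe(self, now: float, primary_healthy: bool) -> bool:
        """Feed one health sample; return True when failback is allowed."""
        if not primary_healthy:
            self.healthy_since = None  # any drop resets the stability timer
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        stable = now - self.healthy_since >= self.stability_window
        cooled = now - self.last_switch >= self.cooldown
        return stable and cooled

    def record_switch(self, now: float) -> None:
        """Call whenever the selector changes input."""
        self.last_switch = now
        self.healthy_since = None


guard = FailbackGuard()
guard.record_switch(0.0)  # failover to the backup happened at t=0
# The primary recovers at t=1, drops again at t=2, then recovers for good at t=3.
samples = [(1.0, True), (2.0, False), (3.0, True), (5.0, True), (6.5, True), (7.0, True)]
decisions = [guard.observe(t, ok) for t, ok in samples]
print(decisions)
```

In this run the drop at t=2 resets the stability timer, and the sample at t=6.5 is stable but still inside the cooldown, so failback is first allowed at t=7.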
Architecture: Redundant Inputs
The foundation of any failover setup is redundant inputs. You need at least two independent paths:
Active/Standby
The simplest model. One input is active, the other is hot standby:
Primary SRT → [Gateway] → Output
Backup RTMP → [Gateway] ↗ (on failure)
- Primary carries the stream
- Backup is connected and ready but not used
- On primary failure, gateway switches to backup
Active/Active
Both inputs carry the stream simultaneously. The gateway selects the best one:
Input A (SRT) → [Gateway: compare] → Best signal → Output
Input B (SRT) → [Gateway: compare] ↗
- Both paths are monitored in real-time
- Gateway can switch based on quality, not just connectivity
- More bandwidth cost, but higher reliability
Detection: How Fast Can You React?
The speed of failover depends on how quickly you detect the problem. Common detection methods:
Stream Health Monitoring
Monitor the incoming stream for:
- Packet loss: SRT reports this in real-time
- Bitrate drops: sudden bitrate decrease often precedes a full failure
- Black/frozen frames: content-aware detection (advanced)
- Audio silence: loss of audio signal
Timeouts
Set aggressive but realistic timeouts:
| Detection Method | Typical Timeout | Notes |
|---|---|---|
| SRT packet loss | <50ms | SRT statistics report instantly |
| TCP disconnect | 1-5s | TCP timeout dependent |
| Bitrate threshold | 200-500ms | Configurable window |
| Content analysis | 500ms-2s | Compute intensive |
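As an illustration of the bitrate-threshold row, the sketch below averages received bytes over a sliding window and flags the input when the average falls under a floor. The class name, the 300 ms window, and the 2000 kbit/s floor are assumptions chosen for the example, not values from any particular gateway.

```python
from collections import deque


class BitrateFloorDetector:
    """Flags an input whose average bitrate over a sliding window falls below a floor."""

    def __init__(self, floor_kbps: float, window_ms: int = 300):
        self.floor_kbps = floor_kbps
        self.window_ms = window_ms
        self.samples = deque()  # (timestamp_ms, bytes_received)

    def add(self, timestamp_ms: int, nbytes: int) -> None:
        """Record bytes received at a given time and drop samples that aged out."""
        self.samples.append((timestamp_ms, nbytes))
        while self.samples and self.samples[0][0] <= timestamp_ms - self.window_ms:
            self.samples.popleft()

    def is_below_floor(self, now_ms: int) -> bool:
        """Average the bytes seen in the last window_ms and compare to the floor."""
        recent = sum(n for t, n in self.samples if t > now_ms - self.window_ms)
        kbps = recent * 8 / self.window_ms  # bits per millisecond equals kbit/s
        return kbps < self.floor_kbps


detector = BitrateFloorDetector(floor_kbps=2000, window_ms=300)
for t in (100, 200, 300):
    detector.add(t, 62_500)  # 62,500 bytes per 100 ms is 5000 kbit/s
print(detector.is_below_floor(300))  # healthy: full window of data
print(detector.is_below_floor(550))  # data stopped at t=300, average has decayed
```

A shorter window reacts faster but is more likely to trip on a momentary dip, which is the trade-off the table's 200-500ms range reflects.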
The 50ms Target
Professional broadcast equipment targets sub-50ms failover. This means:
- Failure detected within 20ms
- Switch command issued within 10ms
- Output buffer absorbs the transition within 20ms
At 50ms, the switch spans only 1-2 video frames at 25-30 fps, which is effectively invisible to viewers.
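The budget above is simple arithmetic, and a short calculation shows how the same 50ms translates into frames at different frame rates. The function name is illustrative; the three budget figures are the ones listed above.

```python
def frames_spanned(switch_ms: float, fps: float) -> float:
    """Return how many frame intervals a switch of switch_ms covers at a frame rate."""
    return switch_ms * fps / 1000.0


budget_ms = {"detect": 20, "command": 10, "buffer": 20}
total_ms = sum(budget_ms.values())  # 20 + 10 + 20 = 50

for fps in (25, 30, 60):
    print(f"{fps} fps: {frames_spanned(total_ms, fps):.2f} frames")
# 25 fps: 1.25 frames
# 30 fps: 1.50 frames
# 60 fps: 3.00 frames
```

At 60 fps the same 50ms covers three frames, so high-frame-rate productions may want a tighter budget.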
Implementation Patterns
Pattern 1: Gateway-Level Failover
The gateway itself handles failover logic. This is the simplest and most reliable approach.
Vajracast implements this natively:
- Configure primary and backup inputs in a priority chain (up to 8 per route)
- Set detection thresholds (packet loss %, bitrate floor, timeout)
- Pick a selection strategy for when failover fires: Best Score (default, switches to the input with the best current quality) or Round-robin (moves down the list to the next input still receiving signal). Simple mode skips the choice and goes round-robin on connected inputs
- Opt into auto-failback per route (off by default). When enabled, a higher-priority input that recovers and holds stable across the stability window (default 3s) gets promoted back — the system moves back up the list to it. A cooldown (default 7s) prevents ping-pong between candidates
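The two selection strategies can be sketched as follows. This is an illustration only: the `Input` type and its `score` field are hypothetical (the text does not say how quality is scored), and wrapping back to the top of the list in round-robin is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Input:
    name: str
    receiving: bool  # is the input currently receiving signal?
    score: float     # hypothetical quality score, higher is better


def pick_best_score(inputs: List[Input]) -> Optional[Input]:
    """Best Score: choose the live input with the highest current quality."""
    live = [i for i in inputs if i.receiving]
    # On a tie, max() keeps the first match, i.e. the higher-priority input.
    return max(live, key=lambda i: i.score, default=None)


def pick_round_robin(inputs: List[Input], current: int) -> Optional[Input]:
    """Round-robin: walk down the priority list from the failed input."""
    n = len(inputs)
    for step in range(1, n):
        candidate = inputs[(current + step) % n]  # assumption: wrap at the end
        if candidate.receiving:
            return candidate
    return None


chain = [
    Input("primary-srt", receiving=False, score=0.0),
    Input("backup-a", receiving=True, score=0.6),
    Input("backup-b", receiving=True, score=0.9),
]
print(pick_best_score(chain).name)      # backup-b
print(pick_round_robin(chain, 0).name)  # backup-a
```

With the primary down, Best Score jumps straight to the healthiest backup, while Round-robin takes the next live entry in list order regardless of quality.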
Pattern 2: Encoder-Level Redundancy
Run two encoders independently, each sending to the gateway:
Camera → Encoder A → SRT → Gateway
Camera → Encoder B → SRT → Gateway (backup)
This protects against encoder failure, not just network failure.
Pattern 3: Geographic Redundancy
For mission-critical broadcasts, distribute across locations:
Venue Encoder → SRT → Gateway (Region A)
Venue Encoder → SRT → Gateway (Region B) [failover]
Both gateways output to CDN. The CDN-level origin failover provides the final layer of protection.
Monitoring and Alerts
Failover without monitoring is flying blind. Set up:
- Real-time dashboards: visualize all input health metrics simultaneously
- Automated alerts: get notified when failover activates (Slack, email, webhook)
- Event logging: timestamp every switch event for post-mortem analysis
- Recovery notifications: know when the primary is back and stable
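For the event-logging item, one workable shape is a JSON line per switch, which is easy to append to a file or send to a webhook. The field names and the `switch_event` helper below are assumptions for illustration, not a documented log format.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def switch_event(route: str, kind: str, from_input: str, to_input: str,
                 reason: str, when: Optional[datetime] = None) -> str:
    """Serialize one failover or failback event as a JSON line."""
    if kind not in ("failover", "failback"):
        raise ValueError(f"unknown event kind: {kind}")
    when = when or datetime.now(timezone.utc)
    record = {
        "timestamp": when.isoformat(),  # UTC timestamp for post-mortem ordering
        "route": route,
        "kind": kind,
        "from": from_input,
        "to": to_input,
        "reason": reason,
    }
    return json.dumps(record)


line = switch_event("main-stage", "failover", "primary-srt", "backup-rtmp",
                    "bitrate below floor")
print(line)  # append to a log file or POST to an alerting webhook
```

Logging failback with the same schema as failover lets a post-mortem reconstruct exactly how long the route stayed on the backup.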
Testing Your Failover
Never trust a failover system you haven’t tested. Test regularly:
- Scheduled drills: pull the primary cable during a test stream
- Network simulation: inject packet loss with tools like tc to test SRT recovery vs. failover threshold
- Encoder failure: kill the encoder process and measure switch time
- Recovery testing: with auto-failback enabled, verify the system returns to the primary once it has recovered and held stable
- Load testing: confirm failover works under peak output conditions
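For the network-simulation drill, packet loss on Linux is typically injected with tc and its netem queueing discipline. The sketch below only builds the commands so they can be reviewed before running; the interface name `eth0`, the loss levels, and the helper names are assumptions, and the commands need root on the test machine.

```python
from typing import List


def netem_loss_cmd(interface: str, loss_percent: float) -> List[str]:
    """Build the tc command that sets random packet loss on an interface."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    # "replace" creates the qdisc if missing and updates it otherwise.
    return ["tc", "qdisc", "replace", "dev", interface, "root",
            "netem", "loss", f"{loss_percent:g}%"]


def netem_clear_cmd(interface: str) -> List[str]:
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


# Step through increasing loss levels. To apply one, pass the list to
# subprocess.run(cmd, check=True) as root, never on a production link.
for loss in (1, 5, 20):
    print(" ".join(netem_loss_cmd("eth0", loss)))
print(" ".join(netem_clear_cmd("eth0")))
```

Raising the loss in steps shows where SRT retransmission stops masking the loss and the failover threshold takes over.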
Common Mistakes
- Single point of failure in the switch itself: if your failover device fails, everything fails. Use a proven, hardened gateway.
- Backup feed not monitored: your backup might be dead when you need it. Monitor both inputs at all times.
- Too-aggressive timeouts: switching on momentary packet loss creates unnecessary disruption. Tune your thresholds.
- No automatic failback: manual “switch back to primary” means someone has to be awake at 3 AM. Worse, without anti-flap protection, a naive auto-switchback loops endlessly on a flapping link. Use a stability window (3-5s) and a cooldown (5-10s) to absorb recovery jitter.
- Not testing: the first time your failover fires shouldn’t be during a live event.
The Vajracast Advantage
Vajracast was designed with failover as a core feature, not an afterthought:
- Multi-input failover with configurable priority chains (up to 8 inputs per route)
- Selection strategies in quality mode: Best Score (default) or Round-robin
- Sub-50ms switching on SRT inputs
- Real-time health monitoring with per-input metrics
- Optional auto-failback (opt-in per route) with anti-flap stability window and post-switch cooldown
- Full event logging for every failover and failback event
- Protocol-agnostic: works across SRT, RTMP, RTSP, UDP, and HLS inputs. SRTLA-bonded inputs are supported too — they are deaggregated into standard SRT before reaching the routing engine, so they slot into a failover chain like any other SRT input. SRTLA itself is not a failover mechanism; it is link aggregation within a single stream
- Built-in Bars & Tone generator as a guaranteed last-position fallback (SMPTE bars, configurable audio tone, clock overlay, logo) — no external source required
Managed cloud platform with dedicated servers, N+1 failover, hardware transcoding, and global delivery. Free for 30 days.