Video Stream Failover: Complete Guide to Zero-Downtime Streaming
Learn how video stream failover works, why it matters for live broadcasts, and how to implement automatic failover with Vajracast for zero-downtime streaming.
What Is Video Stream Failover?
Video stream failover is the automatic process of switching from a failed or degraded video source to a backup source without interrupting the output stream. When a primary input drops (whether due to encoder failure, network outage, or signal degradation), the failover system detects the problem and routes a backup source to the output in its place.
For viewers, the goal is invisibility. A properly implemented failover switch should be imperceptible: no black frames, no buffering spinner, no interruption. The stream simply continues as though nothing happened.
Failover is not optional for professional broadcasting. Every live production that matters (sports coverage, news broadcasts, corporate events, 24/7 channels) relies on some form of failover protection. The question is not whether you need it, but how to implement it correctly.
Why Failover Matters More Than Ever
The economics of live streaming have changed. A decade ago, a dropped stream was an inconvenience. Today, it is a direct financial loss:
- Advertising revenue evaporates the moment viewers leave a broken stream
- Platform algorithms penalize channels with reliability issues, reducing future discoverability
- Contractual SLAs in enterprise and sports broadcasting carry financial penalties for downtime
- Brand reputation takes a hit that no post-mortem can fully repair
The shift to IP-based transport (away from dedicated SDI circuits) has increased both the opportunity and the risk. IP networks are cheaper and more flexible, but they introduce failure modes that dedicated circuits never had: packet loss, route changes, congestion, and endpoint crashes. Failover is the mechanism that makes IP transport trustworthy enough for mission-critical broadcasting.
Types of Failover: Hot, Warm, and Cold Standby
Not all failover is created equal. The three standard approaches differ in readiness, cost, and switching speed.
Hot Standby
In a hot standby configuration, the backup source is fully active and synchronized with the primary. Both sources are receiving, decoding, and buffering simultaneously. When the primary fails, the switch is instantaneous because the backup is already running.
Characteristics:
- Switching time: sub-50ms (total failover including detection: under 200ms)
- Resource cost: 2x the ingest bandwidth and processing
- Reliability: highest. Backup is proven live before it is needed
- Use case: mission-critical broadcasts where any interruption is unacceptable
Hot standby is what Vajracast implements by default. Every input in a failover chain is actively monitored and pre-buffered, so the switch happens in the time it takes to redirect an internal pointer, not the time it takes to establish a new connection.
Warm Standby
In warm standby, the backup source is connected but not fully active. The connection is established and periodically validated, but the system is not continuously decoding the full stream. On failover, there is a brief initialization period.
Characteristics:
- Switching time: 500ms to 2 seconds
- Resource cost: lower than hot standby (connection overhead only)
- Reliability: good, but there is a visible transition
- Use case: secondary feeds, non-critical streams, cost-sensitive deployments
Cold Standby
Cold standby means the backup source is configured but not connected. On primary failure, the system initiates a new connection from scratch: DNS resolution, TCP/UDP handshake, stream negotiation, and buffering.
Characteristics:
- Switching time: 2 to 10+ seconds
- Resource cost: minimal until failover triggers
- Reliability: lowest. The backup path is untested until it is needed
- Use case: disaster recovery, where some downtime is acceptable
For professional broadcasting, hot standby is the only option that meets audience expectations. Cold standby is better suited for background infrastructure (e.g., failing over a recording server) where a few seconds of gap is tolerable.
How Vajracast Implements Failover
Vajracast was designed with failover as a core architectural component, not an afterthought bolted onto a routing engine. Here is how it works under the hood.
Priority Chains
Every route in Vajracast can have multiple inputs arranged in a priority chain. The input with the highest priority is the preferred source. If it fails, the system automatically switches to the next input in the chain.
```
Priority 1: SRT Listener (main encoder)  ← active
Priority 2: SRT Caller (backup encoder)  ← hot standby
Priority 3: RTMP (cloud encoder)         ← hot standby
Priority 4: HTTP/TS (slate/fallback)     ← hot standby
```
There is no limit to the number of inputs in a chain. Each input is independently monitored, and the system always selects the highest-priority healthy input.
Health Monitoring
Vajracast continuously evaluates the health of every input using multiple signals:
- Connection state: is the source connected and delivering data?
- Bitrate analysis: is the bitrate within expected range, or has it dropped below a configurable threshold?
- Packet loss rate: for SRT inputs, is loss exceeding the recovery capacity?
- Continuity counters: are MPEG-TS continuity counters incrementing correctly, or are there gaps?
- Timeout detection: has data stopped arriving entirely?
Each health signal has a configurable threshold and hysteresis window. This prevents false failovers caused by momentary network glitches. For example, you might configure: “fail over if packet loss exceeds 15% for more than 300ms continuously.”
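The threshold-plus-hysteresis logic can be sketched as a small detector that fires only when a metric stays over its threshold continuously for a hold window. This is an illustrative sketch, not Vajracast's internal code:

```python
class ThresholdDetector:
    """Fires only when the metric stays above `threshold` for `hold_ms`
    continuously. Any healthy sample resets the window, so momentary
    spikes never trigger a failover."""
    def __init__(self, threshold: float, hold_ms: int):
        self.threshold = threshold
        self.hold_ms = hold_ms
        self._over_since: int | None = None  # start of the current bad streak

    def update(self, value: float, now_ms: int) -> bool:
        if value <= self.threshold:
            self._over_since = None          # healthy sample: reset
            return False
        if self._over_since is None:
            self._over_since = now_ms        # bad streak begins
        return now_ms - self._over_since >= self.hold_ms

# "fail over if packet loss exceeds 15% for more than 300ms continuously"
det = ThresholdDetector(threshold=0.15, hold_ms=300)
print(det.update(0.20, now_ms=0))    # False: streak just started
print(det.update(0.05, now_ms=200))  # False: dipped below, window resets
print(det.update(0.30, now_ms=400))  # False: a new streak begins
print(det.update(0.25, now_ms=750))  # True: 350ms continuously over threshold
```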
Sub-200ms Switching
When a failover condition is detected, the switch happens in three phases:
- Detection (configurable, typically 50-100ms): health metrics cross the threshold for the configured duration
- Decision (under 1ms): the routing engine selects the next healthy input from the priority chain
- Switching (under 1ms): the internal stream pointer redirects to the backup input’s pre-buffered data
Because backup inputs are already ingested, decoded, and buffered in hot standby, the actual switch is a pointer operation. There is no connection negotiation, no buffering delay, no codec initialization. The output continues with data from the backup source on the very next packet.
Total failover time: under 200ms in the worst case, typically under 100ms. At 30fps, that is 3-6 frames, imperceptible to viewers.
Failover Selection Strategy
When failover fires, which standby input takes over? In quality mode, Vajracast offers two strategies (configurable per route):
- Best Score (default): switches to the input with the best current quality — the one with the highest composite health score across silence, continuity counter errors, bitrate, and jitter. Use this when your inputs have different qualities and you always want the best one available.
- Round-robin: moves down the list to the next input still receiving signal, in priority order (wraps to the top after reaching the bottom). Skips dead inputs. Use this when inputs are equivalent and you want deterministic ordering.
In simple mode, there is no strategy choice — it is implicit round-robin across the connected inputs. The strategy setting only matters in quality mode.
Optional Auto-Failback
Failover handles the drop. Failback is the return journey: when a higher-priority input recovers, Vajracast can move back up the list to it automatically. This is opt-in per route (checkbox OFF by default in the Failover Settings modal). Without it, after a failover you stay on the backup until an operator manually switches back. With it enabled, the system walks back up toward the highest working input with zero touch.
Failback is not a simple “switch back on first packet” operation. That approach is a trap: a flapping link comes back for two seconds, triggers failback, drops again, triggers failover, and the cycle repeats. The result is a stream that looks worse than if you had stayed on the backup. Vajracast’s failback engine is built around that problem.
State-driven evaluation. A 5-second timer ticks continuously on every route with failback enabled. At each tick, it re-reads the state of all inputs. If an input of higher priority than the current active is healthy, the stability counter starts.
Stability window (default 3 seconds, configurable). The candidate must stay healthy across consecutive ticks before the switch fires:
- Tick 1: candidate healthy → `cleanChecks = 1`
- Tick 2: candidate still healthy → `cleanChecks = 2`
- Tick 3: candidate still healthy → `cleanChecks = 3` → switch
- One bad tick at any point → `cleanChecks = 0`, full reset
The reset is deliberately strict. A single hiccup wipes all accumulated progress. This is what makes the system immune to flapping.
Priority-aware, always. Unlike failover (which can use round-robin or best-score strategies), failback always moves by priority order — it is “going home to primary,” not “picking the best available.” If you are running on input #3 and #2 recovers, failback switches to #2. If #1 then recovers, another failback cycle switches to #1. The system always climbs toward the highest-priority healthy input.
Cooldown (default 7 seconds, configurable). Immediately after a successful failback, all evaluation pauses. This prevents a race condition where a stability counter on another candidate input could fire a second switch right after the first. The cooldown gives the newly promoted input time to settle.
Health criteria scale with failover mode. In simple mode, “healthy” means connected with bandwidth > 0 — packets are arriving. In quality mode, the candidate must pass every configured threshold (continuity counter errors, jitter, minimum bitrate) throughout the entire stability window. A link that reconnects but at low bitrate does not trigger failback in quality mode. The system prefers to hold the backup rather than switch to a borderline signal.
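The tick/stability/cooldown cycle reduces to a small state machine. A sketch parametrized in ticks rather than wall-clock time (names are illustrative, not Vajracast's internals):

```python
class FailbackEvaluator:
    """Evaluated once per tick. Fires failback only after `required_checks`
    consecutive healthy observations of a higher-priority candidate, then
    pauses evaluation for `cooldown_ticks`. One bad tick fully resets."""
    def __init__(self, required_checks: int = 3, cooldown_ticks: int = 2):
        self.required = required_checks
        self.cooldown_ticks = cooldown_ticks
        self.clean_checks = 0
        self.cooldown_left = 0

    def tick(self, candidate_healthy: bool) -> bool:
        """Returns True on the tick where failback should fire."""
        if self.cooldown_left > 0:
            self.cooldown_left -= 1      # evaluation paused after a switch
            return False
        if not candidate_healthy:
            self.clean_checks = 0        # strict reset: one hiccup wipes progress
            return False
        self.clean_checks += 1
        if self.clean_checks >= self.required:
            self.clean_checks = 0
            self.cooldown_left = self.cooldown_ticks
            return True
        return False

fb = FailbackEvaluator(required_checks=3, cooldown_ticks=2)
print(fb.tick(True))   # False: cleanChecks = 1
print(fb.tick(False))  # False: bad tick, full reset
print(fb.tick(True))   # False: cleanChecks = 1 again
print(fb.tick(True))   # False: cleanChecks = 2
print(fb.tick(True))   # True: cleanChecks = 3, switch fires, cooldown starts
```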
The settings live in the Failover Settings modal: an Auto-failback checkbox, a Failback Stability field (ms), and a Cooldown field (ms). Routes where failback is active are visually flagged with a ↩ suffix on the failover badge across Table, Card, and Diagram views.
Protocol-Agnostic Failover
One of Vajracast’s architectural advantages is that failover works across input types. The priority chain can mix any combination of supported inputs:
| Priority | Input type | Source | Notes |
|---|---|---|---|
| 1 | SRT (listener) | Main encoder on-site | Lowest latency, AES-256 encrypted |
| 2 | SRT (caller) | Backup encoder on-site | Independent network path |
| 3 | SRT from a cellular unit | Mobile encoder over LTE/5G | The cellular unit uses SRTLA bonding internally to survive individual modem drops; the receiver hands a standard SRT stream to the routing engine |
| 4 | RTMP | Cloud encoder | Legacy compatibility |
| 5 | Bars & Tone | Built-in generator | SMPTE pattern + channel ID overlay, zero external dependency |
This flexibility is essential for real-world deployments where not every source uses the same protocol. A remote contributor might send RTMP because their encoder does not support SRT. A mobile unit might use an SRTLA-bonded link to reach the gateway reliably over cellular. The on-site encoder uses SRT for optimal performance. From the failover engine’s perspective, all of these are just inputs — each is monitored, health-checked, and eligible to take the active slot when higher-priority inputs fail.
SRTLA is not failover
A point worth spelling out, since the two often get conflated: SRTLA is not a failover mechanism. It is a link-aggregation protocol that bonds multiple physical network connections (typically cellular modems) into a single logical SRT stream. If one bonded link drops, the remaining links keep the stream alive — the stream itself never fails over to anything. From the gateway’s point of view, an SRTLA input is one input, one stream, one entry in the priority chain.
Failover operates at a different layer: it switches between entirely independent inputs. SRTLA handles link redundancy within a single source; failover handles source redundancy across sources. They are complementary and orthogonal. You can run failover without SRTLA (two unbonded SRT streams over different ISPs). You can run SRTLA without failover (one bonded cellular stream as your sole input). Or you can combine them: use an SRTLA-bonded cellular input as one slot in a failover chain that also contains a fiber SRT input.
For a deeper comparison of SRT and RTMP and when to use each, see SRT vs RTMP: Which Streaming Protocol Should You Use?.
Real-World Failover Use Cases
Live Sports Broadcasting
Sports broadcasting is the most demanding failover scenario. A dropped feed during a goal, a touchdown, or a race finish is unrecoverable. The moment is gone, and no replay can substitute for the live experience.
Typical configuration:
- Primary: SRT from on-site production truck
- Backup 1: SRT from a second encoder on an independent network path (separate ISP or dedicated circuit)
- Backup 2: Cellular path as a last resort — the mobile encoder uses SRTLA bonding internally to survive individual modem failures, but from the failover chain’s perspective this is just one more SRT input
- Backup 3: Built-in Bars & Tone generator with “Technical difficulties” text overlay
Vajracast’s priority chain handles this natively. The system runs all four inputs in hot standby, monitoring each one continuously. If the primary encoder crashes, the switch to Backup 1 happens in under 100ms. If the entire venue loses wired internet, the cellular path takes over — and because that cellular path is SRTLA-bonded across several modems, a single modem dropping out does not even register at the failover layer. If the entire cellular link fails (all modems dead), only then does failover escalate to the Bars & Tone fallback rather than a broken player.
We have been running 40+ routes in this configuration for live sports production, 24/7. The system has been tested in real conditions, not just lab environments. For a deeper look at failover architectures for sports production, see our live sports broadcasting guide.
24/7 Linear Channels
Channels that broadcast around the clock (news networks, music channels, religious programming) cannot afford any downtime. Unlike event-based production where there is a defined start and end, 24/7 channels must survive every possible failure scenario across weeks and months.
Typical configuration:
- Primary: SRT from the playout server
- Backup 1: SRT from a redundant playout server
- Backup 2: HTTP/TS pull from a pre-programmed playlist server
- Failover is combined with crash recovery. If the Vajracast process itself restarts, it rebuilds all routes automatically in under 5 seconds
The crash recovery feature is especially important here. In a 24/7 environment, the gateway must survive not just input failures but its own restarts (OS updates, process crashes, hardware maintenance). Vajracast’s process adoption system detects running FFmpeg processes after a restart and reconnects to them without interrupting the output streams.
Remote Production (REMI)
Remote production moves the production control room away from the venue. Camera feeds are sent over IP to a central facility where switching, graphics, and distribution happen. This model relies entirely on reliable transport, and failover is the safety net.
Typical configuration:
- Primary: SRT from each camera encoder at the venue
- Backup: an SRTLA-bonded cellular link per camera, feeding a secondary SRT input in the same priority chain (the bonding handles single-modem failures inside the link; failover handles the case where the cellular link is completely down)
- Return feed: SRT back to the venue for IFB (interruptible foldback) and confidence monitoring
In REMI workflows, every camera is an independent failover chain. Vajracast handles this by creating separate routes for each camera, each with its own priority chain and health monitoring. For real-world REMI deployment strategies including Starlink connectivity, see our remote production with SRT guide. The diagram view in the UI makes it straightforward to visualize and manage dozens of routes simultaneously.
Monitoring and Alerting for Failover Events
Failover that you cannot observe is failover you cannot trust. Effective monitoring has three layers:
Real-Time Dashboard
Vajracast’s web interface shows the status of every input in every route:
- Green: healthy, active
- Yellow: connected but degraded (high loss, low bitrate)
- Red: disconnected or failed
- Active indicator showing which input in the priority chain is currently feeding the output
The diagram view provides a visual map of all routes, with real-time status overlays on every connection.
Prometheus Metrics
Vajracast exposes 50+ metrics via a /metrics endpoint compatible with Prometheus. Failover-related metrics include:
```
vajracast_input_status{route="sports_main", input="primary"} 1
vajracast_input_status{route="sports_main", input="backup1"} 1
vajracast_failover_events_total{route="sports_main"} 3
vajracast_failover_last_timestamp{route="sports_main"} 1707523200
vajracast_input_bitrate_bps{route="sports_main", input="primary"} 8500000
vajracast_input_packet_loss{route="sports_main", input="primary"} 0.002
```
These metrics can be graphed in Grafana (pre-built dashboards are included) and used to trigger alerts via Alertmanager. For example: “Alert if any route has executed more than 2 failover events in the past hour.”
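That example alert could be expressed as a standard Prometheus alerting rule along these lines (metric name taken from the list above; thresholds and labels are illustrative):

```yaml
groups:
  - name: vajracast-failover
    rules:
      - alert: FrequentFailovers
        # more than 2 failover events on any route in the past hour
        expr: increase(vajracast_failover_events_total[1h]) > 2
        labels:
          severity: warning
        annotations:
          summary: "Route {{ $labels.route }} failed over {{ $value }} times in the last hour"
```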
Event Logging and Webhooks
Every failover event is logged with:
- Timestamp
- Route name
- Source input (which failed)
- Target input (which took over)
- Reason (timeout, packet loss threshold, bitrate drop, manual switch)
- Duration on backup before recovery
This log is invaluable for post-event analysis. If failover triggered during a broadcast, you can trace exactly what happened, when, and why.
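A webhook consumer would receive those same fields in a structured payload. The schema below is a hypothetical illustration of the shape, not Vajracast's documented format:

```json
{
  "event": "failover",
  "timestamp": "2025-06-01T14:32:07.412Z",
  "route": "sports_main",
  "from_input": "primary",
  "to_input": "backup1",
  "reason": "packet_loss_threshold",
  "details": { "loss": 0.18, "threshold": 0.15, "window_ms": 300 }
}
```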
Best Practices for Configuring Failover
1. Use Independent Network Paths
If your primary and backup inputs share the same network switch, ISP, or cable run, a single network failure takes out both. True redundancy requires independent paths:
- Different ISPs for primary and backup
- Different physical network interfaces
- Different cable runs (separate conduit)
- For cellular backup, different carriers
2. Test Your Failover Regularly
A failover system that has never been tested is not a failover system. It is a hope. Schedule regular failover drills:
- Pull the primary encoder’s network cable during a test stream
- Kill the encoder process and measure switch time
- Inject packet loss using network simulation tools (`tc netem` on Linux) to test threshold detection
- Verify that auto-recovery works when the primary comes back
Test under load. Failover behavior can differ when the system is handling 50 routes versus 2.
3. Tune Your Thresholds
Default thresholds are a starting point. Tune them based on your specific environment:
- Timeout too aggressive (e.g., 50ms): causes false failovers on momentary network jitter
- Timeout too conservative (e.g., 5 seconds): viewers see 5 seconds of broken video before the switch
- Recommended starting point: 200-500ms timeout, 10% packet loss threshold, 50% bitrate floor
Monitor your failover event log. If you see frequent failovers followed by immediate recovery, your thresholds are too aggressive.
4. Always Have a Bars & Tone Fallback in Last Position
The last input in your priority chain should be something that cannot fail. Vajracast ships a built-in Bars & Tone generator for exactly this purpose: a virtual input that produces a real MPEG-TS stream locally without any external source, network, or encoder. Since it runs on the server itself, it is always available. There is nothing to disconnect.
The generator is not a static image. It is a professional-grade test pattern signal:
- Video patterns: SMPTE bars (75% or 100% HD), PAL 100% bars, or FFmpeg `testsrc2` with moving elements
- Six presets covering 1080p25, 1080i50, 576i50 (8-channel audio), 720p50 with clock, 1080p25 HEVC, and 540p25 for low-bitrate scenarios
- Text overlay: burned-in company name, channel identifier, or custom message with configurable font, size, color, and position
- Clock overlay: server-time burn-in for lip-sync debugging and live-proof timestamping
- Frame-identifiable animation: square pulse or staircase pulse makes the pattern unambiguous frame by frame
- Logo overlay: PNG logo positioned in a corner
- Audio tone: 1 kHz, 400 Hz, 440 Hz, or silence; 2, 4, or 8 channels; configurable level from 0 to -20 dBFS; optional sweep mode that cycles the tone channel by channel
Beyond emergency fallback, the same generator handles several workflows that keep routes warm and reduce test friction:
- Downstream validation: spawn a Bars & Tone input before live production starts. Your HLS viewers, SRT callers, and multiviewers immediately receive a known-good signal. If the viewer sees nothing, the problem is downstream, not upstream
- Warm slot holding: while the OB van is preparing, the route stays live on bars. Remote decoders stay connected, no SRT disconnect to deal with. When the real signal arrives, a manual failover (Set Active) or a configured priority chain takes over cleanly
- Audio channel verification: sweep mode runs the tone across every channel so your downstream operator can confirm 5.1 or 7.1 cabling is correct
- Lip-sync measurement: the animated pattern combined with the tone lets you measure downstream audio/video offset visually
- Prospect demos: show a complete Vajracast workflow from a conference room with zero physical equipment
Configure it as priority N+1 in your failover chain. If every live input drops, viewers see an intentional test pattern with your channel name rather than a frozen frame or a broken player.
5. Monitor Your Backup Sources
A backup source that is offline when you need it is worthless. Hot standby monitoring is not just about readiness. It is about continuously validating that the backup is healthy. Vajracast monitors all inputs in a priority chain equally, whether they are active or on standby. If your backup goes down, you know immediately, not when the primary fails and the backup fails to take over.
6. Plan for Gateway-Level Redundancy
Failover protects against input failure. But what about gateway failure? For the highest reliability, run two Vajracast instances:
- Primary gateway handles all production routes
- Secondary gateway mirrors the configuration and can take over via DNS failover or load balancer health checks
- Both instances can use the same Docker/Kubernetes deployment infrastructure
How Vajracast Compares to Other Failover Solutions
| Feature | Vajracast | Hardware Switcher | Cloud Failover (AWS) | Manual Switching |
|---|---|---|---|---|
| Switching speed | <200ms | <50ms (frame-accurate) | 2-10s | 5-30s (human reaction) |
| Protocol support | SRT, RTMP, RTSP, HLS, SRTLA, UDP, HTTP | SDI/HDMI only | RTMP, HLS | Any |
| Inputs per chain | Unlimited | 2-4 (hardware dependent) | Varies | N/A |
| Monitoring | Built-in + Prometheus | Typically minimal | CloudWatch | None |
| Cost | Software license | $5,000-$50,000+ | Per-minute compute | Labor cost |
| Remote management | Full web UI + REST API | Limited or none | AWS Console/API | Physical presence |
| Scalability | 50+ routes per instance | 1 route per device | Elastic but expensive | Not scalable |
Hardware switchers excel at frame-accurate switching for SDI workflows but cannot handle IP-based multi-protocol environments. Cloud solutions introduce latency and per-minute costs that add up fast. Manual switching is inherently unreliable because it depends on a human being awake, alert, and fast.
Vajracast occupies the middle ground: software-defined, IP-native, multi-protocol, and automated, at a fraction of the cost of hardware or cloud alternatives.
Putting It All Together
For a real-world reference of a redundant Vajracast deployment with multi-input failover across two ingests and four restream regions, see the example deployment — annotated diagrams with hover details on every node.
A complete failover setup in Vajracast follows this structure:
- Define your route: one output destination (e.g., SRT push to CDN)
- Add primary input: your main encoder, highest priority
- Add backup inputs: in priority order, each on an independent path
- Add a static fallback: lowest priority, guaranteed availability
- Configure health thresholds: timeout, packet loss, bitrate floor
- Set recovery behavior: auto-recover with hold-off timer, or manual
- Connect monitoring: Prometheus scraping, Grafana dashboards, alerting
- Test everything: simulate failures before going live
With this configuration, your stream is protected against encoder failure, network outage, protocol issues, and even complete venue connectivity loss. The system handles it all automatically, silently, and reliably.
For a step-by-step setup guide, see SRT Streaming Setup: From Zero to Production. For the broader architecture of stream routing and distribution, continue to Live Stream Routing: The Complete Guide.
Next Steps
- Broadcast Hub: the central routing platform that manages failover across regions
- SRT Streaming Gateway: the complete guide to SRT-based video infrastructure
- Video Failover Best Practices: shorter, tactical guide to failover configuration
- SRT vs RTMP: understand the protocol trade-offs that affect failover performance
- Live Stream Routing: how to route, split, and manage video signals across your infrastructure
Managed cloud platform with dedicated servers, N+1 failover, hardware transcoding, and global delivery. Free for 30 days.
30 days free · No credit card · Direct access to the dev team
Frequently Asked Questions
What is video stream failover?
Video stream failover is an automatic mechanism that switches to a backup video source when the primary source fails, ensuring continuous streaming without interruption.
How fast should failover switching be?
Professional broadcast failover should switch in under 500ms. Vajracast achieves sub-50ms switchover by pre-buffering backup sources in hot standby, with total end-to-end failover (including detection) under 200ms.
Can I have multiple backup sources?
Yes. Vajracast supports N+1 redundancy with unlimited backup sources in a priority chain. Each source is independently monitored with configurable health thresholds.
Does failover work with different protocols?
Yes. A priority chain can mix SRT, RTMP, RTSP, HLS, UDP, and HTTP inputs. SRTLA-bonded inputs are also supported — the receiver deaggregates them into standard SRT before the routing engine sees them, so they behave like any other SRT input in the chain. The failover mechanism is the same regardless of input type.