Video Stream Failover: Complete Guide to Zero-Downtime Streaming
Learn how video stream failover works, why it matters for live broadcasts, and how to implement automatic failover with Vajracast for zero-downtime streaming.
What Is Video Stream Failover?
Video stream failover is the automatic process of switching from a failed or degraded video source to a backup source without interrupting the output stream. When a primary input drops (whether due to encoder failure, network outage, or signal degradation), the failover system detects the problem and routes a backup source to the output in its place.
For viewers, the goal is invisibility. A properly implemented failover switch should be imperceptible: no black frames, no buffering spinner, no interruption. The stream simply continues as though nothing happened.
Failover is not optional for professional broadcasting. Every live production that matters (sports coverage, news broadcasts, corporate events, 24/7 channels) relies on some form of failover protection. The question is not whether you need it, but how to implement it correctly.
Why Failover Matters More Than Ever
The economics of live streaming have changed. A decade ago, a dropped stream was an inconvenience. Today, it is a direct financial loss:
- Advertising revenue evaporates the moment viewers leave a broken stream
- Platform algorithms penalize channels with reliability issues, reducing future discoverability
- Contractual SLAs in enterprise and sports broadcasting carry financial penalties for downtime
- Brand reputation takes a hit that no post-mortem can fully repair
The shift to IP-based transport (away from dedicated SDI circuits) has increased both the opportunity and the risk. IP networks are cheaper and more flexible, but they introduce failure modes that dedicated circuits never had: packet loss, route changes, congestion, and endpoint crashes. Failover is the mechanism that makes IP transport trustworthy enough for mission-critical broadcasting.
Types of Failover: Hot, Warm, and Cold Standby
Not all failover is created equal. The three standard approaches differ in readiness, cost, and switching speed.
Hot Standby
In a hot standby configuration, the backup source is fully active and synchronized with the primary. Both sources are receiving, decoding, and buffering simultaneously. When the primary fails, the switch is instantaneous because the backup is already running.
Characteristics:
- Switching time: sub-50ms (total failover including detection: under 200ms)
- Resource cost: 2x the ingest bandwidth and processing
- Reliability: highest. Backup is proven live before it is needed
- Use case: mission-critical broadcasts where any interruption is unacceptable
Hot standby is what Vajracast implements by default. Every input in a failover chain is actively monitored and pre-buffered, so the switch happens in the time it takes to redirect an internal pointer, not the time it takes to establish a new connection.
Warm Standby
In warm standby, the backup source is connected but not fully active. The connection is established and periodically validated, but the system is not continuously decoding the full stream. On failover, there is a brief initialization period.
Characteristics:
- Switching time: 500ms to 2 seconds
- Resource cost: lower than hot standby (connection overhead only)
- Reliability: good, but there is a visible transition
- Use case: secondary feeds, non-critical streams, cost-sensitive deployments
Cold Standby
Cold standby means the backup source is configured but not connected. On primary failure, the system initiates a new connection from scratch: DNS resolution, TCP/UDP handshake, stream negotiation, and buffering.
Characteristics:
- Switching time: 2 to 10+ seconds
- Resource cost: minimal until failover triggers
- Reliability: lowest. The backup path is untested until it is needed
- Use case: disaster recovery, where some downtime is acceptable
For professional broadcasting, hot standby is the only option that meets audience expectations. Cold standby is better suited for background infrastructure (e.g., failing over a recording server) where a few seconds of gap is tolerable.
How Vajracast Implements Failover
Vajracast was designed with failover as a core architectural component, not an afterthought bolted onto a routing engine. Here is how it works under the hood.
Priority Chains
Every route in Vajracast can have multiple inputs arranged in a priority chain. The input with the highest priority is the preferred source. If it fails, the system automatically switches to the next input in the chain.
```
Priority 1: SRT Listener (main encoder)  ← active
Priority 2: SRT Caller (backup encoder)  ← hot standby
Priority 3: RTMP (cloud encoder)         ← hot standby
Priority 4: HTTP/TS (slate/fallback)     ← hot standby
```
There is no limit to the number of inputs in a chain. Each input is independently monitored, and the system always selects the highest-priority healthy input.
Health Monitoring
Vajracast continuously evaluates the health of every input using multiple signals:
- Connection state: is the source connected and delivering data?
- Bitrate analysis: is the bitrate within expected range, or has it dropped below a configurable threshold?
- Packet loss rate: for SRT inputs, is loss exceeding the recovery capacity?
- Continuity counters: are MPEG-TS continuity counters incrementing correctly, or are there gaps?
- Timeout detection: has data stopped arriving entirely?
Each health signal has a configurable threshold and hysteresis window. This prevents false failovers caused by momentary network glitches. For example, you might configure: “fail over if packet loss exceeds 15% for more than 300ms continuously.”
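The threshold-plus-hysteresis logic can be sketched as a small detector that fires only when a metric stays over its threshold continuously for a hold window. This is an illustrative sketch, not Vajracast's internal code:

```python
class ThresholdDetector:
    """Fires only when the metric stays above `threshold` for `hold_ms`
    continuously. Any healthy sample resets the window, so momentary
    spikes never trigger a failover."""
    def __init__(self, threshold: float, hold_ms: int):
        self.threshold = threshold
        self.hold_ms = hold_ms
        self._over_since: int | None = None  # start of the current bad streak

    def update(self, value: float, now_ms: int) -> bool:
        if value <= self.threshold:
            self._over_since = None          # healthy sample: reset
            return False
        if self._over_since is None:
            self._over_since = now_ms        # bad streak begins
        return now_ms - self._over_since >= self.hold_ms

# "fail over if packet loss exceeds 15% for more than 300ms continuously"
det = ThresholdDetector(threshold=0.15, hold_ms=300)
print(det.update(0.20, now_ms=0))    # False: streak just started
print(det.update(0.05, now_ms=200))  # False: dipped below, window resets
print(det.update(0.30, now_ms=400))  # False: a new streak begins
print(det.update(0.25, now_ms=750))  # True: 350ms continuously over threshold
```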
Sub-200ms Switching
When a failover condition is detected, the switch happens in three phases:
- Detection (configurable, typically 50-100ms): health metrics cross the threshold for the configured duration
- Decision (under 1ms): the routing engine selects the next healthy input from the priority chain
- Switching (under 1ms): the internal stream pointer redirects to the backup input’s pre-buffered data
Because backup inputs are already ingested, decoded, and buffered in hot standby, the actual switch is a pointer operation. There is no connection negotiation, no buffering delay, no codec initialization. The output continues with data from the backup source on the very next packet.
Total failover time: under 200ms in the worst case, typically under 100ms. At 30fps, that is 3-6 frames, imperceptible to viewers.
Failover Selection Strategy
When failover fires, which standby input takes over? In quality mode, Vajracast offers two strategies (configurable per route):
- Best Score (default): switches to the input with the best current quality — the one with the highest composite health score across silence, continuity counter errors, bitrate, and jitter. Use this when your inputs have different qualities and you always want the best one available.
- Round-robin: moves down the list to the next input still receiving signal, in priority order (wraps to the top after reaching the bottom). Skips dead inputs. Use this when inputs are equivalent and you want deterministic ordering.
In simple mode, there is no strategy choice — it is implicit round-robin across the connected inputs. The strategy setting only matters in quality mode.
Optional Auto-Failback
Failover handles the drop. Failback is the return journey: when a higher-priority input recovers, Vajracast can move back up the list to it automatically. This is opt-in per route (checkbox OFF by default in the Failover Settings modal). Without it, after a failover you stay on the backup until an operator manually switches back. With it enabled, the system walks back up toward the highest working input with zero touch.
Failback is not a simple “switch back on first packet” operation. That approach is a trap: a flapping link comes back for two seconds, triggers failback, drops again, triggers failover, and the cycle repeats. The result is a stream that looks worse than if you had stayed on the backup. Vajracast’s failback engine is built around that problem.
State-driven evaluation. A 5-second timer ticks continuously on every route with failback enabled. At each tick, it re-reads the state of all inputs. If an input of higher priority than the current active is healthy, the stability counter starts.
Stability window (default 3 seconds, configurable). The candidate must stay healthy across consecutive ticks before the switch fires:
- Tick 1: candidate healthy → `cleanChecks = 1`
- Tick 2: candidate still healthy → `cleanChecks = 2`
- Tick 3: candidate still healthy → `cleanChecks = 3` → switch
- One bad tick at any point → `cleanChecks = 0`, full reset
The reset is deliberately strict. A single hiccup wipes all accumulated progress. This is what makes the system immune to flapping.
Priority-aware, always. Unlike failover (which can use round-robin or best-score strategies), failback always moves by priority order — it is “going home to primary,” not “picking the best available.” If you are running on input #3 and #2 recovers, failback switches to #2. If #1 then recovers, another failback cycle switches to #1. The system always climbs toward the highest-priority healthy input.
Cooldown (default 7 seconds, configurable). Immediately after a successful failback, all evaluation pauses. This prevents a race condition where a stability counter on another candidate input could fire a second switch right after the first. The cooldown gives the newly promoted input time to settle.
Health criteria scale with failover mode. In simple mode, “healthy” means connected with bandwidth > 0 — packets are arriving. In quality mode, the candidate must pass every configured threshold (continuity counter errors, jitter, minimum bitrate) throughout the entire stability window. A link that reconnects but at low bitrate does not trigger failback in quality mode. The system prefers to hold the backup rather than switch to a borderline signal.
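The tick/stability/cooldown cycle reduces to a small state machine. A sketch parametrized in ticks rather than wall-clock time (names are illustrative, not Vajracast's internals):

```python
class FailbackEvaluator:
    """Evaluated once per tick. Fires failback only after `required_checks`
    consecutive healthy observations of a higher-priority candidate, then
    pauses evaluation for `cooldown_ticks`. One bad tick fully resets."""
    def __init__(self, required_checks: int = 3, cooldown_ticks: int = 2):
        self.required = required_checks
        self.cooldown_ticks = cooldown_ticks
        self.clean_checks = 0
        self.cooldown_left = 0

    def tick(self, candidate_healthy: bool) -> bool:
        """Returns True on the tick where failback should fire."""
        if self.cooldown_left > 0:
            self.cooldown_left -= 1      # evaluation paused after a switch
            return False
        if not candidate_healthy:
            self.clean_checks = 0        # strict reset: one hiccup wipes progress
            return False
        self.clean_checks += 1
        if self.clean_checks >= self.required:
            self.clean_checks = 0
            self.cooldown_left = self.cooldown_ticks
            return True
        return False

fb = FailbackEvaluator(required_checks=3, cooldown_ticks=2)
print(fb.tick(True))   # False: cleanChecks = 1
print(fb.tick(False))  # False: bad tick, full reset
print(fb.tick(True))   # False: cleanChecks = 1 again
print(fb.tick(True))   # False: cleanChecks = 2
print(fb.tick(True))   # True: cleanChecks = 3, switch fires, cooldown starts
```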
The settings live in the Failover Settings modal: an Auto-failback checkbox, a Failback Stability field (ms), and a Cooldown field (ms). Routes where failback is active are visually flagged with a ↩ suffix on the failover badge across Table, Card, and Diagram views.
Protocol-Agnostic Failover
One of Vajracast’s architectural advantages is that failover works across input types. The priority chain can mix any combination of supported inputs:
| Priority | Input type | Source | Notes |
|---|---|---|---|
| 1 | SRT (listener) | Main encoder on-site | Lowest latency, AES-256 encrypted |
| 2 | SRT (caller) | Backup encoder on-site | Independent network path |
| 3 | SRT from a cellular unit | Mobile encoder over LTE/5G | The cellular unit uses SRTLA bonding internally to survive individual modem drops; the receiver hands a standard SRT stream to the routing engine |
| 4 | RTMP | Cloud encoder | Legacy compatibility |
| 5 | Bars & Tone | Built-in generator | SMPTE pattern + channel ID overlay, zero external dependency |
This flexibility is essential for real-world deployments where not every source uses the same protocol. A remote contributor might send RTMP because their encoder does not support SRT. A mobile unit might use an SRTLA-bonded link to reach the gateway reliably over cellular. The on-site encoder uses SRT for optimal performance. From the failover engine’s perspective, all of these are just inputs — each is monitored, health-checked, and eligible to take the active slot when higher-priority inputs fail.
SRTLA is not failover
A point worth spelling out, since the two often get conflated: SRTLA is not a failover mechanism. It is a link-aggregation protocol that bonds multiple physical network connections (typically cellular modems) into a single logical SRT stream. If one bonded link drops, the remaining links keep the stream alive — the stream itself never fails over to anything. From the gateway’s point of view, an SRTLA input is one input, one stream, one entry in the priority chain.
Failover operates at a different layer: it switches between entirely independent inputs. SRTLA handles link redundancy within a single source; failover handles source redundancy across sources. They are complementary and orthogonal. You can run failover without SRTLA (two unbonded SRT streams over different ISPs). You can run SRTLA without failover (one bonded cellular stream as your sole input). Or you can combine them: use an SRTLA-bonded cellular input as one slot in a failover chain that also contains a fiber SRT input.
For a deeper comparison of SRT and RTMP and when to use each, see SRT vs RTMP: Which Streaming Protocol Should You Use?.
Real-World Failover Use Cases
Live Sports Broadcasting
Sports broadcasting is the most demanding failover scenario. A dropped feed during a goal, a touchdown, or a race finish is unrecoverable. The moment is gone, and no replay can substitute for the live experience.
Typical configuration:
- Primary: SRT from on-site production truck
- Backup 1: SRT from a second encoder on an independent network path (separate ISP or dedicated circuit)
- Backup 2: Cellular path as a last resort — the mobile encoder uses SRTLA bonding internally to survive individual modem failures, but from the failover chain’s perspective this is just one more SRT input
- Backup 3: Built-in Bars & Tone generator with “Technical difficulties” text overlay
Vajracast’s priority chain handles this natively. The system runs all four inputs in hot standby, monitoring each one continuously. If the primary encoder crashes, the switch to Backup 1 happens in under 100ms. If the entire venue loses wired internet, the cellular path takes over — and because that cellular path is SRTLA-bonded across several modems, a single modem dropping out does not even register at the failover layer. If the entire cellular link fails (all modems dead), only then does failover escalate to the Bars & Tone fallback rather than a broken player.
We have been running 40+ routes in this configuration for live sports production, 24/7. The system has been tested in real conditions, not just lab environments. For a deeper look at failover architectures for sports production, see our live sports broadcasting guide.
24/7 Linear Channels
Channels that broadcast around the clock (news networks, music channels, religious programming) cannot afford any downtime. Unlike event-based production where there is a defined start and end, 24/7 channels must survive every possible failure scenario across weeks and months.
Typical configuration:
- Primary: SRT from the playout server
- Backup 1: SRT from a redundant playout server
- Backup 2: HTTP/TS pull from a pre-programmed playlist server
- Failover is combined with crash recovery. If the Vajracast process itself restarts, it rebuilds all routes automatically in under 5 seconds
The crash recovery feature is especially important here. In a 24/7 environment, the gateway must survive not just input failures but its own restarts (OS updates, process crashes, hardware maintenance). Vajracast’s process adoption system detects running FFmpeg processes after a restart and reconnects to them without interrupting the output streams.
Remote Production (REMI)
Remote production moves the production control room away from the venue. Camera feeds are sent over IP to a central facility where switching, graphics, and distribution happen. This model relies entirely on reliable transport, and failover is the safety net.
Typical configuration:
- Primary: SRT from each camera encoder at the venue
- Backup: an SRTLA-bonded cellular link per camera, feeding a secondary SRT input in the same priority chain (the bonding handles single-modem failures inside the link; failover handles the case where the cellular link is completely down)
- Return feed: SRT back to the venue for IFB (interruptible foldback) and confidence monitoring
In REMI workflows, every camera is an independent failover chain. Vajracast handles this by creating separate routes for each camera, each with its own priority chain and health monitoring. For real-world REMI deployment strategies including Starlink connectivity, see our remote production with SRT guide. The diagram view in the UI makes it straightforward to visualize and manage dozens of routes simultaneously.
Monitoring and Alerting for Failover Events
Failover that you cannot observe is failover you cannot trust. Effective monitoring has three layers:
Real-Time Dashboard
Vajracast’s web interface shows the status of every input in every route:
- Green: healthy, active
- Yellow: connected but degraded (high loss, low bitrate)
- Red: disconnected or failed
- Active indicator showing which input in the priority chain is currently feeding the output
The diagram view provides a visual map of all routes, with real-time status overlays on every connection.
Prometheus Metrics
Vajracast exposes 50+ metrics via a /metrics endpoint compatible with Prometheus. Failover-related metrics include:
```
vajracast_input_status{route="sports_main", input="primary"} 1
vajracast_input_status{route="sports_main", input="backup1"} 1
vajracast_failover_events_total{route="sports_main"} 3
vajracast_failover_last_timestamp{route="sports_main"} 1707523200
vajracast_input_bitrate_bps{route="sports_main", input="primary"} 8500000
vajracast_input_packet_loss{route="sports_main", input="primary"} 0.002
```
These metrics can be graphed in Grafana (pre-built dashboards are included) and used to trigger alerts via Alertmanager. For example: “Alert if any route has executed more than 2 failover events in the past hour.”
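That example alert could be expressed as a standard Prometheus alerting rule along these lines (metric name taken from the list above; thresholds and labels are illustrative):

```yaml
groups:
  - name: vajracast-failover
    rules:
      - alert: FrequentFailovers
        # more than 2 failover events on any route in the past hour
        expr: increase(vajracast_failover_events_total[1h]) > 2
        labels:
          severity: warning
        annotations:
          summary: "Route {{ $labels.route }} failed over {{ $value }} times in the last hour"
```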
Event Logging and Webhooks
Every failover event is logged with:
- Timestamp
- Route name
- Source input (which failed)
- Target input (which took over)
- Reason (timeout, packet loss threshold, bitrate drop, manual switch)
- Duration on backup before recovery
This log is invaluable for post-event analysis. If failover triggered during a broadcast, you can trace exactly what happened, when, and why.
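A webhook consumer would receive those same fields in a structured payload. The schema below is a hypothetical illustration of the shape, not Vajracast's documented format:

```json
{
  "event": "failover",
  "timestamp": "2025-06-01T14:32:07.412Z",
  "route": "sports_main",
  "from_input": "primary",
  "to_input": "backup1",
  "reason": "packet_loss_threshold",
  "details": { "loss": 0.18, "threshold": 0.15, "window_ms": 300 }
}
```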
Best Practices for Configuring Failover
1. Use Independent Network Paths
If your primary and backup inputs share the same network switch, ISP, or cable run, a single network failure takes out both. True redundancy requires independent paths:
- Different ISPs for primary and backup
- Different physical network interfaces
- Different cable runs (separate conduit)
- For cellular backup, different carriers
2. Test Your Failover Regularly
A failover system that has never been tested is not a failover system. It is a hope. Schedule regular failover drills:
- Pull the primary encoder’s network cable during a test stream
- Kill the encoder process and measure switch time
- Inject packet loss using network simulation tools (`tc netem` on Linux) to test threshold detection
- Verify that auto-recovery works when the primary comes back
Test under load. Failover behavior can differ when the system is handling 50 routes versus 2.
3. Tune Your Thresholds
Default thresholds are a starting point. Tune them based on your specific environment:
- Timeout too aggressive (e.g., 50ms): causes false failovers on momentary network jitter
- Timeout too conservative (e.g., 5 seconds): viewers see 5 seconds of broken video before the switch
- Recommended starting point: 200-500ms timeout, 10% packet loss threshold, 50% bitrate floor
Monitor your failover event log. If you see frequent failovers followed by immediate recovery, your thresholds are too aggressive.
4. Always Have a Bars & Tone Fallback in Last Position
The last input in your priority chain should be something that cannot fail. Vajracast ships a built-in Bars & Tone generator for exactly this purpose: a virtual input that produces a real MPEG-TS stream locally without any external source, network, or encoder. Since it runs on the server itself, it is always available. There is nothing to disconnect.
The generator is not a static image. It is a professional-grade test pattern signal:
- Video patterns: SMPTE bars (75% or 100% HD), PAL 100% bars, or FFmpeg `testsrc2` with moving elements
- Six presets covering 1080p25, 1080i50, 576i50 (8-channel audio), 720p50 with clock, 1080p25 HEVC, and 540p25 for low-bitrate scenarios
- Text overlay: burned-in company name, channel identifier, or custom message with configurable font, size, color, and position
- Clock overlay: server-time burn-in for lip-sync debugging and live-proof timestamping
- Frame-identifiable animation: square pulse or staircase pulse makes the pattern unambiguous frame by frame
- Logo overlay: PNG logo positioned in a corner
- Audio tone: 1 kHz, 400 Hz, 440 Hz, or silence; 2, 4, or 8 channels; configurable level from 0 to -20 dBFS; optional sweep mode that cycles the tone channel by channel
Beyond emergency fallback, the same generator handles several workflows that keep routes warm and reduce test friction:
- Downstream validation: spawn a Bars & Tone input before live production starts. Your HLS viewers, SRT callers, and multiviewers immediately receive a known-good signal. If the viewer sees nothing, the problem is downstream, not upstream
- Warm slot holding: while the OB van is preparing, the route stays live on bars. Remote decoders stay connected, no SRT disconnect to deal with. When the real signal arrives, a manual failover (Set Active) or a configured priority chain takes over cleanly
- Audio channel verification: sweep mode runs the tone across every channel so your downstream operator can confirm 5.1 or 7.1 cabling is correct
- Lip-sync measurement: the animated pattern combined with the tone lets you measure downstream audio/video offset visually
- Prospect demos: show a complete Vajracast workflow from a conference room with zero physical equipment
Configure it as priority N+1 in your failover chain. If every live input drops, viewers see an intentional test pattern with your channel name rather than a frozen frame or a broken player.
5. Monitor Your Backup Sources
A backup source that is offline when you need it is worthless. Hot standby monitoring is not just about readiness. It is about continuously validating that the backup is healthy. Vajracast monitors all inputs in a priority chain equally, whether they are active or on standby. If your backup goes down, you know immediately, not when the primary fails and the backup fails to take over.
6. Plan for Gateway-Level Redundancy
Failover protects against input failure. But what about gateway failure? For the highest reliability, run two Vajracast instances:
- Primary gateway handles all production routes
- Secondary gateway mirrors the configuration and can take over via DNS failover or load balancer health checks
- Both instances can use the same Docker/Kubernetes deployment infrastructure
How Vajracast Compares to Other Failover Solutions
| Feature | Vajracast | Hardware Switcher | Cloud Failover (AWS) | Manual Switching |
|---|---|---|---|---|
| Switching speed | <200ms | <50ms (frame-accurate) | 2-10s | 5-30s (human reaction) |
| Protocol support | SRT, RTMP, RTSP, HLS, SRTLA, UDP, HTTP | SDI/HDMI only | RTMP, HLS | Any |
| Inputs per chain | Unlimited | 2-4 (hardware dependent) | Varies | N/A |
| Monitoring | Built-in + Prometheus | Typically minimal | CloudWatch | None |
| Cost | Software license | $5,000-$50,000+ | Per-minute compute | Labor cost |
| Remote management | Full web UI + REST API | Limited or none | AWS Console/API | Physical presence |
| Scalability | 50+ routes per instance | 1 route per device | Elastic but expensive | Not scalable |
Hardware switchers excel at frame-accurate switching for SDI workflows but cannot handle IP-based multi-protocol environments. Cloud solutions introduce latency and per-minute costs that add up fast. Manual switching is inherently unreliable because it depends on a human being awake, alert, and fast.
Vajracast occupies the middle ground: software-defined, IP-native, multi-protocol, and automated, at a fraction of the cost of hardware or cloud alternatives.
Putting It All Together
For a real-world reference of a redundant Vajracast deployment with multi-input failover across two ingests and four restream regions, see the example deployment — annotated diagrams with hover details on every node.
A complete failover setup in Vajracast follows this structure:
- Define your route: one output destination (e.g., SRT push to CDN)
- Add primary input: your main encoder, highest priority
- Add backup inputs: in priority order, each on an independent path
- Add a static fallback: lowest priority, guaranteed availability
- Configure health thresholds: timeout, packet loss, bitrate floor
- Set recovery behavior: auto-recover with hold-off timer, or manual
- Connect monitoring: Prometheus scraping, Grafana dashboards, alerting
- Test everything: simulate failures before going live
With this configuration, your stream is protected against encoder failure, network outage, protocol issues, and even complete venue connectivity loss. The system handles it all automatically, silently, and reliably.
For a step-by-step setup guide, see SRT Streaming Setup: From Zero to Production. For the broader architecture of stream routing and distribution, continue to Live Stream Routing: The Complete Guide.
Next Steps
- Broadcast Hub: the central routing platform that manages failover across regions
- SRT Streaming Gateway: the complete guide to SRT-based video infrastructure
- Video Failover Best Practices: shorter, tactical guide to failover configuration
- SRT vs RTMP: understand the protocol trade-offs that affect failover performance
- Live Stream Routing: how to route, split, and manage video signals across your infrastructure
Managed cloud platform with dedicated servers, N+1 failover, hardware transcoding, and global delivery. Free for 30 days.
30 days free · No credit card · Direct access to the dev team
Frequently Asked Questions
What is video stream failover?
Video stream failover is an automatic mechanism that switches to a backup video source when the primary source fails, ensuring continuous streaming without interruption.
How fast should failover switching be?
Professional broadcast failover should switch in under 500ms. Vajracast achieves sub-50ms switchover by pre-buffering backup sources in hot standby, with total end-to-end failover (including detection) under 200ms.
Can I have multiple backup sources?
Yes. Vajracast supports N+1 redundancy with unlimited backup sources in a priority chain. Each source is independently monitored with configurable health thresholds.
Does failover work with different protocols?
Yes. A priority chain can mix SRT, RTMP, RTSP, HLS, UDP, and HTTP inputs. SRTLA-bonded inputs are also supported — the receiver deaggregates them into standard SRT before the routing engine sees them, so they behave like any other SRT input in the chain. The failover mechanism is the same regardless of input type.