Cascading Pipeline

The Cascading Pipeline component provides a flexible, modular approach to building AI agents by allowing you to mix and match different components for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), Voice Activity Detection (VAD), and Turn Detection.

Core Architecture

The pipeline is composed of five key stages that work in sequence to handle a conversation:

VAD (Voice Activity Detection) - Detects the presence of human speech in the audio stream to know when to start processing.
STT (Speech-to-Text) - Converts the detected speech from audio into a text transcript.
LLM (Large Language Model) - Takes the text transcript as input, processes it, and generates a meaningful response.
TTS (Text-to-Speech) - Converts the LLM's text response back into audible speech.
Turn Detection - Manages the back-and-forth of the conversation, determining when one speaker has finished and another can begin.
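The five stages above can be pictured as a chain of plain functions. The following is a toy sketch only, not SDK code: `ToyCascade`, `simple_vad`, the silence-count turn rule, and the lambda stand-ins are all hypothetical names chosen for illustration, and real VAD, STT, LLM, and TTS providers are far more involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def simple_vad(frame: List[float], threshold: float = 0.1) -> bool:
    """Toy VAD: a frame counts as speech if its mean absolute amplitude exceeds the threshold."""
    if not frame:
        return False
    return sum(abs(sample) for sample in frame) / len(frame) > threshold


@dataclass
class ToyCascade:
    """Hypothetical stand-in that wires the five stages together in order."""
    stt: Callable[[List[List[float]]], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    vad: Callable[[List[float]], bool] = simple_vad
    silence_frames_to_end_turn: int = 2  # toy turn detection rule (assumption)

    def run_turn(self, frames: List[List[float]]) -> Optional[bytes]:
        speech: List[List[float]] = []
        silent_run = 0
        for frame in frames:
            if self.vad(frame):               # VAD: is anyone speaking?
                speech.append(frame)
                silent_run = 0
            elif speech:                      # silence after speech has started
                silent_run += 1
                if silent_run >= self.silence_frames_to_end_turn:
                    break                     # turn detection: the speaker is done
        if not speech:
            return None                       # nothing to respond to
        transcript = self.stt(speech)         # STT: audio -> text
        reply = self.llm(transcript)          # LLM: text -> response text
        return self.tts(reply)                # TTS: response text -> audio bytes


cascade = ToyCascade(
    stt=lambda speech: f"{len(speech)} speech frames",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)
frames = [[0.0] * 4, [0.5] * 4, [0.6] * 4, [0.0] * 4, [0.0] * 4, [0.7] * 4]
audio = cascade.run_turn(frames)
print(audio)
```

In this sketch the last loud frame is never transcribed, because two silent frames have already ended the turn; it would belong to the next turn.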

[Figure: Cascading Pipeline Architecture]

Basic Usage

Simple Pipeline

Here is the most basic setup, combining STT, LLM, TTS, VAD, and Turn Detection components. The SDK uses default configurations when no specific settings are provided.

main.py
from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector()
)

Key Features:

Modular Component Selection - Choose different providers for each component
Flexible Configuration - Mix and match STT, LLM, TTS, VAD, and Turn Detection
Custom Processing - Add custom processing for STT and LLM outputs
Provider Agnostic - Support for multiple AI service providers
Advanced Control - Fine-tune each component independently

Advanced Configuration

You can fine-tune the behavior of each component by passing specific parameters during initialization.

main.py
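from videosdk.agents import CascadingPipeline
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

# Speech-to-Text: model, language, and transcript formatting
stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)

# Large Language Model: model choice and generation limits
llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)

# Text-to-Speech: synthesis model and voice
tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)

# Voice Activity Detection: speech threshold and silence window
vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)

# Turn Detection: end-of-turn threshold and minimum turn length
turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

# Assemble the pipeline from the configured components
pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)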

Dynamic Component Changes

The pipeline supports swapping components at runtime:

# Change components during runtime
await pipeline.change_component(
    stt=new_stt_provider,
    llm=new_llm_provider,
    tts=new_tts_provider
)
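To make the idea of a runtime swap concrete, here is a minimal sketch of how such a method could behave. It is an assumption, not the SDK's implementation: `SwappablePipeline` and `FakeProvider` are hypothetical stand-ins, and whether the real `change_component` accepts a partial set of arguments or closes the replaced component is not confirmed by this page.

```python
import asyncio
from typing import Any, Optional


class FakeProvider:
    """Hypothetical provider that only records whether it was closed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


class SwappablePipeline:
    """Hypothetical stand-in; NOT the SDK's CascadingPipeline."""

    def __init__(self, stt: Any, llm: Any, tts: Any) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    async def change_component(
        self,
        stt: Optional[Any] = None,
        llm: Optional[Any] = None,
        tts: Optional[Any] = None,
    ) -> None:
        # Replace only the components that were passed in (assumed behavior).
        for name, new in (("stt", stt), ("llm", llm), ("tts", tts)):
            if new is None:
                continue
            old = getattr(self, name)
            setattr(self, name, new)
            # Release the replaced component if it exposes aclose().
            closer = getattr(old, "aclose", None)
            if closer is not None:
                await closer()


async def main():
    old_tts = FakeProvider("tts-a")
    pipeline = SwappablePipeline(FakeProvider("stt-a"), FakeProvider("llm-a"), old_tts)
    await pipeline.change_component(tts=FakeProvider("tts-b"))
    return pipeline, old_tts


pipeline, old_tts = asyncio.run(main())
print(pipeline.stt.name, pipeline.tts.name, old_tts.closed)
```

Swapping the reference before closing the old component means a concurrent caller never sees a closed provider; a production version would likely also need to wait for in-flight requests.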

Plugin Ecosystem

Multiple plugins are available for STT, LLM, and TTS. Check them out here:

STT

Learn more about other STT plugins

LLM

Learn more about other LLM plugins

TTS

Learn more about other TTS plugins

Plugin Development

Creating Custom Plugins

To create custom plugins, follow the plugin development guide.

Key requirements include:

Inherit from the correct base class (STT, LLM, or TTS)
Implement all abstract methods
Handle errors consistently using self.emit("error", message)
Clean up resources in the aclose() method
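The requirements above can be sketched with a self-contained example. Everything here is a stand-in: `BaseSTT`, its `on`/`emit` helpers, and the `process_audio` method name are assumptions made so the snippet runs without the SDK. The real base classes and their abstract methods are defined in the plugin development guide.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class BaseSTT(ABC):
    """Stand-in base class (assumption); the real one is provided by the SDK."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable]] = {}

    def on(self, event: str, handler: Callable) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, *args) -> None:
        for handler in self._handlers.get(event, []):
            handler(*args)

    @abstractmethod
    async def process_audio(self, audio: bytes) -> str:
        """Hypothetical abstract method: turn audio bytes into text."""

    @abstractmethod
    async def aclose(self) -> None:
        """Release any resources held by the plugin."""


class EchoSTT(BaseSTT):
    """Toy plugin that 'transcribes' by decoding the bytes as UTF-8."""

    def __init__(self) -> None:
        super().__init__()
        self._open = True

    async def process_audio(self, audio: bytes) -> str:
        try:
            if not self._open:
                raise RuntimeError("STT is closed")
            return audio.decode("utf-8")
        except Exception as exc:
            # Report failures the same way everywhere: emit("error", message).
            self.emit("error", str(exc))
            return ""

    async def aclose(self) -> None:
        self._open = False  # a real plugin would close sockets or sessions here


async def demo():
    errors: List[str] = []
    stt = EchoSTT()
    stt.on("error", errors.append)
    text = await stt.process_audio(b"hello")
    await stt.aclose()
    await stt.process_audio(b"ignored")  # fails after close and emits an error
    return text, errors


text, errors = asyncio.run(demo())
print(text, errors)
```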

Plugin Installation

Install additional plugins as needed:

# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

Best Practices

Component Selection: Choose providers based on your specific requirements (latency, quality, cost)
Error Handling: Implement proper error handling and fallback strategies
Resource Management: Use the cleanup() method to properly close components
Configuration Monitoring: Use get_component_configs() for debugging and monitoring
Audio Format: Ensure your custom plugins handle the 48kHz audio format correctly
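As one possible reading of the error-handling advice above, a fallback strategy can try providers in order and move on when one fails or times out. This is a generic sketch using plain coroutines: `synthesize_with_fallback`, `flaky_tts`, and `backup_tts` are hypothetical names, not SDK functions, and the 2-second timeout is an arbitrary placeholder.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Provider = Callable[[str], Awaitable[bytes]]


async def synthesize_with_fallback(
    text: str,
    providers: Sequence[Provider],
    timeout: float = 2.0,
) -> bytes:
    """Return audio from the first provider that succeeds within the timeout."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return await asyncio.wait_for(provider(text), timeout=timeout)
        except Exception as exc:  # includes timeouts; task cancellation still propagates
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


async def flaky_tts(text: str) -> bytes:
    raise ConnectionError("primary provider is down")


async def backup_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = asyncio.run(synthesize_with_fallback("hi", [flaky_tts, backup_tts]))
print(result)
```

Ordering the list from preferred to last-resort keeps the policy in one place, so changing the fallback order does not touch the calling code.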

Key Benefits

The Cascading Pipeline offers several advantages over integrated solutions:

Multi-language Support - Use specialized STT for different languages
Cost Optimization - Mix premium and cost-effective services
Custom Voice Processing - Add domain-specific processing logic
Performance Optimization - Choose fastest providers for each component
Compliance Requirements - Use specific providers for regulatory compliance

Comparison with Realtime Pipeline

Feature	Cascading Pipeline	Realtime Pipeline
Control	Maximum control over each component	Integrated model control
Flexibility	Mix different providers	Single model provider
Latency	Higher due to sequential processing	Lower with streaming
Customization	Extensive customization options	Limited to model capabilities
Complexity	More complex configuration	Simpler setup

The Cascading Pipeline is ideal when you need maximum flexibility and control over each processing stage, while the Realtime Pipeline is better for low-latency applications with integrated model providers.
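Because the cascading stages run one after another, the response time of a turn is roughly the sum of the per-stage times. A small timing helper like the one below can show which stage dominates. It is a sketch with stand-in stage functions: `timed_stages` is a hypothetical helper rather than an SDK feature, and it does not measure real providers.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def timed_stages(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run stages in sequence, feeding each output to the next, and time each one."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Stand-in stages; real STT, LLM, and TTS calls would go here.
stages: List[Stage] = [
    ("stt", lambda audio: audio.decode("utf-8")),
    ("llm", lambda text: text.upper()),
    ("tts", lambda text: text.encode("utf-8")),
]

reply, timings = timed_stages(stages, b"hello")
total = sum(timings.values())
slowest = max(timings, key=timings.get)
print(reply, f"total={total:.6f}s", f"slowest={slowest}")
```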

Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to fit your needs.

Basic Implementation

Check out the cascading pipeline implementation

Got a question? Ask us on Discord.
