
Cascading Pipeline

The Cascading Pipeline component provides a flexible, modular approach to building AI agents by allowing you to mix and match different components for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), Voice Activity Detection (VAD), and Turn Detection.

Core Architecture

The pipeline is composed of five key stages that work in sequence to handle a conversation:

  • VAD (Voice Activity Detection) - Detects the presence of human speech in the audio stream to know when to start processing.
  • STT (Speech-to-Text) - Converts the detected speech from audio into a text transcript.
  • LLM (Large Language Model) - Takes the text transcript as input, processes it, and generates a meaningful response.
  • TTS (Text-to-Speech) - Converts the LLM's text response back into audible speech.
  • Turn Detection - Manages the back-and-forth of the conversation, determining when one speaker has finished and another can begin.
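The flow through these stages can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: `Turn`, `run_turn`, and the lambda stand-ins are hypothetical names, and placing turn detection between STT and LLM is an assumption about how the stages interact.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    """One user turn flowing through the stages (hypothetical container)."""
    audio: bytes
    transcript: str = ""
    reply_text: str = ""
    reply_audio: bytes = b""


def run_turn(
    audio: bytes,
    vad: Callable[[bytes], bool],
    stt: Callable[[bytes], str],
    turn_done: Callable[[str], bool],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Optional[Turn]:
    """Pass one chunk of audio through the stages in order."""
    if not vad(audio):                       # VAD: skip chunks with no speech
        return None
    turn = Turn(audio=audio)
    turn.transcript = stt(audio)             # STT: audio -> text
    if not turn_done(turn.transcript):       # Turn detection: user not finished yet
        return None
    turn.reply_text = llm(turn.transcript)   # LLM: transcript -> response text
    turn.reply_audio = tts(turn.reply_text)  # TTS: response text -> audio
    return turn


# Toy stand-ins so the sketch runs without any provider SDK.
result = run_turn(
    b"\x01\x02",
    vad=lambda a: len(a) > 0,
    stt=lambda a: "hello there.",
    turn_done=lambda t: t.endswith("."),
    llm=lambda t: f"You said: {t}",
    tts=lambda t: t.encode("utf-8"),
)
```

Because each stage is just a callable with a narrow input and output, any one of them can be replaced without touching the others, which is the property the pipeline relies on.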

Cascading Pipeline Architecture

Basic Usage

Simple Pipeline

Here is the most basic setup, combining STT, LLM, and TTS components with VAD and turn detection. Each component uses its default configuration when no specific settings are provided.

from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector()
)

Key Features:

  • Modular Component Selection - Choose different providers for each component
  • Flexible Configuration - Mix and match STT, LLM, TTS, VAD, and Turn Detection
  • Custom Processing - Add custom processing for STT and LLM outputs
  • Provider Agnostic - Support for multiple AI service providers
  • Advanced Control - Fine-tune each component independently

Advanced Configuration

You can fine-tune the behavior of each component by passing specific parameters during initialization.

from videosdk.agents import CascadingPipeline
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

# Speech-to-Text: model, language, and transcript options
stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)

# LLM: model choice and generation limits
llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)

# Text-to-Speech: synthesis model and voice
tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)

# Voice Activity Detection: speech threshold and silence window
vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)

# Turn Detection: end-of-turn threshold and minimum turn length
turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)

Dynamic Component Changes

The pipeline supports swapping components at runtime:

# Change components during runtime (call from inside an async function)
await pipeline.change_component(
    stt=new_stt_provider,
    llm=new_llm_provider,
    tts=new_tts_provider
)
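To make the swapping pattern concrete without depending on the SDK, here is a small stand-in. `SketchPipeline` is a hypothetical class, not the real `CascadingPipeline`, and it assumes that only the components passed to `change_component` are replaced while the rest stay as they are; check the SDK reference before relying on that behavior.

```python
import asyncio
from typing import Any, Optional


class SketchPipeline:
    """Hypothetical stand-in, NOT the SDK's CascadingPipeline."""

    def __init__(self, stt: Any, llm: Any, tts: Any) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    async def change_component(
        self,
        stt: Optional[Any] = None,
        llm: Optional[Any] = None,
        tts: Optional[Any] = None,
    ) -> None:
        # Assumption: only the components that are passed get replaced.
        if stt is not None:
            self.stt = stt
        if llm is not None:
            self.llm = llm
        if tts is not None:
            self.tts = tts


async def main() -> SketchPipeline:
    pipeline = SketchPipeline(stt="stt-a", llm="llm-a", tts="tts-a")
    # Swap only the TTS provider, e.g. to switch voices mid-session.
    await pipeline.change_component(tts="tts-b")
    return pipeline


pipeline = asyncio.run(main())
```

The strings stand in for provider instances; in real code they would be plugin objects such as `DeepgramSTT()` or `ElevenLabsTTS()`.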

Plugin Ecosystem

Multiple plugins are available for STT, LLM, and TTS.

Plugin Development

Creating Custom Plugins

To create custom plugins, follow the plugin development guide.

Key requirements include:

  • Inherit from the correct base class (STT, LLM, or TTS)
  • Implement all abstract methods
  • Handle errors consistently using self.emit("error", message)
  • Clean up resources in the aclose() method
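The requirements above can be illustrated with a toy plugin. This is a sketch under stated assumptions: `SketchSTTBase` is a hypothetical stand-in for the SDK's STT base class, and the `process_audio` and `on` names are invented for illustration. Only `self.emit("error", message)` and `aclose()` come from the requirements listed here.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class SketchSTTBase(ABC):
    """Hypothetical stand-in for the SDK's STT base class."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], None]]] = {}

    def on(self, event: str, handler: Callable[[str], None]) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, message: str) -> None:
        for handler in self._handlers.get(event, []):
            handler(message)

    @abstractmethod
    async def process_audio(self, audio: bytes) -> str: ...

    @abstractmethod
    async def aclose(self) -> None: ...


class UppercaseSTT(SketchSTTBase):
    """Toy plugin: 'transcribes' by decoding bytes and upper-casing them."""

    def __init__(self) -> None:
        super().__init__()
        self.closed = False

    async def process_audio(self, audio: bytes) -> str:
        try:
            return audio.decode("utf-8").upper()
        except UnicodeDecodeError as exc:
            # Report failures through the error event instead of raising.
            self.emit("error", f"decode failed: {exc}")
            return ""

    async def aclose(self) -> None:
        # A real plugin would release sockets and buffers here.
        self.closed = True


async def demo():
    errors: List[str] = []
    stt = UppercaseSTT()
    stt.on("error", errors.append)
    try:
        good = await stt.process_audio(b"hello")
        bad = await stt.process_audio(b"\xff\xfe")  # invalid UTF-8
    finally:
        await stt.aclose()  # always clean up, even after a failure
    return good, bad, errors, stt.closed


good, bad, errors, closed = asyncio.run(demo())
```

Routing every failure through a single error event gives the pipeline one place to observe failures, whichever provider sits behind the plugin.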

Plugin Installation

Install additional plugins as needed:

# Install specific provider plugins  
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

Best Practices

  1. Component Selection: Choose providers based on your specific requirements (latency, quality, cost)
  2. Error Handling: Implement proper error handling and fallback strategies
  3. Resource Management: Use the cleanup() method to properly close components
  4. Configuration Monitoring: Use get_component_configs() for debugging and monitoring
  5. Audio Format: Ensure your custom plugins handle the 48kHz audio format correctly
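As one way to approach the fallback strategy in point 2, the sketch below tries providers in order and returns the first successful result. `synthesize_with_fallback`, `flaky_tts`, and `backup_tts` are hypothetical helpers written for this example; they are not part of the SDK, and a real setup would wrap actual plugin calls instead.

```python
from typing import Callable, List, Sequence, Tuple


def synthesize_with_fallback(
    text: str,
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (name, synthesize) pair in order; return the first success."""
    failures: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


def flaky_tts(text: str) -> bytes:
    """Stand-in for a primary provider that is currently failing."""
    raise TimeoutError("primary timed out")


def backup_tts(text: str) -> bytes:
    """Stand-in for a secondary provider that works."""
    return text.encode("utf-8")


used, audio = synthesize_with_fallback(
    "hi", [("primary", flaky_tts), ("backup", backup_tts)]
)
```

The same ordering idea applies to STT and LLM providers. Collecting the failure messages keeps the final error informative when every provider fails.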

Key Benefits

The Cascading Pipeline offers several advantages over integrated solutions:

  • Multi-language Support - Use specialized STT for different languages
  • Cost Optimization - Mix premium and cost-effective services
  • Custom Voice Processing - Add domain-specific processing logic
  • Performance Optimization - Choose the fastest provider for each component
  • Compliance Requirements - Use specific providers for regulatory compliance

Comparison with Realtime Pipeline

Feature       | Cascading Pipeline                   | Realtime Pipeline
--------------|--------------------------------------|------------------------------
Control       | Maximum control over each component  | Integrated model control
Flexibility   | Mix different providers              | Single model provider
Latency       | Higher due to sequential processing  | Lower with streaming
Customization | Extensive customization options      | Limited to model capabilities
Complexity    | More complex configuration           | Simpler setup

The Cascading Pipeline is ideal when you need maximum flexibility and control over each processing stage, while the Realtime Pipeline is better for low-latency applications with integrated model providers.
