Cascading Pipeline
The Cascading Pipeline component provides a flexible, modular approach to building AI agents by allowing you to mix and match different components for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), Voice Activity Detection (VAD), and Turn Detection.
Core Architecture
The pipeline is composed of five key stages that work in sequence to handle a conversation:
- VAD (Voice Activity Detection) - Detects the presence of human speech in the audio stream to know when to start processing.
- STT (Speech-to-Text) - Converts the detected speech from audio into a text transcript.
- LLM (Large Language Model) - Takes the text transcript as input, processes it, and generates a meaningful response.
- TTS (Text-to-Speech) - Converts the LLM's text response back into audible speech.
- Turn Detection - Manages the back-and-forth of the conversation, determining when one speaker has finished and another can begin.
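The flow through these stages can be sketched with plain Python callables. This is a toy model and not the SDK: `ToyPipeline` and every stub passed to it are hypothetical, and running the turn check before the LLM call is an assumption of this sketch rather than something the SDK documents here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyPipeline:
    """Stand-in for the five stages; every callable is a hypothetical stub."""
    vad: Callable[[bytes], bool]       # is there speech in this chunk?
    stt: Callable[[bytes], str]        # audio chunk -> transcript fragment
    turn_done: Callable[[str], bool]   # has the speaker finished the turn?
    llm: Callable[[str], str]          # transcript -> reply text
    tts: Callable[[str], bytes]        # reply text -> audio

    def run(self, chunks: List[bytes]) -> List[bytes]:
        replies: List[bytes] = []
        transcript = ""
        for chunk in chunks:
            if not self.vad(chunk):
                continue               # skip chunks without speech
            transcript = (transcript + " " + self.stt(chunk)).strip()
            if self.turn_done(transcript):
                replies.append(self.tts(self.llm(transcript)))
                transcript = ""        # start collecting the next turn
        return replies


pipeline = ToyPipeline(
    vad=lambda chunk: len(chunk) > 0,
    stt=lambda chunk: chunk.decode("utf-8"),
    turn_done=lambda text: text.endswith("?"),
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)

audio_out = pipeline.run([b"what is", b"", b"the time?"])
print(audio_out)
```

The empty chunk is dropped by the VAD stub, the two spoken chunks are merged into one transcript, and a reply is produced only once the turn check passes.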
Basic Usage
Simple Pipeline
Here is a basic setup that combines STT, LLM, and TTS components with VAD and turn detection. The SDK uses default configurations when no specific settings are provided.
from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector()
)
Key Features:
- Modular Component Selection - Choose different providers for each component
- Flexible Configuration - Mix and match STT, LLM, TTS, VAD, and Turn Detection
- Custom Processing - Add custom processing for STT and LLM outputs
- Provider Agnostic - Support for multiple AI service providers
- Advanced Control - Fine-tune each component independently
Advanced Configuration
You can fine-tune the behavior of each component by passing specific parameters during initialization.
from videosdk.agents import CascadingPipeline
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

# Configure each component separately, then pass them to the pipeline.
stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)
llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)
tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)
vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)
turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)
Dynamic Component Changes
The pipeline supports swapping components at runtime:
# Change components during runtime
await pipeline.change_component(
    stt=new_stt_provider,
    llm=new_llm_provider,
    tts=new_tts_provider
)
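To make the idea of a runtime swap concrete, here is a self-contained toy model. It is not the SDK's implementation: `ToyPipeline` and `ToyComponent` are hypothetical stand-ins, and the behavior shown (slots that are not passed stay untouched, and the old provider is closed after the new one is installed) is an assumption of this sketch that should be checked against the SDK.

```python
import asyncio
from typing import Optional


class ToyComponent:
    """Hypothetical provider stub that records whether it was closed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


class ToyPipeline:
    """Not the SDK class: a stand-in that models partial swapping."""

    def __init__(self, stt: ToyComponent, llm: ToyComponent, tts: ToyComponent) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    async def change_component(
        self,
        stt: Optional[ToyComponent] = None,
        llm: Optional[ToyComponent] = None,
        tts: Optional[ToyComponent] = None,
    ) -> None:
        for slot, new in (("stt", stt), ("llm", llm), ("tts", tts)):
            if new is None:
                continue               # assumption: omitted slots stay as they are
            old = getattr(self, slot)
            setattr(self, slot, new)   # install the replacement first
            await old.aclose()         # then release the old provider


async def main() -> ToyPipeline:
    pipeline = ToyPipeline(ToyComponent("stt-a"), ToyComponent("llm-a"), ToyComponent("tts-a"))
    await pipeline.change_component(tts=ToyComponent("tts-b"))
    return pipeline


swapped = asyncio.run(main())
print(swapped.stt.name, swapped.llm.name, swapped.tts.name)
```

Installing the replacement before closing the old provider means the slot is never left empty in this toy model.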
Plugin Ecosystem
There are multiple plugins available for STT, LLM, and TTS:
- STT: Learn more about other STT plugins
- LLM: Learn more about other LLM plugins
- TTS: Learn more about other TTS plugins
Plugin Development
Creating Custom Plugins
To create custom plugins, follow the plugin development guide.
Key requirements include:
- Inherit from the correct base class (STT, LLM, or TTS)
- Implement all abstract methods
- Handle errors consistently using self.emit("error", message)
- Clean up resources in the aclose() method
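The requirements above can be illustrated with a minimal sketch. The base class here, `ToyTTSBase`, is a hypothetical stand-in written for this example. The real SDK base classes, their abstract method names (such as `synthesize` below), and their event API may differ, so treat the plugin development guide as the authority.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class ToyTTSBase(ABC):
    """Stand-in for a plugin base class; the real SDK base class may differ."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], None]]] = {}

    def on(self, event: str, handler: Callable[[str], None]) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, message: str) -> None:
        for handler in self._handlers.get(event, []):
            handler(message)

    @abstractmethod
    async def synthesize(self, text: str) -> bytes: ...

    @abstractmethod
    async def aclose(self) -> None: ...


class EchoTTS(ToyTTSBase):
    """Hypothetical plugin that 'synthesizes' by encoding text as bytes."""

    def __init__(self) -> None:
        super().__init__()
        self.session_open = True

    async def synthesize(self, text: str) -> bytes:
        if not text:
            # Report the problem through the error event instead of raising.
            self.emit("error", "EchoTTS: empty text")
            return b""
        return text.encode("utf-8")

    async def aclose(self) -> None:
        self.session_open = False      # release connections or buffers here


async def demo():
    errors: List[str] = []
    tts = EchoTTS()
    tts.on("error", errors.append)
    audio = await tts.synthesize("hello")
    await tts.synthesize("")           # triggers the error event
    await tts.aclose()
    return audio, errors, tts.session_open


audio, errors, session_open = asyncio.run(demo())
print(audio, errors, session_open)
```

Because both methods are abstract in the stand-in base class, a subclass that forgets either one cannot be instantiated, which mirrors the "implement all abstract methods" requirement.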
Plugin Installation
Install additional plugins as needed:
# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram
Best Practices
- Component Selection: Choose providers based on your specific requirements (latency, quality, cost)
- Error Handling: Implement proper error handling and fallback strategies
- Resource Management: Use the cleanup() method to properly close components
- Configuration Monitoring: Use get_component_configs() for debugging and monitoring
- Audio Format: Ensure your custom plugins handle the 48kHz audio format correctly
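One possible shape for the fallback strategy mentioned under Error Handling is to try providers in order and use the first one that succeeds. The sketch below is independent of the SDK: `transcribe_with_fallback`, `flaky_primary`, and `steady_backup` are hypothetical names, and the providers are stubs rather than real STT clients.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Transcriber = Callable[[bytes], Awaitable[str]]


async def flaky_primary(audio: bytes) -> str:
    """Stub provider that always fails, simulating an outage."""
    raise ConnectionError("primary STT unavailable")


async def steady_backup(audio: bytes) -> str:
    """Stub provider that 'transcribes' by decoding the bytes."""
    return audio.decode("utf-8")


async def transcribe_with_fallback(audio: bytes, providers: Sequence[Transcriber]) -> str:
    """Return the first successful transcript, trying providers in order."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return await provider(audio)
        except Exception as exc:       # remember the failure, try the next one
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


text = asyncio.run(transcribe_with_fallback(b"hello", [flaky_primary, steady_backup]))
print(text)
```

In a real agent the same pattern could be paired with a runtime component swap once the primary provider has failed repeatedly, so later turns skip it entirely.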
Key Benefits
The Cascading Pipeline offers several advantages over integrated solutions:
- Multi-language Support - Use specialized STT for different languages
- Cost Optimization - Mix premium and cost-effective services
- Custom Voice Processing - Add domain-specific processing logic
- Performance Optimization - Choose the fastest provider for each component
- Compliance Requirements - Use specific providers for regulatory compliance
Comparison with Realtime Pipeline
Feature | Cascading Pipeline | Realtime Pipeline |
---|---|---|
Control | Maximum control over each component | Integrated model control |
Flexibility | Mix different providers | Single model provider |
Latency | Higher due to sequential processing | Lower with streaming |
Customization | Extensive customization options | Limited to model capabilities |
Complexity | More complex configuration | Simpler setup |
The Cascading Pipeline is ideal when you need maximum flexibility and control over each processing stage, while the Realtime Pipeline is better for low-latency applications with integrated model providers.