
Overview

The VideoSDK AI Agent SDK provides a powerful framework for building AI agents that can participate in real-time conversations. This guide explains the core components and demonstrates how to create a complete agentic workflow. The SDK serves as a real-time bridge between AI models and your users, facilitating seamless voice and media interactions.

Architecture

The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. You can use a direct Realtime Pipeline for speech-to-speech, or a Cascading Pipeline with a Conversation Flow for modular STT-LLM-TTS control.

  1. Agent - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
  2. Pipeline - This component manages the real-time flow of audio and data between the user and the AI models. The SDK offers two types of pipelines:
    • Realtime Pipeline - A speech-to-speech pipeline: audio goes directly to and from a realtime model, so there is no speech-to-text or text-to-speech conversion and no separate LLM to configure in between.
    • Cascading Pipeline - The traditional STT-LLM-TTS pipeline, which gives you the flexibility to mix and match different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS).
  3. Agent Session - This component brings together the agent, pipeline, and conversation flow to manage the agent's lifecycle within a VideoSDK meeting.
  4. Conversation Flow - This inheritable class works with the CascadingPipeline to let you define custom turn-taking logic and preprocess transcripts.
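
To make the relationship between these four components concrete, here is a minimal, self-contained sketch in plain Python. It does not use the SDK: every class name (ToyAgent, ToyConversationFlow, ToyCascadingPipeline, ToySession) and the fake STT/LLM/TTS callables are hypothetical stand-ins chosen for illustration, and the real classes will have different constructors and methods. The sketch only shows the shape of the cascading flow: audio in, transcript preprocessing, LLM reply, audio out.

```python
from dataclasses import dataclass
from typing import Callable

# NOTE: all names here are hypothetical stand-ins, not the VideoSDK API.


@dataclass
class ToyAgent:
    """Holds the agent's identity: here, just its instructions."""
    instructions: str


class ToyConversationFlow:
    """Hook that preprocesses a transcript before it reaches the LLM stage."""

    def preprocess(self, transcript: str) -> str:
        # Collapse runs of whitespace and trim the ends.
        return " ".join(transcript.split())


@dataclass
class ToyCascadingPipeline:
    """Chains three swappable stages: STT -> LLM -> TTS."""
    stt: Callable[[bytes], str]
    llm: Callable[[str, str], str]
    tts: Callable[[str], bytes]

    def run(self, audio: bytes, agent: ToyAgent, flow: ToyConversationFlow) -> bytes:
        transcript = flow.preprocess(self.stt(audio))
        reply = self.llm(agent.instructions, transcript)
        return self.tts(reply)


@dataclass
class ToySession:
    """Brings the agent, pipeline, and conversation flow together."""
    agent: ToyAgent
    pipeline: ToyCascadingPipeline
    flow: ToyConversationFlow

    def handle_turn(self, audio: bytes) -> bytes:
        return self.pipeline.run(audio, self.agent, self.flow)


def build_demo_session() -> ToySession:
    """Wire the session with fake providers that treat text as 'audio'."""
    pipeline = ToyCascadingPipeline(
        stt=lambda audio: audio.decode("utf-8"),
        llm=lambda instructions, text: f"[{instructions}] You said: {text}",
        tts=lambda text: text.encode("utf-8"),
    )
    return ToySession(ToyAgent("Be brief."), pipeline, ToyConversationFlow())


if __name__ == "__main__":
    session = build_demo_session()
    print(session.handle_turn(b"  hello   there "))
```

Because each stage is just a callable, swapping one provider means replacing one argument, which is the flexibility the cascading design is meant to offer. A realtime pipeline would replace the three stages with a single audio-to-audio call.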

Supporting Components

These components work behind the scenes to support the core functionality of the AI Agent SDK:

  • Execution & Lifecycle Management

    • JobContext - Provides the execution environment and lifecycle management for AI agents. It encapsulates the context in which an agent job is running.

    • WorkerJob - Manages the execution of jobs and worker processes using Python's multiprocessing, allowing for concurrent agent operations.

  • Configuration & Settings

    • RoomOptions - This allows you to configure the behavior of the session, including room settings and other advanced features for the agent's interaction within a meeting.

    • Options - This is used to configure the behavior of the worker, including logging and other execution settings.

  • External Integration

    • MCP Servers - These enable the integration of external tools through either stdio or HTTP transport.
      • MCPServerStdio - Facilitates direct process communication for local Python scripts.
      • MCPServerHTTP - Enables HTTP-based communication for remote servers and services.
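
As a rough illustration of what the stdio transport carries, the sketch below frames a request the way the Model Context Protocol's stdio transport does: one JSON-RPC 2.0 message per line, with no embedded newlines. This assumes "MCP" here refers to the Model Context Protocol. The helper names encode_request and decode_line are hypothetical and are not part of the SDK; MCPServerStdio presumably handles this framing for you, so treat this as background rather than something you need to write.

```python
import json
from typing import Any, Dict, Optional


def encode_request(
    request_id: int, method: str, params: Optional[Dict[str, Any]] = None
) -> bytes:
    """Serialize one JSON-RPC 2.0 request as a single newline-terminated line."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_line(line: bytes) -> Dict[str, Any]:
    """Parse one newline-delimited JSON-RPC message back into a dict."""
    return json.loads(line.decode("utf-8"))


if __name__ == "__main__":
    # "tools/list" is the MCP method a client uses to discover a server's tools.
    wire = encode_request(1, "tools/list")
    print(wire)
    print(decode_line(wire))
```

With stdio, lines like these are written to a local child process's standard input and responses are read from its standard output. With HTTP, comparable JSON-RPC messages travel in request and response bodies instead, which is why that transport suits remote servers.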

Advanced Features

The AI Agent SDK also includes a range of advanced features for building sophisticated conversational agents.

Examples - Try It Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to suit your needs.
