forked from Selig/openclaw-skill
6 custom skills (assign-task, dispatch-webhook, daily-briefing, task-capture, qmd-brain, tts-voice) with technical documentation. Compatible with Claude Code, OpenClaw, Codex CLI, and OpenCode.
# Audio and Voice Notes Documentation
## Overview
OpenClaw supports audio transcription with flexible configuration options. The system automatically detects available transcription tools or allows explicit provider/CLI setup.
## Key Capabilities
When audio understanding is enabled, OpenClaw locates the first audio attachment (local path or URL) and downloads it if needed before processing through configured models in sequence until one succeeds.
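The fallback behavior can be sketched in miniature. The `transcribers` list and the exception handling below are illustrative assumptions, not OpenClaw's actual internals; only the "try each model in order until one succeeds" semantics comes from the documentation.

```python
def transcribe_with_fallback(audio_path, transcribers):
    """Try each configured transcriber in order; return the first transcript.

    `transcribers` is an ordered list of callables that take an audio path
    and return text, or raise on failure (a hypothetical shape, for
    illustration only).
    """
    errors = []
    for transcriber in transcribers:
        try:
            return transcriber(audio_path)
        except Exception as exc:
            # A failed model simply advances the chain to the next entry.
            errors.append(exc)
    raise RuntimeError(f"all transcribers failed: {errors}")
```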
## Auto-Detection Hierarchy
Without custom configuration, the system attempts transcription in this order:
- Local CLI tools (sherpa-onnx-offline, whisper-cli, whisper Python CLI)
- Gemini CLI
- Provider APIs (OpenAI, Groq, Deepgram, Google)
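Detecting a local CLI amounts to a first-hit scan over an ordered candidate list. The sketch below covers only the local-CLI step of the hierarchy (provider API detection is omitted), and the candidate names are taken from the list above:

```python
import shutil

# Ordered local candidates, mirroring the documented detection order.
LOCAL_CANDIDATES = ["sherpa-onnx-offline", "whisper-cli", "whisper"]

def first_available(candidates=LOCAL_CANDIDATES, which=shutil.which):
    """Return the first candidate found on PATH, or None if none exist.

    `which` is injectable for testing; by default it checks the real PATH.
    """
    for name in candidates:
        if which(name):
            return name
    return None
```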
## Configuration Options
Three configuration patterns are provided:
1. **Provider with CLI fallback** – Uses OpenAI with Whisper CLI as backup
2. **Provider-only with scope gating** – Restricts to specific chat contexts (e.g., denying group chats)
3. **Single provider** – Deepgram example for dedicated service use
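A sketch of the first pattern (provider with CLI fallback) might look like the following. The key names `models`, `provider`, and `cli` are assumptions for illustration and may not match OpenClaw's actual schema; consult the project's configuration reference for the real shape.

```json
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          { "provider": "openai" },
          { "cli": "whisper-cli" }
        ]
      }
    }
  }
}
```

Entries earlier in the list are tried first, so the CLI only runs when the provider call fails.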
## Important Constraints
The default audio size cap is 20 MB (`tools.media.audio.maxBytes`). Audio exceeding the cap is skipped for that model, and the next entry in the chain is tried.
Authentication follows standard model auth patterns. The transcript output is available as `{{Transcript}}` for downstream processing, with optional character trimming via `maxChars`.
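For instance, trimming might be configured alongside the other audio settings. Only `maxChars` itself comes from the documentation above; its exact placement in the config tree is an assumption by analogy with `tools.media.audio.maxBytes`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "maxChars": 4000
      }
    }
  }
}
```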
## Notable Gotchas
Keep these caveats in mind:

- Scope rules use first-match evaluation.
- CLI commands must exit cleanly with plain text output.
- Timeouts should be kept reasonable so transcription does not block the reply queue.
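First-match evaluation can be illustrated in miniature. The rule shape below (pattern/allowed pairs with a `*` wildcard and deny-by-default) is hypothetical; only the first-match semantics mirrors the documented behavior.

```python
def scope_allows(chat_type, rules):
    """Return the decision of the first rule whose pattern matches.

    Each rule is a (pattern, allowed) pair; "*" matches any chat type.
    Later rules are never consulted once one matches (first-match wins).
    """
    for pattern, allowed in rules:
        if pattern == "*" or pattern == chat_type:
            return allowed
    return False  # no rule matched: deny by default (an assumption)
```

Ordering therefore matters: a deny rule for group chats must appear before a catch-all allow, or it will never fire.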