forked from Selig/openclaw-skill
6 custom skills (assign-task, dispatch-webhook, daily-briefing, task-capture, qmd-brain, tts-voice) with technical documentation. Compatible with Claude Code, OpenClaw, Codex CLI, and OpenCode.
# Audio and Voice Notes Documentation
## Overview
OpenClaw supports audio transcription with flexible configuration options. The system automatically detects available transcription tools or allows explicit provider/CLI setup.
## Key Capabilities
When audio understanding is enabled, OpenClaw locates the first audio attachment (local path or URL) and downloads it if needed before processing through configured models in sequence until one succeeds.
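The fallback behavior can be sketched in miniature. The `transcribers` list and the exception handling below are illustrative assumptions, not OpenClaw's actual internals; only the "try each model in order until one succeeds" semantics comes from the documentation.

```python
def transcribe_with_fallback(audio_path, transcribers):
    """Try each configured transcriber in order; return the first transcript.

    `transcribers` is an ordered list of callables that take an audio path
    and return text, or raise on failure (a hypothetical shape, for
    illustration only).
    """
    errors = []
    for transcriber in transcribers:
        try:
            return transcriber(audio_path)
        except Exception as exc:
            # A failed model simply advances the chain to the next entry.
            errors.append(exc)
    raise RuntimeError(f"all transcribers failed: {errors}")
```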
## Auto-Detection Hierarchy
Without custom configuration, the system attempts transcription in this order:
- Local CLI tools (sherpa-onnx-offline, whisper-cli, whisper Python CLI)
- Gemini CLI
- Provider APIs (OpenAI, Groq, Deepgram, Google)
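Detecting a local CLI amounts to a first-hit scan over an ordered candidate list. The sketch below covers only the local-CLI step of the hierarchy (provider API detection is omitted), and the candidate names are taken from the list above:

```python
import shutil

# Ordered local candidates, mirroring the documented detection order.
LOCAL_CANDIDATES = ["sherpa-onnx-offline", "whisper-cli", "whisper"]

def first_available(candidates=LOCAL_CANDIDATES, which=shutil.which):
    """Return the first candidate found on PATH, or None if none exist.

    `which` is injectable for testing; by default it checks the real PATH.
    """
    for name in candidates:
        if which(name):
            return name
    return None
```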
## Configuration Options
Three configuration patterns are provided:
1. **Provider with CLI fallback** – Uses OpenAI with Whisper CLI as backup
2. **Provider-only with scope gating** – Restricts to specific chat contexts (e.g., denying group chats)
3. **Single provider** – Deepgram example for dedicated service use
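A sketch of the first pattern (provider with CLI fallback) might look like the following. The key names `models`, `provider`, and `cli` are assumptions for illustration and may not match OpenClaw's actual schema; consult the project's configuration reference for the real shape.

```json
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          { "provider": "openai" },
          { "cli": "whisper-cli" }
        ]
      }
    }
  }
}
```

Entries earlier in the list are tried first, so the CLI only runs when the provider call fails.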
## Important Constraints
The default audio size cap is 20 MB (`tools.media.audio.maxBytes`). Audio exceeding the cap is skipped for that model, and the next entry in the chain is tried.
Authentication follows standard model auth patterns. The transcript output is available as `{{Transcript}}` for downstream processing, with optional character trimming via `maxChars`.
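For instance, trimming might be configured alongside the other audio settings. Only `maxChars` itself comes from the documentation above; its exact placement in the config tree is an assumption by analogy with `tools.media.audio.maxBytes`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "maxChars": 4000
      }
    }
  }
}
```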
## Notable Gotchas
Keep these caveats in mind:

- Scope rules use first-match evaluation.
- CLI commands must exit cleanly with plain text output.
- Timeouts should be kept reasonable so transcription does not block the reply queue.
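First-match evaluation can be illustrated in miniature. The rule shape below (pattern/allowed pairs with a `*` wildcard and deny-by-default) is hypothetical; only the first-match semantics mirrors the documented behavior.

```python
def scope_allows(chat_type, rules):
    """Return the decision of the first rule whose pattern matches.

    Each rule is a (pattern, allowed) pair; "*" matches any chat type.
    Later rules are never consulted once one matches (first-match wins).
    """
    for pattern, allowed in rules:
        if pattern == "*" or pattern == chat_type:
            return allowed
    return False  # no rule matched: deny by default (an assumption)
```

Ordering therefore matters: a deny rule for group chats must appear before a catch-all allow, or it will never fire.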