VibeVoice ASR
Long-Form Multi-Speaker Transcription
About VibeVoice ASR
VibeVoice ASR is a unified speech-to-text model that transcribes up to 60 minutes of continuous audio in a single pass, capturing speaker identity, timestamps, and content in one structured output. The roughly 9-billion-parameter architecture jointly performs automatic speech recognition, speaker diarization, and temporal alignment without audio segmentation, supports more than 50 languages with native code-switching, and accepts customer-provided hotwords for domain-specific vocabulary. A 64K-token context window lets it maintain consistent speaker identity across the full hour of audio.
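To make the relationship between the 60-minute limit and the 64K-token window concrete, the arithmetic below works out how many tokens are available per second of audio. It is a back-of-the-envelope sketch, not a description of the model's tokenizer: it assumes "64K" means 65,536 tokens, and the function name is illustrative.

```python
# Rough token budget implied by the stated limits.
# Assumption: "64K" is read as 64 * 1024 tokens; the source does not say
# whether it means 64,000 or 65,536.
CONTEXT_TOKENS = 64 * 1024
AUDIO_SECONDS = 60 * 60  # 60 minutes of audio


def token_budget_per_second(context_tokens: int = CONTEXT_TOKENS,
                            audio_seconds: int = AUDIO_SECONDS) -> float:
    """Average number of context tokens available per second of audio."""
    return context_tokens / audio_seconds


budget = token_budget_per_second()
print(f"{budget:.1f} tokens per second of audio")  # prints "18.2 tokens per second of audio"
```

This budget has to cover everything held in context for the recording, so it is an upper bound on the combined cost of the audio representation and the generated transcript. How the model divides it between the two is not stated here.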
Jointly performing ASR, diarization, and timestamping removes a long-standing bottleneck in speech workflows that traditionally chained together separate models and brittle post-processing. By preserving global context across long recordings, VibeVoice ASR avoids the speaker-confusion errors that chunk-based pipelines accumulate. The result is searchable, speaker-attributed transcripts of meetings, interviews, and multilingual discussions. These transcripts are directly useful for content management, compliance, accessibility, and downstream agents that consume audio as a first-class modality.
Key capabilities
- 60-minute single-pass transcription within a 64K-token context window
- Joint ASR, speaker diarization, and timestamping in one model
- 50+ languages with native code-switching, plus user-provided hotwords for domain-specific vocabulary
- Structured "who said what and when" output
- Optimized for vLLM serving with PyTorch and Transformers
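As an illustration of how a "who said what and when" transcript might be consumed downstream, the sketch below models each entry as a speaker label, a start and end time, and the spoken text. The `Segment` fields, the helper functions, and the sample data are all hypothetical; the model's actual output schema is not specified here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry. Hypothetical schema, not the model's real format."""
    speaker: str   # who
    start: float   # when (seconds from the start of the recording)
    end: float
    text: str      # what


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds spoken per speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


def search(segments: list[Segment], keyword: str) -> list[Segment]:
    """Case-insensitive keyword search over segment text."""
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg.text.lower()]


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the quarterly review."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks. Let's start with the budget."),
    Segment("Speaker 1", 9.0, 12.0, "The budget is on track."),
]

print(speaking_time(segments))
for hit in search(segments, "budget"):
    print(f"[{hit.start:.1f}-{hit.end:.1f}] {hit.speaker}: {hit.text}")
```

Because speaker, time, and text arrive together in one record, tasks such as per-speaker talk time or keyword search with attribution need no separate alignment step.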