VibeVoice ASR

Long-Form Multi-Speaker Transcription

469,910 USERS

Try VibeVoice ASR on Microsoft Foundry →

About VibeVoice ASR

VibeVoice ASR is a unified speech-to-text model that transcribes up to 60 minutes of continuous audio in a single pass, capturing speaker identity, timestamps, and content in one structured output. The roughly 9-billion-parameter architecture jointly performs automatic speech recognition, speaker diarization, and temporal alignment without audio segmentation, supports more than 50 languages with native code-switching, and accepts customer-provided hotwords for domain-specific vocabulary. A 64K-token context window lets it maintain consistent speaker identity across the full hour of audio.
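To get a feel for what fitting an hour of audio into a 64K-token window implies, the rough arithmetic below divides the context size by the audio duration. It is only a back-of-the-envelope sketch: it assumes "64K" means 65,536 tokens and treats the whole window as one shared budget per second of audio, which the page does not state.

```python
# Back-of-the-envelope token budget; both figures are assumptions for illustration.
CONTEXT_TOKENS = 64 * 1024   # assumes "64K" means 65,536 tokens
AUDIO_SECONDS = 60 * 60      # 60 minutes of continuous audio


def tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Average number of context tokens available per second of audio."""
    return context_tokens / audio_seconds


budget = tokens_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)
print(f"{budget:.1f} tokens per second of audio")
```

Under these assumptions the average budget is about 18 tokens per second, and it has to cover both the audio representation and the generated transcript.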

Jointly performing ASR, diarization, and timestamping removes a long-standing bottleneck in speech workflows that traditionally chained together separate models and brittle post-processing. By preserving global context across long recordings, VibeVoice ASR avoids the speaker-confusion errors that chunk-based pipelines accumulate. The result is searchable, speaker-attributed transcripts of meetings, interviews, and multilingual discussions, which are directly useful for content management, compliance, accessibility, and downstream agents that consume audio as a first-class modality.

Key capabilities

60-minute single-pass transcription within a 64K-token context
Joint ASR, speaker diarization, and timestamping in one model
Supports 50+ languages with hotword biasing
Structured "who said what and when" output
Optimized for vLLM serving with PyTorch and Transformers
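
The "who said what and when" output listed above can be pictured as a list of speaker-attributed, timestamped segments. The sketch below is illustrative only: the Segment class, its field names, and the sample data are hypothetical and do not reflect the model's actual output schema. It shows how such a transcript could be searched and summarized per speaker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment; the field names are hypothetical."""
    speaker: str   # who
    start: float   # when (seconds from the start of the recording)
    end: float
    text: str      # what


# Hand-written sample data, not real model output.
transcript = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the quarterly review."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks. Let's start with the budget."),
    Segment("Speaker 1", 9.0, 12.0, "The budget is on track."),
]


def talk_time(segments):
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def search(segments, keyword):
    """Segments whose text contains the keyword, case-insensitively."""
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg.text.lower()]


print(talk_time(transcript))
for seg in search(transcript, "budget"):
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker}: {seg.text}")
```

Keeping speaker, time span, and text together in each segment is what makes the transcript directly searchable and attributable without a separate alignment step.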

Technology Stack

PyTorch Transformers vLLM

Ready to Explore?

Dive into platform integrations, source code, research papers, and announcements.

PLATFORM: Microsoft Foundry. Try VibeVoice ASR in the Microsoft Foundry model catalog.
CODE: GitHub Repository. Browse the open-source codebase and contribute.
ACADEMIC: Research Paper. Read the peer-reviewed publication.