Creative & Generative Media · Model · Speech & Audio · Experimental

VibeVoice ASR

Long-Form Multi-Speaker Transcription

469,910 USERS
Try VibeVoice ASR on Microsoft Foundry →

About VibeVoice ASR

VibeVoice ASR is a unified speech-to-text model that transcribes up to 60 minutes of continuous audio in a single pass, capturing speaker identity, timestamps, and content in one structured output. The roughly 9-billion-parameter architecture jointly performs automatic speech recognition, speaker diarization, and temporal alignment without audio segmentation, supports more than 50 languages with native code-switching, and accepts customer-provided hotwords for domain-specific vocabulary. A 64K-token context window lets it maintain consistent speaker identity across the full hour of audio.
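To give a rough sense of what fitting an hour of audio into a 64K-token window implies, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: it assumes "64K" means 65,536 tokens, and it does not know how the model actually divides the window between audio input and transcript output.

```python
# Back-of-the-envelope budget: how many context tokens are available per
# second of audio if a full hour must fit in a single pass.
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens
AUDIO_SECONDS = 60 * 60      # 60 minutes, as stated in the description


def tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Average token budget per second of audio (input and output combined)."""
    return context_tokens / audio_seconds


budget = tokens_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)
print(f"{budget:.1f} tokens per second of audio")  # prints "18.2 tokens per second of audio"
```

Under these assumptions the audio representation and the structured transcript together have to average about 18 tokens per second of speech, which suggests why a compact audio encoding matters for single-pass, hour-long transcription.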

Jointly performing ASR, diarization, and timestamping removes a long-standing bottleneck in speech workflows, which traditionally chained together separate models and brittle post-processing. By preserving global context across long recordings, VibeVoice ASR avoids the speaker-confusion errors that chunk-based pipelines accumulate. The result is searchable, speaker-attributed transcripts of meetings, interviews, and multilingual discussions, which are directly useful for content management, compliance, accessibility, and downstream agents that consume audio as a first-class modality.
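As an illustration of how a downstream application might consume a speaker-attributed, timestamped transcript, here is a minimal Python sketch. The `Segment` fields (`speaker`, `start`, `end`, `text`), the helper names, and the sample data are all hypothetical, chosen only for this example; the model's actual output schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical field names; the real output schema may differ.
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def search(segments, phrase):
    """Return segments whose text contains the phrase (case-insensitive)."""
    needle = phrase.casefold()
    return [seg for seg in segments if needle in seg.text.casefold()]


def talk_time(segments):
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def format_segment(seg):
    """Render one segment as '[MM:SS] speaker: text'."""
    minutes, seconds = divmod(int(seg.start), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}"


# Made-up sample data standing in for a real transcript.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the quarterly review."),
    Segment("Speaker 2", 4.5, 12.0, "Thanks. Revenue grew this quarter."),
    Segment("Speaker 1", 12.0, 15.0, "What drove the revenue growth?"),
]

for hit in search(segments, "revenue"):
    print(format_segment(hit))
print(talk_time(segments))
```

Because each segment already carries who spoke and when, keyword search and per-speaker statistics reduce to simple list operations, with no separate diarization or alignment step in the application code.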

Key capabilities

  • 60-minute single-pass transcription within a 64K-token context window
  • Joint ASR, speaker diarization, and timestamping in one model
  • Support for 50+ languages with native code-switching, plus hotword biasing from customer-provided vocabulary
  • Structured "who said what and when" output
  • Optimized for vLLM serving with PyTorch and Transformers

Technology Stack

PyTorch, Transformers, vLLM