Experiment

VibeVoice-ASR

Speech synthesis systems have made rapid progress in recent years, producing natural, high-fidelity audio for short, single-speaker utterances. Existing approaches still struggle, however, with long-form, multi-speaker conversations such as podcasts or multi-character audiobooks, where natural turn-taking and contextual continuity are essential. Traditional systems can concatenate individual utterances to simulate dialogue, but this often yields disjointed or unstable output.

 

To address this challenge, researchers introduce VibeVoice-ASR, a unified speech-to-text model designed to transcribe up to 60 minutes of continuous audio in a single pass while producing rich, structured output that captures who said what, and when. Built by Microsoft Research and trained to produce structured transcriptions reliably, VibeVoice-ASR represents a new approach to speech recognition: one that treats long-form audio understanding as a first-class problem rather than a collection of stitched-together steps.

 

VibeVoice-ASR moves beyond traditional automatic speech recognition pipelines by unifying transcription, speaker diarization, and timestamping into a single model and a single inference pass. Instead of slicing audio into short segments and reconciling the results afterward, the model processes long recordings holistically, preserving global context across the entire conversation.
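To make the "who said what, and when" output concrete, here is a minimal sketch of how such structured transcription results might be represented and rendered. The `Segment` type, field names, and `format_transcript` helper are illustrative assumptions, not the model's actual API; the point is only that each unit of output pairs a speaker label with start/end timestamps and text, so a full diarized transcript falls out of a single pass.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One diarized, timestamped turn (hypothetical output shape)."""
    speaker: str   # diarized speaker label, e.g. "Speaker 1"
    start: float   # turn start time, in seconds
    end: float     # turn end time, in seconds
    text: str      # transcribed words for this turn

def format_transcript(segments: list[Segment]) -> str:
    """Render segments as a readable, timestamped transcript."""
    return "\n".join(
        f"[{seg.start:07.2f}-{seg.end:07.2f}] {seg.speaker}: {seg.text}"
        for seg in segments
    )

# Example: two turns from a hypothetical podcast recording.
segments = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome back to the show."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks for having me."),
]
print(format_transcript(segments))
```

Because the model emits speaker identity and timing alongside the words, consumers of the output never need a separate diarization pass or post-hoc alignment step to reconstruct the conversation.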