Creative & Generative Media Model Speech & Audio Production

MAI-Voice-2

Multilingual TTS with Voice Cloning

Try MAI-Voice-2 on Microsoft Foundry →

About MAI-Voice-2

MAI-Voice-2 is Microsoft AI’s updated multilingual text-to-speech model. It adds two headline capabilities on top of MAI-Voice-1’s neural synthesis: zero-shot voice cloning from a 10-to-60-second reference clip, and voice prompting, which uses a short audio example to shape tone, emotion, accent, and pacing. Emotion tags give developers granular control over delivery, and the model maintains stable speaker identity across long-form content such as audiobooks, podcasts, lectures, and serialized assistants. A single voice persona can therefore carry across markets without a separate voice library for each scenario.

In blind preference tests, listeners chose MAI-Voice-2 over MAI-Voice-1 72% of the time, citing clearer articulation, more natural pacing, and higher audio fidelity. In blind listening tests against real human speech, evaluators often could not reliably tell the model’s output from the human recordings. The result is production-grade voice across assistants, entertainment, accessibility, education, and creator tooling: voice cloning for branded spokespersons and personalized assistants, voice prompting for fast iteration without a managed voice library, and emotion tags for fine-grained delivery in interactive scenarios.

Key capabilities

  • Zero-shot voice cloning from 10–60 seconds of reference audio
  • Voice prompting to shape tone, emotion, accent, and pacing from a short example
  • Emotion tags for fine-grained control over delivery
  • Preferred over MAI-Voice-1 72% of the time in blind preference tests; often indistinguishable from human speech in blind listening tests
  • Stable speaker identity across long-form content — audiobooks, podcasts, lectures
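
The 10–60 second reference window for voice cloning can be checked locally before a clip is submitted. The sketch below is illustrative only and is not part of any MAI-Voice-2 SDK: the helper names, the inclusive bounds, and the choice of WAV input are all assumptions, and it relies solely on Python's standard library.

```python
import io
import wave

# Bounds taken from the stated 10-60 second reference window.
# Treating both ends as inclusive is an assumption.
MIN_REFERENCE_SECONDS = 10.0
MAX_REFERENCE_SECONDS = 60.0


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(duration_seconds):
    """Hypothetical pre-check: is the clip inside the 10-60 second window?"""
    return MIN_REFERENCE_SECONDS <= duration_seconds <= MAX_REFERENCE_SECONDS


def make_silent_wav(seconds, sample_rate=8000):
    """Build an in-memory mono 16-bit WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    for seconds in (5, 12, 75):
        duration = wav_duration_seconds(make_silent_wav(seconds))
        print(seconds, is_valid_reference_clip(duration))
```

In practice `wav_duration_seconds` would be given the path of a real recording; the silent clips stand in for one so the sketch runs without external files.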
