MAI-Voice-1

Neural Text-to-Speech

Try MAI-Voice-1 on Microsoft Foundry →

About MAI-Voice-1

MAI-Voice-1 is a Microsoft AI neural text-to-speech model that generates a full minute of audio in under one second on a single GPU. It supports SSML-driven emotion and style control, maintains a stable voice persona across long-form content, and offers sub-100 ms latency for interactive workloads via the Azure Speech SDK. The model is designed for both streaming and batch synthesis, making it suitable for virtual assistants, accessibility tooling, narration, and audiobook production.

By combining very fast synthesis with expressive prosody and speaker consistency, MAI-Voice-1 enables real-time, context-aware voice experiences that were previously bottlenecked by synthesis latency. Tight integration with Azure services and SSML controls lets developers shape tone and style with fine granularity, supporting use cases from customer service to inclusive narration of long-form content. The model is part of Microsoft AI’s first-party media stack alongside MAI-Image-2 and MAI-Transcribe-1, giving developers a coherent suite for image generation, speech synthesis, and transcription under one provider.
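To make the SSML-based style control concrete, here is a minimal sketch that builds an SSML document requesting a speaking style. It assumes MAI-Voice-1 accepts the `mstts:express-as` element that Azure Speech documents for its neural voices; the voice name `hypothetical-mai-voice-1` and the `cheerful` style are placeholders, not values confirmed by this page. The sketch only constructs and validates the markup; sending it to the service is left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespaces used by SSML and by Azure Speech's style extension.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(text, voice="hypothetical-mai-voice-1", style="cheerful", lang="en-US"):
    """Return an SSML string that wraps `text` in a style request.

    `voice` and `style` are placeholder values (assumptions); check the
    model catalog for the names the service actually accepts.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so user text cannot break the markup
        "</mstts:express-as></voice></speak>"
    )


ssml = build_ssml("Welcome back & thanks for calling.")
ET.fromstring(ssml)  # raises ParseError if the document is not well-formed
print(ssml)
```

Escaping the input text matters in practice: an unescaped `&` or `<` in user-supplied text would make the whole request malformed.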

Key capabilities

Generates 1 minute of audio in <1 second on a single GPU
SSML-based emotion and style control
Stable voice persona across long-form content
Real-time synthesis via the Azure Speech SDK
Lightweight next-generation neural TTS architecture
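The headline speed figure can be restated as a real-time factor (RTF), the ratio of synthesis time to audio duration. The sketch below works through that arithmetic. It takes the stated "1 minute in under 1 second" as an upper bound and assumes, purely for illustration, that throughput stays constant on long-form content, which the page does not claim. The function names are hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_budget(audio_seconds, rtf):
    """Estimated synthesis time for a clip, assuming a constant RTF."""
    return audio_seconds * rtf


# Stated figure: 60 s of audio in under 1 s, so the RTF is below 1/60.
rtf_bound = real_time_factor(1.0, 60.0)

# Illustrative case: a 10-hour audiobook at that bound (constant RTF assumed).
audiobook_seconds = 10 * 3600
budget_seconds = synthesis_budget(audiobook_seconds, rtf_bound)

print(f"RTF upper bound: {rtf_bound:.4f}")
print(f"10-hour audiobook: under {budget_seconds / 60:.0f} minutes of GPU time")
```

Under those assumptions, an RTF below 1/60 means synthesis runs more than 60 times faster than playback, which is the property that makes batch narration workloads practical.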

Technology Stack

Neural TTS
Azure Speech SDK

Ready to Explore?

Dive into platform integrations, source code, research papers, and announcements.

Platform: Microsoft Foundry
Try MAI-Voice-1 in the Microsoft Foundry model catalog.

Blog: Microsoft Blog
See the latest updates from Microsoft Research.