
MAI-Transcribe-1.5

High-Speed Multilingual Speech Recognition

About MAI-Transcribe-1.5

MAI-Transcribe-1.5 is Microsoft AI’s updated speech-to-text model. It adds entity biasing on top of the production-grade transcription accuracy of MAI-Transcribe-1. Developers prime the model with a list of domain-specific entities (meeting attendees, product names, medical vocabulary, scientific terms), and the model biases its predictions toward those entries when the acoustic signal is ambiguous. The model retains its #1 spot on the FLEURS benchmark, with Word Error Rate (WER) improving from 3.9% to 3.7%. On Artificial Analysis it ranks third with 2.5% overall WER, and it handles cross-talk, background noise, and long-form meetings in production conditions.
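To make the idea of entity biasing concrete, here is a toy sketch in Python. It is not the MAI-Transcribe-1.5 API, and it does not describe how the model implements biasing internally. The function `rescore`, the `bonus` weight, the candidate transcripts, and their scores are all hypothetical, chosen only to show how a supplied entity list can tip the choice between two acoustically similar hypotheses.

```python
def rescore(candidates, entities, bonus=1.0):
    """Return the best transcript after rewarding matches against an entity list.

    candidates: list of (text, acoustic_score) pairs; a higher score is better.
    entities:   phrases the caller wants the recognizer to favor.
    bonus:      hypothetical score added for each entity found in a candidate.
    """
    def biased_score(item):
        text, score = item
        lowered = text.lower()
        # Count how many of the supplied entities appear in this hypothesis.
        hits = sum(1 for entity in entities if entity.lower() in lowered)
        return score + bonus * hits

    return max(candidates, key=biased_score)[0]


# Two made-up hypotheses for the same ambiguous audio (scores are invented).
candidates = [
    ("meeting with jon smith about as your", -4.1),
    ("meeting with John Smythe about Azure", -4.3),
]

# Without an entity list, the higher acoustic score wins.
unbiased = rescore(candidates, entities=[])

# With the attendee and product names supplied, the second hypothesis wins.
biased = rescore(candidates, entities=["John Smythe", "Azure"])
```

In this sketch the bias only matters when the candidates are close: a hypothesis with a much better acoustic score would still win, which mirrors the description above of biasing applied when the signal is ambiguous.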

The 1.5 generation also improves on speed and cost: throughput is up roughly 3× and latency is improved by 5.7× over MAI-Transcribe-1, so the model transcribes an hour of audio in under 10 seconds, more than 5× faster than Gemini 3.1 Flash, ScribeV2, and gpt-4o-transcribe. The combination of best-in-class accuracy on multilingual benchmarks, content-aware biasing, and frontier-leading throughput positions MAI-Transcribe-1.5 as the production transcription layer for meetings, voice agents, and enterprise content workflows, where speed and domain accuracy matter as much as raw WER.
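The "one hour in under 10 seconds" figure can be restated as a speed-up over real time. The short Python sketch below does that arithmetic. The helper names are hypothetical, and the archive estimate assumes processing time scales linearly with audio length on a single stream, which the page does not state.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Processing time for a given amount of audio, assuming linear scaling."""
    return audio_seconds / factor


# One hour of audio in 10 seconds; "under 10 seconds" makes this a lower bound.
factor = realtime_factor(3600, 10)
print(factor)  # 360.0

# Under the linear-scaling assumption, a 100-hour archive takes about 1000 s.
archive_seconds = estimated_processing_seconds(100 * 3600, factor)
print(archive_seconds)  # 1000.0
```

Because the stated time is an upper bound, 360× real time is the minimum speed-up implied by the claim, not a measured value.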

Key capabilities

  • Entity biasing: prime the model with names, brands, and domain vocabulary to resolve ambiguous audio
  • Retains #1 on FLEURS with WER improving from 3.9% to 3.7%
  • Transcribes 1 hour of audio in under 10 seconds, more than 5× faster than Gemini 3.1 Flash, ScribeV2, and gpt-4o-transcribe
  • Throughput up ~3×, latency improved by 5.7× over MAI-Transcribe-1
  • Handles cross-talk, background noise, and long-form meetings in production conditions
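
For readers unfamiliar with the metric behind the FLEURS numbers, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The Python sketch below computes it for a single sentence pair. It is a minimal illustration that splits on whitespace only; the text normalization used for the benchmark results above is not described on this page, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # Deleting all i reference words to match an empty hypothesis.
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(wer)  # 0.2

# Relative error reduction implied by moving from 3.9% to 3.7% WER (about 5%).
relative_reduction = (3.9 - 3.7) / 3.9
```

Read this way, the move from 3.9% to 3.7% is a drop of 0.2 percentage points, or roughly a 5% relative reduction in errors.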
