Microsoft AI has released three new foundational models for generating text, voice, and images. This marks its continued push to build a multimodal AI stack and compete with rivals, even while maintaining its partnership with OpenAI.
The models are MAI-Transcribe-1 for fast speech transcription across 25 languages, MAI-Voice-1 for generating custom audio, and MAI-Image-2 for image generation. They were developed by Microsoft’s Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman.
A key selling point is competitive pricing, with transcription starting at $0.36 per hour. Suleyman emphasized the company’s “Humanist AI” approach, which focuses on practical use. He also reaffirmed Microsoft’s commitment to OpenAI, noting that a recent renegotiation of the partnership enabled this independent research. The models are available on Microsoft Foundry and MAI Playground.

