Microsoft takes on AI rivals with three new foundational models

Microsoft AI has released three new foundational models for generating text, voice, and images. The release continues its push to build a multimodal AI stack and compete with rivals while maintaining its partnership with OpenAI.

The models are MAI-Transcribe-1 for fast speech transcription across 25 languages, MAI-Voice-1 for generating custom audio, and MAI-Image-2 for image generation. They were developed by Microsoft’s Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman.

A key selling point is competitive pricing, starting at $0.36 per hour for transcription. Suleyman emphasized the company’s “Humanist AI” approach, which focuses on practical use. He also reaffirmed Microsoft’s commitment to OpenAI, noting that a recent renegotiation of the partnership enabled this independent research. The models are available on Microsoft Foundry and MAI Playground.