Mistral releases a new open source model for speech generation

On Thursday, the French AI company Mistral released a new open source text-to-speech model. Named Voxtral TTS, this model is designed for use in voice AI assistants and enterprise applications like customer support and sales engagement. This launch places Mistral in direct competition with established players such as ElevenLabs, Deepgram, and OpenAI.

The Voxtral TTS model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. According to Pierre Stock, Mistral AI’s VP of science operations, the model was built in response to customer demand. He described it as a small speech model capable of running on edge devices like smartwatches, smartphones, and laptops. Stock emphasized that the model offers state-of-the-art performance at a fraction of the cost of other options on the market.

A key feature of Voxtral TTS is its ability to adapt to a custom voice from a sample of less than five seconds. It can capture nuanced characteristics including subtle accents, inflections, intonations, and irregularities in speech flow. Based on the Ministral 3B architecture, the model can also switch between languages seamlessly without losing the voice’s unique traits, making it useful for applications like dubbing and real-time translation. The company’s goal was to create a model that sounds human rather than robotic.

Mistral built the model for real-time performance. It has a time-to-first-audio, the delay before playback can begin, of 90 milliseconds for a 10-second sample of 500 characters. It also has a real-time factor of about 6x, meaning it generates audio roughly six times faster than playback speed, so it can render a 10-second audio clip in approximately 1.6 seconds.
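
The link between real-time factor and render time is simple arithmetic. The sketch below is illustrative only: it assumes the real-time factor means audio duration divided by generation time (some sources define it as the inverse), and the function name is hypothetical, not part of any Mistral API.

```python
def render_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to generate a clip, given a speed-up relative to playback.

    Assumption: real-time factor = audio duration / generation time,
    which matches how the figure is used in the article.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# A 10-second clip at exactly 6x takes about 1.67 seconds to render.
exact_6x = render_time_seconds(10.0, 6.0)

# Working backwards, a 1.6-second render of a 10-second clip implies 6.25x.
implied_speedup = 10.0 / 1.6

print(f"{exact_6x:.2f} s at 6x; 1.6 s implies {implied_speedup:.2f}x")
```

Under that assumption, the quoted 1.6 seconds corresponds to a speed-up slightly above 6x, which is consistent with 6x being a rounded figure.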

Earlier this year, Mistral launched a pair of transcription models, one for batch processing and one for low-latency real-time use. With this new speech model, the company appears to be building toward a comprehensive suite of voice products for businesses. Stock outlined a vision for an end-to-end platform capable of handling multimodal streams of input and output, including audio, text, and images. He said that an agentic system that supports audio can draw on more information and deliver more benefits.

Mistral’s positioning in the market hinges on its open source approach and customization capabilities. The company believes that allowing enterprises to tune the model to their specific needs will encourage adoption over competing offerings.