Generate speech

Generate natural-sounding speech from text. Clone voices, control emotions, and produce audio in dozens of languages.

Models we recommend

Best quality: MiniMax Speech 2.8 HD

MiniMax Speech 2.8 HD ranks #1 on TTS benchmarks, outperforming both OpenAI and ElevenLabs in blind evaluations. It offers studio-grade voice synthesis with 17+ preset voices, emotion control, voice cloning from just 5 seconds of audio, and support for 32+ languages. It is the best choice for voiceovers, audiobooks, and polished content.

Most expressive: ElevenLabs v3

ElevenLabs v3 is the most expressive model in this collection, steered by inline audio tags like [excited], [whispers], and [sighs]. It supports 70+ languages and 26 voices. It requires more prompt engineering than other models but produces the most emotionally rich output. Great for film, audiobooks, and creative media.
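
As a rough sketch of how audio tags fit into a prompt, the helper below prefixes each line of a script with a bracketed tag. The function names `tag_line` and `build_script` and the one-tag-per-line convention are illustrative assumptions, not part of any ElevenLabs API. Only the tag names come from the description above.

```python
from typing import List, Optional, Tuple


def tag_line(text: str, tag: Optional[str] = None) -> str:
    """Prefix text with a bracketed audio tag such as [whispers]."""
    text = text.strip()
    if not tag:
        return text
    # Accept "whispers" or "[whispers]" and normalize to one pair of brackets.
    return f"[{tag.strip('[] ')}] {text}"


def build_script(lines: List[Tuple[Optional[str], str]]) -> str:
    """Join (tag, text) pairs into a single prompt string."""
    return " ".join(tag_line(text, tag) for tag, text in lines)


script = build_script([
    ("excited", "We finally shipped it!"),
    ("whispers", "Nobody knows yet."),
    (None, "Keep it that way."),
])
print(script)
# [excited] We finally shipped it! [whispers] Nobody knows yet. Keep it that way.
```

The resulting string would be passed as the text input to the model. Which tags are accepted, and where they may appear, is something to confirm on the model's own page.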

Style-prompted: Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS from Google gives you fine-grained control over delivery through inline tags and style prompting. Set a scene, define a character, and direct the performance: "you must hear the grin in the audio." It offers 30 voices, 70+ languages, and natural-sounding, expressive output.

Best for real-time: MiniMax Speech 2.8 Turbo

MiniMax Speech 2.8 Turbo is optimized for low-latency applications like voice agents, chatbots, and interactive experiences. Supports 40+ languages with the same voice cloning and emotion control as the HD version.

Ultra-low latency: Inworld TTS 1.5 Mini

Inworld TTS 1.5 Mini achieves ~120ms latency, the fastest in this collection. It supports 15 languages with emotion markups and SSML break tags. Inworld TTS 1.5 Max trades a bit of speed for higher quality, with latency under 200ms.
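
To illustrate what SSML break tags look like in practice, here is a minimal sketch that inserts a pause between sentences. It uses the standard SSML form `<break time="..."/>`. Whether the Inworld models accept exactly this attribute, and which pause lengths they allow, is an assumption to verify against the model documentation. The helper name `add_breaks` is hypothetical.

```python
import re


def add_breaks(text: str, pause_ms: int = 300) -> str:
    """Insert an SSML break tag between sentences."""
    # Split after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return f' <break time="{pause_ms}ms"/> '.join(sentences)


print(add_breaks("Hold on. Let me check. Done!", pause_ms=500))
# Hold on. <break time="500ms"/> Let me check. <break time="500ms"/> Done!
```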

For voice cloning: Chatterbox

Chatterbox from Resemble AI excels at voice cloning with emotional control — generate distinct character voices from just a few seconds of reference audio. Great for games, animations, and storytelling.

Multilingual: ElevenLabs v2 Multilingual

ElevenLabs v2 Multilingual generates speech in 29 languages while maintaining consistent voice quality across all of them. Good for localization workflows where the same voice needs to work in multiple languages.

Open source: Tortoise TTS

Tortoise TTS is an open-source option that produces high-quality speech. Slower than the commercial models but fully self-hostable.

Frequently asked questions

Which model should I start with?

minimax/speech-2.8-hd is the best starting point for most projects: it ranks #1 on TTS benchmarks, supports 32+ languages, and includes voice cloning and emotion control. For real-time applications, use the Turbo variant, minimax/speech-2.8-turbo.
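
If you route requests in code, the recommendations on this page can be captured in a small lookup table. This is only a sketch: the model identifiers are the ones used on this page, while the use-case keys and the `pick_model` helper are made up for illustration.

```python
# Use case -> model identifier, taken from the recommendations on this page.
# The dictionary keys are hypothetical labels, not an official taxonomy.
RECOMMENDED = {
    "quality": "minimax/speech-2.8-hd",
    "realtime": "minimax/speech-2.8-turbo",
    "lowest_latency": "inworld/tts-1.5-mini",
    "expressive": "elevenlabs/v3",
    "voice_cloning": "resemble-ai/chatterbox",
    "multilingual": "elevenlabs/v2-multilingual",
    "open_source": "afiaka87/tortoise-tts",
}


def pick_model(use_case: str) -> str:
    """Return the recommended model, falling back to the best-overall pick."""
    return RECOMMENDED.get(use_case, RECOMMENDED["quality"])


print(pick_model("realtime"))
print(pick_model("something-else"))
```

Falling back to the HD model for unknown use cases mirrors the advice above to start there unless latency is the main constraint.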

Which models are the fastest?

inworld/tts-1.5-mini achieves ~120ms latency — the fastest in this collection. minimax/speech-2.8-turbo is also designed for low-latency real-time use. Both are great for chatbots, voice agents, and interactive apps.

Which model sounds the most natural and expressive?

elevenlabs/v3 produces the most emotionally rich speech. It supports audio tags like [excited], [whispers], and [sighs] for fine-grained control. It requires more prompt engineering than the other models but delivers the best results for audiobooks, film, and creative media.

How do I clone a voice?

minimax/speech-2.8-hd and minimax/speech-2.8-turbo both support voice cloning from just 5 seconds of reference audio. resemble-ai/chatterbox is another option with emotional control, especially good for character voices in games and animation.
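
Before uploading a reference clip, it can help to check its length locally. The sketch below reads a WAV file with Python's standard `wave` module and compares its duration to the 5-second figure mentioned above. Treating 5 seconds as a hard minimum is an assumption, the helper names are hypothetical, and the code only handles uncompressed PCM WAV files.

```python
import wave

# Assumed minimum, based on the "5 seconds of reference audio" figure above.
MIN_SECONDS = 5.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """Check whether a reference clip meets the assumed minimum length."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, `is_long_enough("reference.wav")` would return `False` for a 3-second clip, where `reference.wav` stands in for your own recording.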

Which models support the most languages?

elevenlabs/v3 supports 70+ languages, as does Gemini 3.1 Flash TTS. minimax/speech-2.8-turbo supports 40+ languages and minimax/speech-2.8-hd supports 32+. elevenlabs/v2-multilingual supports 29 languages with consistent voice quality across all of them.

Can I control emotions in the speech?

Yes, most modern TTS models support emotion control. MiniMax models support happy, sad, angry, fearful, calm, and other emotions. ElevenLabs v3 uses audio tags for finer control. inworld/tts-1.5-max and inworld/tts-1.5-mini support emotion markups like [happy] and [sad], plus non-verbal sounds like [laugh] and [sigh].
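
Because a mistyped markup may simply be read aloud or ignored, a quick pre-flight check can be useful. The sketch below scans text for bracketed tags and reports any that are not in a known set. The set contains only the four Inworld markups named above, so the real list is likely longer. The `unknown_tags` helper is hypothetical.

```python
import re

# Markups named on this page for the Inworld models; the full list may be longer.
KNOWN_TAGS = frozenset({"happy", "sad", "laugh", "sigh"})


def unknown_tags(text: str, known: frozenset = KNOWN_TAGS) -> list:
    """Return bracketed tags found in text that are not in the known set."""
    found = re.findall(r"\[([a-z_ ]+)\]", text.lower())
    return [tag for tag in found if tag not in known]


print(unknown_tags("[happy] We won! [laugh] [shrug] Anyway."))
# ['shrug']
```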

Is there an open-source option?

afiaka87/tortoise-tts is open-source and produces high-quality speech. It's slower than commercial models but can be self-hosted on your own hardware.

Can I use TTS models commercially?

Most models support commercial use. Some may include audio watermarking — check each model's license page for specifics, especially regarding voice cloning and redistribution.