New Expressive Real-Time Text-to-Speech (TTS) Models Support High-Fidelity Voice Cloning

Zyphra, a Palo Alto, California-based company building a new multimodal artificial intelligence agent system, announced the release of Zonos-v0.1 beta, a pair of highly expressive text-to-speech (TTS) models with high-fidelity voice cloning. The company is releasing both the transformer and hybrid TTS models under an Apache 2.0 license, which permits free use, modification, and redistribution.
A visit to the Zyphra website lets you experiment with the two new text-to-speech (TTS) models and test their high-fidelity voice cloning, which can then be applied to read any text with outstanding expressiveness. The website also presents comparisons suggesting that Zonos outperforms leading TTS providers in quality and expressiveness.
Why humanity needs voice cloning (or AI music generation) is a question we should all ask, but as with anything in artificial intelligence, research moves faster than our ability to ponder the consequences. The Zonos models make clear that recognized voice actors and announcers will need to leverage their unique voice “model” rather than just their work, and that deepfakes will be a much larger problem than imagined, given our reliance on media content.
According to Zyphra, Zonos offers flexible control of vocal speed, emotion, tone, and audio quality, as well as instant, unlimited, high-quality voice cloning. Zonos natively generates speech at 44 kHz and is able to clone any voice from just 5 to 30 seconds of speech. Zonos enables highly expressive and natural speech generation from text prompts given a speaker embedding or audio prefix, and it can also be conditioned on speaking rate, pitch standard deviation, audio quality, and emotions such as sadness, fear, anger, happiness, and surprise.
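The conditioning controls described above can be pictured as a single request bundle. The sketch below is purely illustrative: `build_conditioning`, its parameter names, and its default values are assumptions for the sake of the example, not the actual Zonos API.

```python
from typing import Optional

# Hypothetical sketch of a Zonos-style conditioning request; the function and
# field names are illustrative assumptions, not the real Zonos interface.

SAMPLE_RATE_HZ = 44_100  # Zonos generates speech natively at 44 kHz


def build_conditioning(text: str,
                       reference_audio_seconds: float,
                       speaking_rate: float = 1.0,
                       pitch_std: float = 45.0,
                       emotion: Optional[dict] = None) -> dict:
    """Bundle the controls the article describes into one request dict."""
    # Zyphra says cloning needs only 5 to 30 seconds of reference speech.
    if not 5.0 <= reference_audio_seconds <= 30.0:
        raise ValueError("reference clip should be 5-30 seconds long")
    return {
        "text": text,
        "speaking_rate": speaking_rate,            # relative vocal speed
        "pitch_std": pitch_std,                    # pitch standard deviation
        "emotion": emotion or {"happiness": 1.0},  # sadness, fear, anger, ...
        "reference_samples": int(reference_audio_seconds * SAMPLE_RATE_HZ),
    }


cond = build_conditioning("Hello from a cloned voice.",
                          reference_audio_seconds=10.0)
print(cond["reference_samples"])  # 441000 samples of reference audio
```

A real client would pass the reference audio itself (to derive a speaker embedding) rather than just its length; the dict above only shows how the published controls fit together.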
The models were trained on approximately 200,000 hours of speech data, encompassing both neutral-toned speech (such as audiobook narration) and highly expressive speech. The majority of the data is English, but substantial amounts of Chinese, Japanese, French, Spanish, and German broaden the models’ usability.

“We believe that openly releasing models of this caliber will significantly advance TTS research. Currently, Zonos is a beta preview. While highly expressive, Zonos is sometimes unreliable in generations, leading to interesting artifacts. We are excited to continue pushing the frontiers of conversational agent performance, reliability, and efficiency over the coming months,” the company says in its blog.
“Our highly optimized inference engine powers both the Zonos API and playground, achieving impressive time-to-first-audio (TTFA) metrics. The hybrid model demonstrates particularly efficient performance characteristics, with reduced latency and memory overhead compared to its transformer counterpart, thanks to its Mamba2-based architecture that relies less heavily on attention blocks.”
“In future model releases, we aim to significantly improve the model’s reliability, its ability to handle specific pronunciations, the number of supported languages, and the level of control over emotions and other vocal characteristics afforded to the user. We will also pursue further architectural innovations to boost model quality and inference performance,” they state.
For now, the availability of the Zonos models under an Apache 2.0 license, including the first open-source SSM hybrid audio model, lets the audio industry test the technology directly or through the model API that is now available.
Inference code: github.com/Zyphra/Zonos
www.zyphra.com