New Expressive Real-Time Text-to-Speech (TTS) Models Support High-Fidelity Voice Cloning

Zyphra, a Palo Alto, California-based company building a new multimodal artificial intelligence agent system, announced the release of Zonos-v0.1 beta, a pair of highly expressive text-to-speech (TTS) models with high-fidelity voice cloning. The company is releasing both the transformer and hybrid TTS models under an Apache 2.0 license, which permits free use, modification, and redistribution.
A visit to the Zyphra website lets you experiment with the two new text-to-speech (TTS) models and test their high-fidelity voice cloning, which can then be applied to read any text with outstanding expressiveness. The website also presents comparisons suggesting that Zonos outperforms leading TTS providers in quality and expressiveness.
Why humanity needs voice cloning (or AI music generation) is a question we should all ask, but as with anything in artificial intelligence, research moves faster than our ability to ponder the consequences. The Zonos models make clear that recognized voice actors and announcers will need to leverage their unique voice “model” rather than just their work, and that deepfakes will be a much larger problem than imagined, given our reliance on media content.
According to Zyphra, Zonos offers flexible control of vocal speed, emotion, tone, and audio quality, as well as instant, unlimited, high-quality voice cloning. Zonos natively generates speech at 44 kHz and is able to clone any voice from just 5 to 30 seconds of speech. Zonos enables highly expressive and natural speech generation from text prompts given a speaker embedding or audio prefix, and it can also be conditioned on speaking rate, pitch standard deviation, audio quality, and emotions such as sadness, fear, anger, happiness, and surprise.
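The conditioning controls described above can be pictured as a single request bundle. The sketch below is purely illustrative: `build_conditioning`, its parameter names, and its default values are assumptions for the sake of the example, not the actual Zonos API.

```python
from typing import Optional

# Hypothetical sketch of a Zonos-style conditioning request; the function and
# field names are illustrative assumptions, not the real Zonos interface.

SAMPLE_RATE_HZ = 44_100  # Zonos generates speech natively at 44 kHz


def build_conditioning(text: str,
                       reference_audio_seconds: float,
                       speaking_rate: float = 1.0,
                       pitch_std: float = 45.0,
                       emotion: Optional[dict] = None) -> dict:
    """Bundle the controls the article describes into one request dict."""
    # Zyphra says cloning needs only 5 to 30 seconds of reference speech.
    if not 5.0 <= reference_audio_seconds <= 30.0:
        raise ValueError("reference clip should be 5-30 seconds long")
    return {
        "text": text,
        "speaking_rate": speaking_rate,            # relative vocal speed
        "pitch_std": pitch_std,                    # pitch standard deviation
        "emotion": emotion or {"happiness": 1.0},  # sadness, fear, anger, ...
        "reference_samples": int(reference_audio_seconds * SAMPLE_RATE_HZ),
    }


cond = build_conditioning("Hello from a cloned voice.",
                          reference_audio_seconds=10.0)
print(cond["reference_samples"])  # 441000 samples of reference audio
```

A real client would pass the reference audio itself (to derive a speaker embedding) rather than just its length; the dict above only shows how the published controls fit together.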
The models were trained on approximately 200,000 hours of speech data, encompassing both neutral-toned speech (such as audiobook narration) and highly expressive speech. The majority of the data is English, but substantial amounts of Chinese, Japanese, French, Spanish, and German broaden the models’ usability.

“We believe that openly releasing models of this caliber will significantly advance TTS research. Currently, Zonos is a beta preview. While highly expressive, Zonos is sometimes unreliable in generations, leading to interesting artifacts. We are excited to continue pushing the frontiers of conversational agent performance, reliability, and efficiency over the coming months,” the company says in its blog.
“Our highly optimized inference engine powers both the Zonos API and playground, achieving impressive time-to-first-audio (TTFA) metrics. The hybrid model demonstrates particularly efficient performance characteristics, with reduced latency and memory overhead compared to its transformer counterpart, thanks to its Mamba2-based architecture that relies less heavily on attention blocks.”
“In future model releases, we aim to significantly improve the model’s reliability, its ability to handle specific pronunciations, the number of supported languages, and the level of control over emotions and other vocal characteristics afforded to the user. We will also pursue further architectural innovations to boost model quality and inference performance,” they state.
For now, the availability of the Zonos models under an Apache 2.0 license, including the first open-source SSM hybrid audio model, lets the audio industry test the technology directly or through the model API that is now available.
Inference code: github.com/Zyphra/Zonos
www.zyphra.com