Microsoft's AI Program Can Clone Voices From 3-Second Audio Clips

Microsoft's AI program, VALL-E, has made a breakthrough in speech synthesis: it can clone a voice after listening to a person speak for only 3 seconds. The technology is designed for text-to-speech synthesis and is capable of replicating human-like speech patterns.

Researchers at Microsoft developed VALL-E by training the system on more than 60,000 hours of English audiobooks narrated by more than 7,000 different speakers. This training set is hundreds of times larger than those used for previously developed text-to-speech programs.

PCMag reports that Microsoft has published several VALL-E demos on a website that showcases its capabilities. The AI program not only clones a person's voice using a 3-second audio clip but can also manipulate the cloned voice to say anything desired.

Furthermore, the program can replicate the emotions conveyed in a person's voice or configure it to adopt different speech styles.

Voice cloning is not a new concept, but Microsoft's research stands out for making it easy to replicate anyone's voice from a short audio clip. This has raised concerns about potential misuse of the technology in cybercrime, a risk that Microsoft's research team acknowledges.

"As VALL-E can synthesize speech that preserves the speaker's identity, it can bring the potential risks of misuse, such as impersonating a specific speaker or faking voice identification," the researchers wrote in their paper.

The team is considering building a program that can distinguish audio clips synthesized by VALL-E from genuine recordings. VALL-E itself interprets speech audio as "discrete tokens," which it then uses to make the cloned voice speak different texts.
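To give a rough sense of what turning audio into "discrete tokens" means, here is a minimal sketch in Python. It is not VALL-E's method: the paper describes a learned neural audio codec, while this example swaps in a much simpler classic technique, mu-law quantization, purely for illustration. The function names `encode` and `decode` and the 440 Hz test tone are assumptions made for this example.

```python
import math

MU = 255  # 8-bit mu-law: tokens range from 0 to 255


def encode(samples):
    """Map float samples in [-1, 1] to integer tokens in [0, 255]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        # Compress the amplitude logarithmically, keeping the sign.
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        # Rescale from [-1, 1] to [0, 255] and round to an integer token.
        tokens.append(int(round((y + 1) / 2 * MU)))
    return tokens


def decode(tokens):
    """Approximately invert encode(): tokens back to float samples."""
    samples = []
    for t in tokens:
        y = 2 * t / MU - 1  # back to [-1, 1]
        # Undo the logarithmic compression.
        x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
        samples.append(x)
    return samples


# A short 440 Hz tone sampled at 16 kHz stands in for recorded speech.
wave = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
tokens = encode(wave)     # a sequence of small integers
restored = decode(tokens)  # close to, but not identical to, the input
```

Each sample becomes one of 256 integer symbols, so the audio can be handled as a sequence of tokens much like text. A system along the lines of VALL-E would use far richer learned tokens, but the basic idea of replacing a continuous waveform with symbols from a fixed vocabulary is the same.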

The technology is still far from perfect. Microsoft's research team notes that VALL-E sometimes struggles or fails to pronounce certain words. At other times, the words can sound artificially synthesized, robotic, or toneless.

"Even if we use 60,000 hours of data for training, it still cannot cover the voice of everyone, especially those who speak with an accent," the team explained. "In addition, the diversity of speech styles is not enough, as LibriLight (the audio VALL-E has been actively trained on) is a collection of audio books, where most speech is in a reading style."

However, the research suggests that a more accurate voice-cloning program could be built by training it on more audio. Microsoft has not released VALL-E to the public, most likely to prevent misuse.