Meta Unveils Voicebox: Next-Gen Voice Synthesis Model
Meta Platforms’ AI research division has unveiled Voicebox, a machine learning model that generates speech from text. Unlike traditional text-to-speech models, Voicebox can handle a range of tasks, including editing, noise removal, and style transfer, without task-specific training.
The model was trained with a method devised by Meta researchers. Meta has not released Voicebox, citing ethical concerns about potential misuse, but the initial findings are encouraging and point to a wide range of future applications.
Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. Like large language models, it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.
Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s researchers write in their paper (PDF) describing the technical details of Voicebox.
The model was trained using Meta’s “Flow Matching” technique, which is more efficient and generalizable than diffusion-based learning methods used in other generative models. The technique enables Voicebox to “learn from varied speech data without those variations having to be carefully labeled.” Without the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech and transcripts from audiobooks.
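To give a rough sense of what a Flow Matching objective looks like, here is a minimal Python sketch of the straight-path variant described in the general Flow Matching literature. The article does not spell out Voicebox's exact objective, so the function names, the `sigma_min` value, and the short lists of numbers standing in for audio features are all assumptions made for illustration, not Meta's implementation.

```python
# Illustrative sketch only: a straight-path Flow Matching training target.
# Names and the sigma_min default are assumptions, not taken from Voicebox.

def flow_matching_pair(noise, data, t, sigma_min=1e-5):
    """Return (x_t, u_t) for a time t between 0 and 1.

    x_t is a point on the straight path from the noise sample to the data
    sample; u_t is the velocity a model would be trained to predict there.
    """
    if len(noise) != len(data):
        raise ValueError("noise and data must have the same length")
    scale = 1 - sigma_min
    x_t = [(1 - scale * t) * n + t * d for n, d in zip(noise, data)]
    u_t = [d - scale * n for n, d in zip(noise, data)]
    return x_t, u_t


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    if len(predicted_velocity) != len(target_velocity):
        raise ValueError("velocity vectors must have the same length")
    squared = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)
```

In this sketch a model would be shown `x_t` and `t` and asked to output a velocity; the loss compares that output with `u_t`. Because the path is a straight line, the target velocity does not depend on `t`.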
The model uses “text-guided speech infilling” as its training objective: it must predict a segment of speech given the surrounding audio and the complete text transcript. During training, the model receives an audio sample and its corresponding text. Parts of the audio are masked, and the model tries to generate the masked part using the surrounding audio and the transcript as context. By doing this repeatedly, the model learns to generate natural-sounding speech from text in a generalizable way.
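The masking step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: audio is represented as a plain list of numbers (one per frame), the helper names are hypothetical, and the squared-error score is a simple stand-in rather than the objective Meta actually used.

```python
# Toy sketch of text-guided speech infilling; all names are hypothetical.

def make_infilling_example(frames, transcript, mask_start, mask_end):
    """Hide frames[mask_start:mask_end] and keep the rest as context."""
    if not 0 <= mask_start < mask_end <= len(frames):
        raise ValueError("mask span must lie inside the audio")
    mask = [mask_start <= i < mask_end for i in range(len(frames))]
    # Masked frames are replaced by None so the model cannot see them.
    context = [None if hidden else f for f, hidden in zip(frames, mask)]
    return {
        "context": context,          # surrounding audio the model may use
        "mask": mask,                # True where audio must be generated
        "transcript": transcript,    # the complete text is always visible
        "target": frames[mask_start:mask_end],
    }


def masked_error(predicted, example):
    """Mean squared error over the masked frames only (stand-in metric)."""
    target = example["target"]
    if len(predicted) != len(target):
        raise ValueError("prediction must cover exactly the masked frames")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

Note that the transcript is passed through whole, while only the audio is partially hidden. That is what lets the same setup later serve for tasks such as noise removal and editing.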
Replicating voices across languages, editing out mistakes in speech, and more
Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it has not been trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak or to customize the voices of non-player characters in games and of virtual assistants.
Voicebox also performs style transfer in different ways. For example, you can provide the model with two audio samples and their text. It will use the first audio sample as a style reference and modify the second one to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way — even if they don’t speak the same languages.”
The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.
The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.
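Both editing scenarios come down to choosing which stretch of audio to mask and which transcript to supply. Since Voicebox has not been released, there is no public API to show; the following Python sketch only illustrates how a time span in seconds could be turned into a frame-level mask. The function names and the default of 100 frames per second are assumptions.

```python
import math

# Hypothetical helpers; Voicebox has no public API, so this is illustration only.

def time_span_to_mask(total_seconds, start_seconds, end_seconds,
                      frames_per_second=100):
    """Return a list of booleans, True for frames that should be regenerated."""
    if not 0 <= start_seconds < end_seconds <= total_seconds:
        raise ValueError("span must lie inside the recording")
    total_frames = math.ceil(total_seconds * frames_per_second)
    # Round outward so the mask fully covers the requested span.
    first = math.floor(start_seconds * frames_per_second)
    last = min(total_frames, math.ceil(end_seconds * frames_per_second))
    return [first <= i < last for i in range(total_frames)]


def build_edit_request(mask, transcript):
    """Bundle the mask with the transcript the regenerated audio should follow."""
    return {"mask": mask, "transcript": transcript}
```

For noise removal the transcript stays the same and only the mask marks the noisy segment; for correcting a misspoken word, the mask covers that word and the transcript carries the corrected text.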
One of the interesting applications of Voicebox is voice sampling. The model can generate various speech samples from a single text sequence. This capability can be used to generate synthetic data to train other speech processing models. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” Meta writes.
Voicebox has limits too. Since it has been trained on audiobook data, it does not transfer well to conversational speech that is casual and contains non-verbal sounds. It also doesn’t provide full control over different attributes of the generated speech, such as voice style, tone, emotion, and acoustic condition. The Meta research team is exploring techniques to overcome these limitations in the future.