Facebook parent company Meta released the first demo for its new AI-powered audio generator platform, Audiobox, on Monday. The social media giant said Audiobox lets users create custom voices and sound effects using voice inputs and prompts.
Audiobox, Meta said, builds on the technology developed for its Voicebox platform introduced earlier this year, but it surpasses Voicebox in quality and includes automatic watermarking for “responsible use.”
“Audiobox, the successor to Voicebox, is advancing generative AI for audio even further by unifying generation and editing capabilities for speech, sound effects (short, discrete sounds like a dog bark, car horn, a crack of thunder, etc.), and soundscapes, with a variety of input mechanisms to maximize controllability for each use case,” Meta’s Audiobox team said.
Audiobox, the team explained, uses “bespoke solvers,” which it claims make generation more than 25 times faster than previous models without loss of performance.
In June, Meta announced Voicebox, a generative AI tool that the company said can produce audio in six languages (English, French, German, Spanish, Polish, and Portuguese) and can do so in a way that is closer to how people speak naturally in the real world.
With concerns about AI-powered deepfakes rising at the time, Meta said it would not release Voicebox to the public, acknowledging the potential for misuse. To combat misuse of Audiobox, Meta included watermarking.
“Recent advancement in quality and fidelity in the audio generative model has empowered novel applications and use [cases] on the model. However, at the same time, there are many people… raising concerns about the risks of misuse,” the Audiobox team said in its report. “Therefore, the ability to recognize which audio is generated or real is crucial to prevent the [misuse] of the technology and enable certain [platforms] to comply with their policy.”
“Both the Audiobox model and our interactive demo feature automatic audio watermarking so any audio created with Audiobox can be accurately traced to its origin,” Meta said. “Our watermarking method embeds a signal into the audio that’s imperceptible to the human ear but can be detected all the way down to the frame level using a model capable of finding AI-generated segments in [the] audio.”
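To make the idea of frame-level detection concrete, here is a minimal toy sketch in Python. It is not Meta's method: Meta describes a model-based detector, whereas this example swaps in a simple keyed-noise (spread-spectrum style) watermark, and it makes no attempt at true imperceptibility. The frame size, the strength value, and the function names `make_key`, `embed`, and `detect_frames` are all assumptions made up for illustration.

```python
import math
import random

FRAME = 1600  # samples per frame (hypothetical: 100 ms at 16 kHz)


def make_key(n, seed=1234):
    """Secret pseudo-random sequence of +1/-1 values, one per sample."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def embed(audio, key, start, end, strength=0.02):
    """Add a faint keyed signal to samples in [start, end)."""
    out = list(audio)
    for i in range(start, end):
        out[i] += strength * key[i]
    return out


def detect_frames(audio, key, strength=0.02):
    """Flag each frame whose correlation with the key exceeds half the strength."""
    flags = []
    for f in range(0, len(audio) - FRAME + 1, FRAME):
        corr = sum(audio[i] * key[i] for i in range(f, f + FRAME)) / FRAME
        flags.append(corr > strength / 2)
    return flags


# Toy "audio": one second of a quiet 440 Hz tone, split into 10 frames.
audio = [0.05 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(10 * FRAME)]
key = make_key(len(audio))

# Pretend only frames 3 to 5 were AI-generated and therefore watermarked.
marked = embed(audio, key, start=3 * FRAME, end=6 * FRAME)
flags = detect_frames(marked, key)
print(flags)
```

With these toy parameters, the detector should flag only frames 3 through 5, which mirrors the idea of locating generated segments inside a longer clip rather than labeling the whole file.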
“We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms,” the team said. “We allow transcript, vocal, and other audio styles to be controlled independently when generating speech.”
While Audiobox may be faster, Meta acknowledged that audio-generative AI models like it are limited by the amount of training data (in this case, sounds) that is labeled and fed into the model, and emphasized the importance of labeling that data correctly.
For example, the researchers said, labeling the sounds of a chihuahua and a labrador barking with the specific breed is preferable to simply labeling both as “dog barking.” Meta says the same applies to speech patterns like accents and regional dialects.
A Meta spokesperson declined to provide further comment.
Like Google, Microsoft, and Amazon, Meta has invested heavily in artificial intelligence. Earlier this month, Meta announced over 20 new AI-powered features coming to its suite of platforms, including Facebook, Instagram, and WhatsApp.
A proponent of responsible AI development, Meta recently partnered with IBM to launch the AI Alliance, a consortium of over 50 companies, universities, and think tanks focused on open-source AI innovation and development.
“The AI Alliance brings together researchers, developers, and companies to share tools and knowledge that can help us all make progress whether models are shared openly or not,” said Nick Clegg, Meta’s president of global affairs. “We’re looking forward to working with partners to advance the state-of-the-art in AI and help everyone build responsibly.”
Edited by Ryan Ozawa.