Google’s deep learning audio-visual model can enhance the voice of a single person in a video and mute other background noise.
Building a slide deck, pitch, or presentation? Here are the big takeaways:
- Google researchers unveiled a deep learning audio-visual model for isolating a single speech signal from a mix of sounds, including other voices and background noise.
- The model has potential applications in speech enhancement and recognition in videos and in video conferencing.
When you find yourself in a noisy conference hall or networking event, it’s usually pretty easy to focus your attention on the particular person you’re talking to, while mentally “muting” the other voices and sounds around you. This capability—known as the cocktail party effect—comes naturally to humans, but automatically separating an audio signal into its individual speech sources has remained a challenge for computers.
At least, until now: Google researchers have developed a deep learning audio-visual model for isolating a single speech signal from a mix of sounds, including other voices and background noise. As detailed in a new paper, the researchers produced videos in which specific people’s voices are enhanced while all other sounds are suppressed.
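The general principle behind this kind of separation is time-frequency masking: the mixed audio is converted to a spectrogram, a mask selects the time-frequency bins belonging to the target speaker, and the masked spectrogram is converted back to audio. Google's model learns that mask from both audio and video; the sketch below is only an illustration of the masking idea itself, using a hand-crafted frequency mask on synthetic signals rather than a learned model.

```python
# Illustrative sketch of time-frequency masking (NOT Google's model,
# which learns the mask from audio-visual input). Here a hand-crafted
# mask separates a low-frequency "voice" from high-frequency "noise".
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                    # sample rate in Hz
t = np.arange(fs) / fs                        # one second of audio
voice = np.sin(2 * np.pi * 440 * t)           # stand-in for the target voice
noise = 0.8 * np.sin(2 * np.pi * 3000 * t)    # stand-in for background noise
mixture = voice + noise

# Spectrogram of the mixture: rows are frequency bins, columns are time frames.
f, _, Z = stft(mixture, fs=fs, nperseg=512)

# Binary mask: keep only bins below 1 kHz, where our "voice" lives.
mask = (f < 1000)[:, None]

# Apply the mask and invert back to a waveform.
_, enhanced = istft(Z * mask, fs=fs, nperseg=512)
enhanced = enhanced[: len(voice)]

# The masked signal should be much closer to the clean voice than the mixture.
err_before = np.mean((mixture - voice) ** 2)
err_after = np.mean((enhanced - voice) ** 2)
print(err_after < err_before)
```

In the real system the mask is not a fixed frequency cutoff (two voices overlap heavily in frequency); it is predicted per time-frequency bin by a neural network that also watches the target speaker's face.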