Target Sound Extraction

Research Visualizations

Proposed Model Architecture

Audio Clips for Demonstration

Input

Single-target

GT

Single-target

Output

Single-target

Input

Multi-target

GT

Multi-target

Output

Multi-target

Overview

Imagine yourself in a bustling café, eager to hear your friend's speech amid the mixture of music, keyboard clatter, and ambient noise. Your brain effortlessly filters through these sounds, focusing solely on your friend's speech, aided by clues like their appearance and direction. What if we could train a deep learning model to do the same? Our research is dedicated to harnessing the potential of deep learning algorithms to precisely extract a specific sound from a complex audio mixture, regardless of its composition.

We're pushing the boundaries of sound extraction by leveraging advanced neural network techniques. Our goal is to develop robust models capable of isolating a target sound amidst a variety of sounds, even in challenging real-world environments with background noise and reverberation. In our recent work, we introduced a Transformer-based model designed specifically for extracting reverberant sounds.

Our approach builds on the Dense Frequency-Time Attentive Network (DeFT-AN) architecture, originally developed for speech enhancement tasks. This architecture generates a complex short-time Fourier transform (STFT) mask to separate clean speech from noisy, reverberant mixtures. To make DeFT-AN compatible with the target sound extraction task, we modify its architecture such that the embedding vector for the target class label can be fused in the middle of sequentially connected DeFT-A blocks constituting DeFT-AN.

The figures below illustrate our model architecture and the results of extracting reverberant target sounds, showcasing the effectiveness of our approach. We continue to refine our Transformer-based models to meet the challenges of real-world sound extraction using multiple clues.

Focus Areas

Target sound extraction using deep learning.

Related Publications

Navigating Target Sound Extraction With Effective Directional Clue Integration
Dayun Choi, Jung-Woo Choi•IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, 2026•2026
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
D. Choi and J-W. Choi•IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hyderabad, India, Apr•2025
Target sound extraction on reverberant mixture using sound-class label and timestamp information
D. Choi, and J-W. Choi*•2024 Spring Conference of the Acoustical Society of Korea, Seogwipo, Korea, pp. 335-336, May•2024