Single-Channel Distance-Based Source Separation for Mobile GPU in Outdoor and Indoor Environments


Abstract

This study emphasizes the significance of exploring distance-based source separation (DSS) in outdoor environments. Unlike existing studies, which primarily focus on indoor settings, the proposed model is designed to capture the unique characteristics of outdoor audio sources. It combines a two-stage conformer block, a linear relation-aware self-attention (RSA), and a TensorFlow Lite GPU delegate. Although the linear RSA may not capture physical cues as explicitly as the quadratic RSA, it enhances the model’s context awareness, which improves performance on DSS, a task that requires an understanding of physical cues in both outdoor and indoor environments. The experimental results demonstrate that the proposed model overcomes the limitations of existing approaches and considerably improves energy efficiency and real-time inference speed on mobile devices.

Index Terms:
distance-based source separation, TFLite GPU delegate, relation-aware self-attention, conformer, linear attention, on-device
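
The abstract contrasts a linear RSA with a quadratic one. As a rough illustration of that cost difference only, the sketch below compares standard softmax attention with a generic kernelized linear attention in NumPy. It is not the paper's RSA: the relation-aware (relative position) terms are omitted, and the `elu(x) + 1` feature map and all function names are assumptions chosen for this example.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Quadratic attention: materializes the full (T, T) weight matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def feature_map(x):
    """elu(x) + 1, a positive feature map (assumed here, not taken from the paper)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Kernelized attention: cost grows linearly with sequence length T."""
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v                    # (d, d_v) summary of all keys and values
    z = q_f @ k_f.sum(axis=0)         # (T,) per-query normalizer
    return (q_f @ kv) / z[:, None]


def linear_attention_explicit(q, k, v):
    """Same quantity computed through the (T, T) matrix, used only as a check."""
    q_f, k_f = feature_map(q), feature_map(k)
    weights = q_f @ k_f.T
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The linear variant relies on reordering the matrix products, computing `k_f.T @ v` first, so the (T, T) matrix never has to be stored. This reordering is only possible because the softmax is replaced by a feature map, which is one way to read the abstract's remark that the linear form may model some cues less explicitly than the quadratic one.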
Demo page of DSS







1. Comparison of all architectures using the simulated samples

All samples can be found in the "samples/sim" directory.


2mix Indoor

Near 1 / Far 1
RT60 (s) | GT_near | GT_far | GT_mix | Proposed near | Proposed far | TF_DPRNN near | TF_DPRNN far | Baseline near | Baseline far
0.15
0.40
1.00

3mix Indoor

Near 1 / Far 2
RT60 (s) | GT_near | GT_far | GT_mix | Proposed near | Proposed far | TF_DPRNN near | TF_DPRNN far | Baseline near | Baseline far
0.15
0.40
1.00







2. Test results of the two proposed models, trained on datasets mixed at ratios of 100:0 and 60:40, respectively.

All samples can be found in the "samples/sim" directory.


2mix Indoor

Near 1 / Far 1
RT60 (s) | GT_near | GT_far | GT_mix | (100:0) near | (100:0) far | (60:40) near | (60:40) far
0.15
0.40
1.00

2mix Outdoor

(* RT60 values are not meaningful in outdoor environments and can be ignored.)
Near 1 / Far 1
RT60 (s) | GT_near | GT_far | GT_mix | (100:0) near | (100:0) far | (60:40) near | (60:40) far
0.15
0.40
1.00

3mix Indoor

Near 1 / Far 2
RT60 (s) | GT_near | GT_far | GT_mix | (100:0) near | (100:0) far | (60:40) near | (60:40) far
0.15
0.40
1.00

3mix Outdoor

(* RT60 values are not meaningful in outdoor environments and can be ignored.)
Near 1 / Far 2
RT60 (s) | GT_near | GT_far | GT_mix | (100:0) near | (100:0) far | (60:40) near | (60:40) far
0.15
0.40
1.00







3. Real data samples.


All samples can be found in the "samples/real" directory.


S1 Indoor - Near 1 (male) / Far 1 (male), Quiet room

S2 Indoor - Near 1 (male) / Far 2 (male & female), High reverb.

S3 Indoor - Near 1 (female) / Far 1 (female), High reverb.

S4 Outdoor - Near 1 (male) / Far 1 (male), Wind noise (* Figure 5 in paper)

Sample | Input | (100:0) near | (100:0) far | (60:40) near | (60:40) far
S1
S2
S3
S4