

Speech Enhancement

By Shinji Watanabe

Thanks to the maturation of speech recognition and enhancement technologies, the focus of speech processing research has shifted from constrained conversation scenarios (e.g., single-speaker recordings with simulated data) to unconstrained, everyday conversation scenarios. In these scenarios, we must deal with 1) multi-speaker setups with overlapping speech, 2) dynamic sensor and speaker environments, and 3) the difficulty of obtaining supervised real recording data. This proposal tackles these challenging problems by developing novel unsupervised and weakly supervised speech enhancement and separation techniques. We target various real-scene applications, including wearable devices and smart AI speakers. Experiments are conducted by controlling the level of difficulty, from simulation to real environments, to enable clear ablation studies. Accordingly, exploring appropriate simulation setups for various audio scenes, including multichannel recordings, multiple speakers, and moving targets/sources, is also part of our research plan; a sketch of such a setup is shown below.
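As one illustration of such a simulation setup, the minimal sketch below builds a reverberant two-speaker, four-microphone mixture with the pyroomacoustics library. The room geometry, source positions, absorption value, and stand-in noise signals are arbitrary placeholders for illustration, not the parameters or data this research will actually use.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
# Stand-in "speech": random noise bursts; in practice, real utterances
# would be loaded from disk instead.
rng = np.random.default_rng(0)
speech1 = rng.standard_normal(3 * fs)
speech2 = rng.standard_normal(3 * fs)

# Shoebox room with moderately absorptive surfaces (placeholder values).
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.35), max_order=17)

# Two overlapping speakers; the second starts 0.5 s later.
room.add_source([1.5, 2.0, 1.6], signal=speech1)
room.add_source([4.0, 3.5, 1.6], signal=speech2, delay=0.5)

# Four-microphone linear array (3 x M matrix of positions).
mic_locs = np.array([[2.8, 2.9, 3.0, 3.1],
                     [2.5, 2.5, 2.5, 2.5],
                     [1.2, 1.2, 1.2, 1.2]])
room.add_microphone_array(pra.MicrophoneArray(mic_locs, fs))

room.simulate()
mixture = room.mic_array.signals  # shape: (n_mics, n_samples)
```

Moving speakers and devices are not covered by this static sketch; how best to simulate them (e.g., by stitching together short segments with time-varying geometry) is itself one of the open design questions in this research item.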

Unsupervised Neural Speech Separation by Leveraging Unlabeled Over-determined Training Mixtures (UNSSOR)

Deep learning-based supervised learning has shown strong performance in speech separation under simulated conditions, where pairs of simulated noisy-reverberant multi-speaker mixtures and clean speech signals are available for model training. However, it is challenging to simulate training data whose distribution matches that of real-recorded test data, and the resulting models often perform unsatisfactorily. This issue can be addressed by training unsupervised speech separation models directly on real-recorded mixtures. Although unsupervised speech separation is well known to be an ill-posed problem with an infinite number of solutions, our insight is that, for over-determined training mixtures where there are more microphones than speakers, the ill-posed problem can be turned into a well-posed one with a unique solution that is most consistent with the multi-channel mixture. In this proposal, we will design deep neural networks to find this most consistent solution from real-recorded over-determined mixtures and thereby realize unsupervised neural speech separation in single- and multi-microphone scenarios. A sketch of the underlying consistency idea is given below.
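The NumPy sketch below illustrates one way such a consistency criterion could look: for each microphone and frequency, short per-source filters are fit by least squares to re-create the observed mixture from the estimated sources, and the remaining residual serves as an unsupervised loss. The function name, filter length, and exact STFT-domain formulation here are illustrative assumptions, not the precise UNSSOR objective.

```python
import numpy as np

def overdetermined_consistency_loss(Y, S_hat, taps=20):
    """Hedged sketch of an over-determined mixture-consistency loss.

    Y:     (P, F, T) complex STFT of the P-channel observed mixture.
    S_hat: (C, F, T) complex STFT of C estimated speaker signals,
           referenced to one microphone, with P > C (over-determined).

    For each mic p and frequency f, fit per-speaker FIR filters over
    `taps` past frames that best re-create Y[p, f] from the estimates,
    then penalize the remaining residual. A unique "most consistent"
    solution exists only because P > C constrains the problem.
    """
    P, F, T = Y.shape
    C = S_hat.shape[0]
    loss = 0.0
    for p in range(P):
        for f in range(F):
            # Regression matrix: columns are delayed copies of each source.
            cols = []
            for c in range(C):
                s = S_hat[c, f]
                for d in range(taps):
                    cols.append(np.concatenate([np.zeros(d, complex),
                                                s[:T - d]]))
            A = np.stack(cols, axis=1)  # (T, C * taps)
            g, *_ = np.linalg.lstsq(A, Y[p, f], rcond=None)
            resid = Y[p, f] - A @ g
            loss += np.mean(np.abs(resid) ** 2)
    return loss / (P * F)
```

In an actual training loop, such filters would be estimated jointly with the separation network rather than by explicit per-batch least squares, but the sketch conveys why over-determination makes the residual a meaningful unsupervised training signal.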

Weakly Supervised Speech Enhancement via Speech Recognition Objectives

Our group has been actively working on joint modeling of speech enhancement and recognition with a single end-to-end neural network, achieving state-of-the-art performance on several noisy speech recognition benchmarks. As a byproduct of this approach, we found that speech recognition objectives can optimize a speech enhancement network using only transcriptions. This capability has significant potential, since obtaining paired real noisy and clean speech data is almost impossible, while obtaining real noisy data with corresponding transcriptions is relatively easy. In this proposal, we further explore this direction in dynamic environments by considering changes in the number of speakers and speaker-sensor movement in a multi-channel setup. We also plan to combine powerful speech enhancement and recognition techniques, including the above UNSSOR method for speech enhancement and self-supervised learning representations with adaptive/online training capability for speech recognition. In addition to transcription targets, we will consider other metadata (e.g., speaker identity, intention, emotion) and pseudo-label targets obtained from various self-supervised learning models as weak supervision. A sketch of how a recognition objective can supervise an enhancement front-end is shown below.
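As a minimal sketch of this weak-supervision mechanism, the PyTorch model below feeds the output of a toy masking-based enhancer into a recurrent ASR encoder trained with a CTC loss on transcriptions. Because the loss backpropagates through the enhanced features, transcriptions alone provide a training signal for the enhancer. The module sizes and the mask-based enhancer itself are placeholder choices, not our actual joint architecture.

```python
import torch
import torch.nn as nn

class EnhanceThenRecognize(nn.Module):
    """Sketch: an ASR objective as weak supervision for enhancement."""

    def __init__(self, n_feats=80, vocab=500):
        super().__init__()
        # Toy mask estimator operating on (log-mel) feature frames.
        self.enhancer = nn.Sequential(
            nn.Linear(n_feats, 256), nn.ReLU(),
            nn.Linear(256, n_feats), nn.Sigmoid())
        self.encoder = nn.LSTM(n_feats, 256, num_layers=3, batch_first=True)
        self.head = nn.Linear(256, vocab + 1)  # +1 for the CTC blank
        self.ctc = nn.CTCLoss(blank=vocab, zero_infinity=True)

    def forward(self, feats, feat_lens, tokens, token_lens):
        mask = self.enhancer(feats)          # (B, T, F), values in [0, 1]
        enhanced = feats * mask              # enhancement output
        h, _ = self.encoder(enhanced)
        logp = self.head(h).log_softmax(-1)  # (B, T, vocab + 1)
        # CTC expects (T, B, vocab + 1); gradients reach the enhancer
        # through `enhanced`, so transcriptions alone supervise it.
        return self.ctc(logp.transpose(0, 1), tokens, feat_lens, token_lens)
```

Calling loss = model(feats, feat_lens, tokens, token_lens) followed by loss.backward() updates both the recognizer and the enhancer, even though no clean reference signal is ever provided.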