Universal Speech and Acoustic Processing
This joint project of CMU and the National Institute of Advanced Industrial Science and Technology (AIST), Japan, seeks to develop a speech and audio foundation model (a self-supervised learning model) that can process more than 1,000 languages, along with technologies such as speech recognition, speech emotion recognition, sound source separation, and speech synthesis, by leveraging CMU’s experience in speech and language processing and AIST’s experience in acoustic signal processing. We will construct and publish a transparent speech and audio foundation model and dataset by documenting the origin and nature of the speech and audio data and the model training process, contributing to academic research and creating technologies that can be readily used for industrial purposes.