Carnegie Mellon University

Flexible Deep Speech Synthesis through Gestural Modeling

By Shinji Watanabe and Alan Black

Voice-based interactions have become the norm everywhere from cars to mobile phones to digital home assistants. As speech-based machine interaction becomes more pervasive, there is increasing demand for, and expectation of, human-like performance and personality from these systems. It is important that a machine deliver a response about the weather, whether on a pleasant sunny day or ahead of an impending hurricane, in an appropriate manner. Machines need to be able to respond sympathetically or emphatically depending on the context of their use. Critically, when machines fail, they should do so in ways humans can understand, so that the technology has no unintended consequences. This project aims to create more natural and flexible speech synthesis technology inspired by human strategies and mechanisms for speech production. By bringing together the science of speech production and current state-of-the-art speech engineering systems, the project aims to impart explainability, naturalness, and flexibility to speech technologies. It has the potential to impact all systems that use speech output, such as automated tutoring, interactive voice response, speech translation in commercial and military settings, digital assistants, robotics, and rehabilitative healthcare applications such as brain-computer interfaces.

Current speech synthesis techniques focus on end-to-end systems that avoid explicit modeling of the internal structure of the speech signal. Consequently, such systems may produce good results but fail to generalize beyond their recorded databases. This project concentrates on incorporating aspects of human speech production into computer speech synthesis. Using data-driven techniques and vocal tract imaging datasets, the project aims to discover and model compositional aspects of the speech signal as described by Articulatory Phonology. Novel deep-learning-based approaches will be developed for the joint optimization of diverse speech representations, such as acoustic, phonological, and physiological data, within an analysis-by-synthesis framework. New strategies will be developed for incorporating these grounded representations into text-to-speech training and will be evaluated across a range of flexible speech synthesis applications.
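
To make the analysis-by-synthesis idea concrete, the sketch below shows one way such joint optimization could be wired up: an analysis network maps acoustic frames to a low-dimensional gestural latent, a synthesis network reconstructs the acoustics from that latent, and an auxiliary loss grounds the latent in measured articulator trajectories when vocal tract imaging data is available. This is a minimal hypothetical sketch under stated assumptions, not the project's actual model; the module structure, dimensions, and loss weighting are all illustrative.

```python
# Hypothetical sketch of joint analysis-by-synthesis training with a gestural latent.
# All names, sizes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn


class AnalysisBySynthesis(nn.Module):
    def __init__(self, n_mels=80, n_gestures=12, hidden=256):
        super().__init__()
        # Analysis: acoustic frames -> per-frame gestural/articulatory-like latent.
        self.analysis = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, n_gestures),
        )
        # Synthesis: gestural latent -> reconstructed acoustic frames.
        self.synthesis = nn.Sequential(
            nn.Linear(n_gestures, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel):
        gestures = self.analysis(mel)       # (batch, time, n_gestures)
        mel_hat = self.synthesis(gestures)  # (batch, time, n_mels)
        return gestures, mel_hat


def joint_loss(mel, mel_hat, gestures, articulatory_target=None, alpha=0.5):
    """Acoustic reconstruction loss plus optional articulatory grounding loss."""
    recon = nn.functional.l1_loss(mel_hat, mel)
    if articulatory_target is not None:
        # Ground the latent in measured articulator trajectories when imaging data exists.
        ground = nn.functional.mse_loss(gestures, articulatory_target)
        return recon + alpha * ground
    return recon


# Toy usage with random tensors standing in for real acoustic/articulatory data.
model = AnalysisBySynthesis()
mel = torch.randn(4, 100, 80)   # batch of 100-frame mel spectrograms
art = torch.randn(4, 100, 12)   # matching articulatory trajectories
gestures, mel_hat = model(mel)
loss = joint_loss(mel, mel_hat, gestures, articulatory_target=art)
loss.backward()
```

In a setup along these lines, a text-to-speech front end could presumably be trained to predict the grounded gestural latent rather than raw acoustics, which is what such grounding would buy: a compositional, physiologically interpretable control space rather than a database-specific acoustic mapping.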