Thursday, August 6, 2020 - 10:00am to 12:00pm


to take place via Zoom


Sai Krishna Rallabandi

Event Website:

For More Information, Contact:

Stacey Young,


Alan W Black, (Chair)
LP Morency
Eric H Nyberg
Kalika Bali, (Microsoft Research India)


Speech-driven devices and interfaces such as Apple HomePod, Google Home, and Amazon Echo are becoming ubiquitous and have tremendous potential to affect our daily lives. However, the deep learning models underlying these applications face unaddressed challenges such as scalability and explainability, as well as concerns such as privacy and security. Current models are built using a task-based framework. With this design, challenges and concerns cannot be handled holistically but must be addressed per task. Addressing each challenge separately for each task leads to local, repetitive solutions that do not exploit the global, task-fluid nature of real-life applications.

I propose a concept-based framework called De-Entanglement that treats linguistic concepts, rather than tasks, as first-class objects. De-Entanglement attempts to build speech technology using two core concepts, referred to as content and style. Specifically, content encompasses acoustic-phonetic information, while style encompasses paralinguistic information from the raw audio. Because all individual tasks share these linguistic concepts, I argue that solutions designed using De-Entanglement hold promise to be shared across tasks. In my dissertation, I provide experiments that show how De-Entanglement can address three challenges in a holistic fashion:

Scalability: How to build speech technologies for new languages and language phenomena? Using code-switching as the language phenomenon, I demonstrate how De-Entanglement helps build more natural TTS voices.

Flexibility: How to build models that can be manipulated to accomplish a variety of functions? I present experiments to show that De-Entanglement allows (a) detection of various paralinguistic phenomena and (b) explicit global and local control of synthetic voices.

Explainability: How to build speech technology whose performance can be explained? Using language identification from acoustics as the target application and acoustic unit discovery as an auxiliary task, I show how De-Entanglement can provide justifications for model predictions.

I conclude with case studies demonstrating the application of De-Entanglement to modalities beyond speech.

A copy of the thesis proposal is at the following link.


LTI PhD Thesis Proposal