Carnegie Mellon University


Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary

By Shinji Watanabe and Lori Levin

Computational tools such as automatic speech recognition (that is, the conversion of speech to text) are increasingly used to facilitate and mediate communication. Doctors speak into their computers, which transcribe their speech into legible written summaries; online virtual assistants have become ubiquitous in support networks in a wide range of situations; and end users increasingly expect their speech to be understood, processed, and acted upon by cell phones, navigation devices, and tools such as Alexa. The creation of such mechanisms, however, currently depends on a large amount of training data (speech and text) that is available only for major languages. It is quite challenging to develop speech recognition systems when only 10 hours of transcribed audio are available. One way of addressing this problem is through transfer learning, in which a speech recognizer trained on a relatively large amount of data for one endangered language (more than 50 hours of transcribed audio) is then extended to related languages for which only a small corpus of material will be developed (10 hours of transcribed audio and 90 hours of untranscribed audio). The objectives of this project are both theoretical and substantive. On the theoretical side, the project advances the development of natural language processing for low-resource languages and establishes a protocol for extending it to other related languages. Substantively, the project produces an unprecedented corpus of transcribed audio for five related languages, facilitating the comparative study of these languages by theoretical and descriptive linguists. The data and findings will be available at the Linguistic Data Consortium at the University of Pennsylvania and the Sam Noble Oklahoma Museum of Natural History at the University of Oklahoma.
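As a rough illustration of the transfer-learning setup described above, the sketch below "pretrains" a tiny character-level acoustic model on synthetic data standing in for the well-documented language and then fine-tunes it on a much smaller synthetic set standing in for a related low-resource language. This is a minimal PyTorch sketch under assumed feature sizes, model architecture, and layer-freezing strategy; it is not the project's actual system.

# Minimal transfer-learning sketch (illustrative only, not the project's system).
import torch
import torch.nn as nn

torch.manual_seed(0)

N_MELS, N_CHARS = 80, 40          # assumed acoustic feature and character-set sizes

class TinyASR(nn.Module):
    def __init__(self, n_mels=N_MELS, n_chars=N_CHARS, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_chars)   # per-frame character logits

    def forward(self, feats):                          # feats: (batch, time, n_mels)
        out, _ = self.encoder(feats)
        return self.classifier(out)                    # (batch, time, n_chars)

def train(model, feats, labels, epochs, lr):
    opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(feats)
        loss = loss_fn(logits.reshape(-1, N_CHARS), labels.reshape(-1))
        loss.backward()
        opt.step()
    return loss.item()

# "Well-documented" language: many utterances (synthetic stand-in for >50 hours).
src_feats = torch.randn(64, 200, N_MELS)
src_labels = torch.randint(0, N_CHARS, (64, 200))

# Related low-resource language: far fewer utterances (stand-in for ~10 hours).
tgt_feats = torch.randn(8, 200, N_MELS)
tgt_labels = torch.randint(0, N_CHARS, (8, 200))

model = TinyASR()
print("pretrain loss:", train(model, src_feats, src_labels, epochs=5, lr=1e-3))

# Transfer: keep the acoustic encoder, re-initialize the output layer for the
# new language's character inventory, and fine-tune with a smaller learning rate.
model.classifier = nn.Linear(128, N_CHARS)
for p in model.encoder.parameters():
    p.requires_grad = False        # freeze shared layers (one common choice, assumed here)
print("fine-tune loss:", train(model, tgt_feats, tgt_labels, epochs=5, lr=1e-4))

In practice the shared encoder carries over what it has learned about mapping acoustics to symbols, which is why far less transcribed audio is needed for the related language; whether to freeze or merely slow down the shared layers is a design choice that the project itself would determine empirically.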

State-of-the-art automatic speech recognition (ASR) depends on the existence of a corpus of material (audio recordings with time-coded transcriptions) and on artificial intelligence systems that use neural networks to replicate human learning by interpreting raw data. The present project employs what is called an "end-to-end neural network": the network is presented with input data (the acoustic speech signal) and a prepared end result (a transcription) and learns to produce the same result on its own. To accomplish this, the original corpus is divided into training (~80%), validation (~10%), and test (~10%) sets. For endangered language documentation, the goal is not simply the accuracy of the ASR system but also the reduction of the human effort needed to achieve highly accurate time-coded transcriptions that will be archived as a permanent record of the target language. The project team has already developed a highly accurate system for one phonologically difficult tonal language (character error rate <8%) and reduced the human effort required to produce an accurate time-coded transcription by more than 75% (from 40 hours needed by a human starting from scratch to 9 hours needed by a human proofreading a transcription generated by ASR). For this project, the same team will explore ASR strategies for a morphologically complex agglutinative language in the hope of achieving the same degree of accuracy and reduction in human effort. The project will also address another challenge for state-of-the-art ASR: the transfer of an effective system developed for one language to low-resource, virtually undocumented related languages. Should the project be successful, it will serve as a model for similar efforts with other languages and language groups.
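The two quantities mentioned above, the 80/10/10 corpus split and the character error rate, are straightforward to compute. The sketch below shows one plausible way to do so in Python, assuming the corpus is simply a list of (audio file, transcript) pairs; the helper names and example strings are illustrative, not drawn from the project.

# Illustrative helpers: a train/validation/test split and character error rate (CER),
# where CER is Levenshtein edit distance divided by the reference length.
import random

def split_corpus(utterances, seed=0):
    """Shuffle and split into ~80% train, ~10% validation, ~10% test."""
    rng = random.Random(seed)
    items = list(utterances)
    rng.shuffle(items)
    n = len(items)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

def cer(reference, hypothesis):
    """Character error rate: edit distance / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))               # standard dynamic-programming edit distance
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

if __name__ == "__main__":
    corpus = [(f"utt{i}.wav", f"transcript {i}") for i in range(100)]
    train_set, valid_set, test_set = split_corpus(corpus)
    print(len(train_set), len(valid_set), len(test_set))   # 80 10 10
    print(round(cer("recognition", "recogniton"), 3))      # one missing character -> ~0.091

A CER below 8%, as reported for the team's earlier system, means fewer than 8 character-level corrections are needed per 100 reference characters, which is what makes human proofreading of ASR output so much faster than transcribing from scratch.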