Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary
By Shinji Watanabe and Lori Levin
State-of-the-art automatic speech recognition (ASR) depends upon the existence of a corpus of material (audio recordings with time-coded transcriptions) and the application of artificial intelligence systems that use neural networks to mimic the way humans learn by interpreting raw data. The present project employs what is called an "end-to-end neural network": the network is presented with input data (the acoustic speech signal) together with a prepared end result (a transcription) and learns to produce the latter from the former. To accomplish this, the original corpus is divided into training (~80%), validation (~10%), and test (~10%) sets.

For endangered-language documentation, the goal is not simply the accuracy of the ASR system but also the reduction of the human effort needed to achieve highly accurate time-coded transcriptions that will be archived as a permanent record of the target language. The project team has already developed a highly accurate system for one phonologically difficult tonal language (character error rate <8%) and reduced the human effort required to produce an accurate time-coded transcription by more than 75% (from the 40 hours needed by a human starting from scratch to the 9 hours needed by a human proofreading a transcription generated by ASR).

In this project, the same team will explore ASR strategies for a morphologically complex agglutinative language, with the aim of achieving the same degree of accuracy and reduction in human effort. The project will also address another challenge for state-of-the-art ASR: the transfer of an effective system developed for one language to low-resource, virtually undocumented related languages. If successful, the project will serve as a model for similar efforts with other languages and language groups.
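As a concrete illustration of the end-to-end setup described above, the following minimal PyTorch sketch pairs acoustic feature frames with character-level transcriptions and trains with CTC loss. The class name TinyCTCASR, the feature dimensions, and the alphabet size are illustrative assumptions, not the project's actual architecture.

```python
import torch
import torch.nn as nn

class TinyCTCASR(nn.Module):
    """Minimal end-to-end ASR sketch: acoustic frames in, per-frame
    character log-probabilities out, trained with CTC loss. Illustrative
    only; the project's real model is not specified in the abstract."""
    def __init__(self, n_mels=80, hidden=256, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, feats):            # feats: (batch, time, n_mels)
        enc, _ = self.encoder(feats)
        return self.head(enc).log_softmax(dim=-1)

model = TinyCTCASR()
ctc = nn.CTCLoss(blank=0)
feats = torch.randn(4, 200, 80)           # 4 dummy utterances, 200 frames each
targets = torch.randint(1, 31, (4, 20))   # character-index transcriptions
log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (time, batch, chars)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 20, dtype=torch.long))
loss.backward()
```

CTC is only one common training objective for end-to-end ASR; the abstract does not say which loss the team uses, so this block should be read as a generic instance of the approach.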
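The 80/10/10 corpus division mentioned above can be realized as a simple shuffle and slice over utterances. The helper below is a generic sketch (the name split_corpus and the fixed seed are illustrative); in practice documentary corpora are often split by speaker or recording session to avoid leaking material between sets.

```python
import random

def split_corpus(utterances, seed=0):
    """Shuffle and divide a list of utterances into ~80/10/10
    train / validation / test sets (illustrative helper)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    utts = list(utterances)
    rng.shuffle(utts)
    n_train = int(0.8 * len(utts))
    n_val = int(0.1 * len(utts))
    return (utts[:n_train],                  # training set
            utts[n_train:n_train + n_val],   # validation set
            utts[n_train + n_val:])          # test set

train, val, test = split_corpus([f"utt{i:04d}" for i in range(1000)])
```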
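The character error rate quoted above is conventionally computed as the Levenshtein (edit) distance between the reference and hypothesis transcriptions divided by the reference length, so a CER below 8% means fewer than 8 character edits per 100 reference characters. A minimal implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences,
    computed with a single rolling row of the DP table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, and substitution/match, respectively
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    """Character error rate for a non-empty reference string."""
    return edit_distance(ref, hyp) / len(ref)

print(cer("kitten", "sitting"))  # 0.5, i.e. 3 edits over 6 reference characters
```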
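For the transfer step, one common strategy, sketched here as an assumption rather than the team's stated method, is to reuse the acoustic encoder trained on the well-documented language, re-initialize only the output layer for the related language's character inventory, and fine-tune the transferred weights at a reduced learning rate.

```python
import torch
import torch.nn as nn

class TinyCTCASR(nn.Module):  # same shape as the earlier sketch; forward omitted
    def __init__(self, n_mels=80, hidden=256, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_chars + 1)

source = TinyCTCASR(n_chars=30)  # stands in for the model trained on the tonal language
target = TinyCTCASR(n_chars=34)  # related language with a different character inventory

# Copy the acoustic encoder; the output layer stays freshly initialized
# because the two languages' character sets differ.
target.encoder.load_state_dict(source.encoder.state_dict())

# Fine-tune: small steps for transferred weights, larger steps for the new head.
optimizer = torch.optim.Adam([
    {"params": target.encoder.parameters(), "lr": 1e-5},
    {"params": target.head.parameters(), "lr": 1e-3},
])
```

The assumption behind this design is that low-level acoustic representations transfer well between related languages, while the mapping to characters must be relearned; whether the project freezes, slows, or fully retrains the encoder is not specified in the abstract.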