Natural Language Processing and Computational Linguistics

Unsupervised Transcription of Early Modern Documents

While digital humanities research has taken great strides in the past decade, most of this work has focused on contemporary data that is already in digital form, such as the blogosphere, Twitter, and news. When research does explore historical data, it is restricted to documents that can be accurately transcribed into text and, as a result, is biased towards particular time periods. In particular, the 400 years just after the invention of the printing press (the early modern period) represent a critical dark period for digital humanities research, because early modern documents are notoriously hard to transcribe into text with automatic methods. For example, Google’s open-source Tesseract system incorrectly transcribes more than half of the words in a dataset of early modern English court proceedings. As a result, research is limited to the relatively small number of early modern documents that have been transcribed manually, and most of the recorded data from this period remains effectively locked away. We propose to develop transcription technology based on a fundamentally new model that specifically targets early modern printed documents and induces font and text structure directly from unannotated document images in an unsupervised fashion. The key idea of our approach is that, while properties like font and text structure are document-specific and therefore difficult to treat generally with supervised approaches, these phenomena are in fact regular within individual documents. Models that build this regularity in as an assumption can constrain the otherwise difficult unsupervised learning problem and make it feasible. By making it possible to transcribe this data, the project will open up new possibilities for research in early modern studies and has the potential for further applications in speech recognition, music transcription, and decipherment.