Recent advances in machine learning models for natural language processing (NLP) have encouraged research on language generation and understanding across diverse application scenarios. However, the success of these models is mostly driven by supervised learning on large amounts of labeled data. More importantly, supervised machine learning approaches usually rest on the idealized assumption that the training and test data come from the same source. This assumption often fails in practice, especially when NLP models are deployed on real text data that comes from diverse domains (e.g., social media or medical articles), is written in different languages (e.g., English or Japanese), or is paired with different data modalities (e.g., structural graphs or images). These discrepancies between training and test data challenge the generalization ability of supervised learning models in many NLP applications across domains, languages, and modalities.
In this thesis, we aim to improve the generalization ability of NLP models under these discrepancies. We exploit indirect supervision from widely available raw data and sparse human feedback to train neural NLP models (e.g., neural machine translation models and contextualized language models), and we provide evaluation methods for examining the generalization capabilities of NLP models across diverse application scenarios. The thesis consists of three parts. The first part investigates semi-supervised and unsupervised adaptation approaches for neural machine translation, a popular language generation task, to remedy the domain shift problem. The second part evaluates the zero-shot cross-lingual generalization ability of pre-trained language models that are trained on annotated English text and tested on non-English text. It further proposes two explicit alignment-based objectives for learning multilingual representations in a shared embedding space from multilingual parallel corpora, which are widely available for training neural machine translation models. Finally, the third part proposes methods for integrating other data modalities, such as structural graphs and visual data, into neural language generation models, with the goal of learning from indirect multi-modal supervision.