Monday, July 27, 2020 - 3:00pm to 5:00pm
Location: To take place via Zoom
For More Information, Contact: Stacey Young, firstname.lastname@example.org
Graham Neubig (co-chair)
Taylor Berg-Kirkpatrick (co-chair)
Kevin Gimpel (Toyota Technological Institute at Chicago)
Representation learning has had a tremendous impact in machine learning and natural language processing (NLP), especially in recent years. Learned representations provide useful features needed for downstream tasks, allowing models to incorporate knowledge from billions of tokens of text. The result is better performance and generalization on many important problems of interest. Often these representations can also be used in an unsupervised manner to determine the degree of semantic similarity of text or for finding semantically similar items, the latter useful for mining paraphrases or parallel text. Lastly, representations can be probed to better understand what aspects of language have been learned, bringing an additional element of interpretability to our models.
This thesis focuses on the problem of learning paraphrastic representations for units of language. These units range from sub-words to words, phrases, and full sentences, with sentences being the focal point. Our primary goal is to learn models that can encode arbitrary word sequences into a vector such that sequences with similar semantics are near each other in the learned vector space, and such that this property transfers across domains.
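The "near each other in the learned vector space" property can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure, which this abstract does not specify, and the vectors are hand-picked toy values, not outputs of any model from the thesis. The helper names `cosine` and `most_similar` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Return the key whose vector has the highest cosine similarity to the query."""
    return max(candidates, key=lambda key: cosine(query, candidates[key]))


# Toy sentence vectors, chosen by hand purely for illustration (an assumption).
embeddings = {
    "a man is playing a guitar": [0.9, 0.1, 0.0],
    "the stock market fell today": [0.0, 0.2, 0.9],
}
# Stand-in vector for a paraphrase such as "someone plays the guitar".
query = [0.8, 0.2, 0.1]
print(most_similar(query, embeddings))
```

With a trained encoder in place of the toy vectors, the same nearest-neighbor lookup is what would be used to score semantic similarity or to mine paraphrases and parallel text.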
We first show several effective and simple models, paragram and charagram, to learn word and sentence representations on noisy paraphrases automatically extracted from bilingual corpora. These models outperform contemporary and more complicated models on a variety of semantic evaluations.
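As a rough illustration of representing text through character n-grams, the sketch below extracts n-grams from a string and sums their vectors. This is a simplified stand-in and not the charagram model itself: the abstract does not give the model's details, and the n-gram sizes, the padding scheme, the lookup table, and the function names `char_ngrams` and `embed` are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n_values=(2, 3)):
    """Count character n-grams of the given sizes, padding with one space on each side."""
    padded = f" {text.strip()} "
    grams = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def embed(text, table, dim):
    """Sum the vectors of all known n-grams in the text, weighted by their counts."""
    total = [0.0] * dim
    for gram, count in char_ngrams(text).items():
        vec = table.get(gram)
        if vec is None:
            continue  # n-grams missing from the table contribute nothing
        for j in range(dim):
            total[j] += count * vec[j]
    return total


# Hypothetical two-dimensional n-gram table; real vectors would be learned.
table = {"cat": [1.0, 2.0], "at ": [0.5, 0.5]}
print(embed("cat", table, dim=2))
```

Because the representation is built from sub-word pieces, a misspelled or unseen word still shares most of its n-grams with related words, which is one motivation for working below the word level.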
We then propose techniques to enable deep networks to learn effective semantic representations, addressing a limitation of our prior work. We found that in order to learn representations for sentences with deeper, more expressive neural networks, we need large amounts of sentential paraphrase data. Since such data did not yet exist, we used neural machine translation models to create ParaNMT-50M, a corpus of 50 million English paraphrases that has found numerous uses among NLP researchers, in addition to providing further gains on our learned paraphrastic sentence representations.
We next propose models for bilingual paraphrastic sentence representations. We first propose a simple and effective approach that outperforms more complicated methods on cross-lingual sentence similarity and bitext mining, and we show that we can also achieve strong monolingual performance without paraphrase corpora by using only parallel text. We then propose a generative model capable of concentrating semantic information into our embeddings and separating out extraneous information by treating parallel text as two different views of a semantic concept. We found that this model improves performance on both monolingual and cross-lingual tasks. Lastly, we extend this bilingual model to the multilingual setting and show that it can be effective on multiple languages simultaneously, significantly surpassing contemporary multilingual models.
Finally, this thesis concludes by showing applications of our learned representations and ParaNMT-50M. The first is generating paraphrases with syntactic control in order to make classifiers more robust to adversarial attacks. We found that we can generate a controlled paraphrase for a sentence by supplying just the top production of the desired constituent parse; the generated sentence will follow this structure, filling in the rest of the tree as needed to create the paraphrase. The second application is using our representations to fine-tune neural machine translation systems with minimum risk training. The conventional approach is to use BLEU (Papineni et al., 2002) as the training metric, since that is what is commonly used for evaluation. However, we found that using an embedding model to evaluate similarity allows the range of possible scores to be continuous and, as a result, introduces fine-grained distinctions between similar translations. The result is better performance on both human evaluations and BLEU score, along with faster convergence during training.
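The minimum risk training idea can be sketched as an expected-cost computation over a handful of candidate translations. This is a generic formulation under stated assumptions, not the thesis's exact objective: the cost is taken to be one minus an embedding similarity score in [0, 1], the candidate distribution is a softmax over scaled model log-probabilities, and the function name `expected_risk` and the parameter `alpha` are hypothetical.

```python
import math


def expected_risk(log_probs, similarities, alpha=1.0):
    """Expected cost over candidates, where cost = 1 - similarity to the reference.

    log_probs: model log-probabilities of each candidate translation.
    similarities: similarity of each candidate to the reference, assumed in [0, 1].
    alpha: sharpness of the renormalized candidate distribution.
    """
    scaled = [alpha * lp for lp in log_probs]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - max_scaled) for s in scaled]
    normalizer = sum(weights)
    return sum((w / normalizer) * (1.0 - sim) for w, sim in zip(weights, similarities))


# Two candidates with equal model probability; the first matches the reference better.
print(expected_risk([0.0, 0.0], [1.0, 0.0]))
# Shifting probability mass toward the better candidate lowers the expected risk.
print(expected_risk([0.0, -5.0], [1.0, 0.0]))
```

Minimizing this quantity pushes probability toward candidates that the similarity metric prefers. A continuous similarity score can separate two near-identical candidates that a count-based metric might score the same, which is the fine-grained distinction described above.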
A copy of the defense thesis can be found here.