Carnegie Mellon University

Big Multilinguality for Data-Driven Lexical Semantics

Natural Language Processing and Computational Linguistics

By Christopher Dyer & Noah Smith

This project goes beyond traditional natural language processing approaches by extending the types of context used to construct semantic vectors. First, it combines translation contexts (i.e., words readily available in multilingual parallel corpora) with the contexts drawn from traditional monolingual corpora. This allows evidence to be shared across languages, most importantly from resource-rich languages with large corpora to resource-poor languages. Second, it incorporates social context inferable from social network platforms, captured through author, time, geographic, and social-connection metadata. Taken together, these additional features give a broader definition of a word's context and lead to a more unified distributional approach to modeling human language, moving in the direction of a language-independent semantics.
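As a rough illustration of how translation contexts can add evidence to a distributional vector, the following minimal sketch builds count-based vectors from a toy monolingual corpus and a toy list of word-aligned translation pairs. It is not the project's actual model: the function names (`context_vectors`, `cosine`), the `mono:`/`trans:` feature prefixes, the window size, and the example data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, aligned_pairs, window=1):
    """Build sparse count vectors from two kinds of context.

    sentences:     list of token lists (monolingual corpus).
    aligned_pairs: list of (source_word, translated_word) tuples, assumed to
                   come from a word-aligned parallel corpus (hypothetical input).
    """
    vectors = defaultdict(Counter)
    # Monolingual contexts: neighbouring words within a fixed window.
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word]["mono:" + tokens[j]] += 1
    # Translation contexts: the words a term is aligned to in another language.
    for source_word, translated_word in aligned_pairs:
        vectors[source_word]["trans:" + translated_word] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (invented): two near-synonyms that never share a monolingual context.
sentences = [["a", "sofa", "here"], ["the", "couch", "there"]]
aligned_pairs = [("sofa", "canapé"), ("couch", "canapé")]

mono_only = context_vectors(sentences, [])
with_translation = context_vectors(sentences, aligned_pairs)

sim_mono = cosine(mono_only["sofa"], mono_only["couch"])
sim_both = cosine(with_translation["sofa"], with_translation["couch"])
print(sim_mono, sim_both)
```

In this toy setting the two words share no monolingual neighbours, so their similarity is zero until the shared translation is added as a context feature. Social metadata such as author or location could be added as further prefixed features in the same way, though how the project actually weights or learns from these contexts is not specified here.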