Publications

2008

Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev and Andrea Mulloni. 2008. Finding Translations for Low-Frequency Words in Comparable Corpora. Machine Translation, Amsterdam, Springer. (digital copy available, hard copy to appear)


2007

Viktor Pekar, Diana Inkpen and Andrea Mulloni. 2007. Proceedings of the 1st International Workshop on Acquisition and Management of Multilingual Lexicons, Borovets, Bulgaria.

Andrea Mulloni, Viktor Pekar, Ruslan Mitkov and Dimitar Blagoev. 2007. Semantic Evidence for Automatic Identification of Cognates. Proceedings of the 1st International Workshop on Acquisition and Management of Multilingual Lexicons, Borovets, Bulgaria, 49-54.

Andrea Mulloni. 2007. Automatic Prediction of Cognate Orthography Using Support Vector Machines. Proceedings of the Student Research Workshop, ACL 2007, Prague, Czech Republic, 25-30.

Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev and Andrea Mulloni. 2007. Finding Translations for Low-Frequency Words in Comparable Corpora. Proceedings of the CONTEXT-07 Workshop on "Contextual Information in Semantic Space Models" (CoSMo-2007), Roskilde, Denmark.

2006

Andrea Mulloni and Viktor Pekar. 2006. Automatic Detection of Orthographic Cues for Cognate Recognition. Proceedings of LREC 2006, Genoa, Italy. (poster available)

Committees

Additional Reviewer @ CICLing '08
OC @ AMML '07
PC @ AMML '07
PC @ RANLP '07
PC @ CALP '07
PC @ ACL Student Research Workshop '07
PC @ RANLPMT '07
PC @ EACL Student Research Workshop '07
PC @ NLDB '07

PhD Research Topic

Present-day machine translation technologies depend crucially on the size and quality of lexical resources. Vendors of MT software typically supply a wide range of specialized lexicons as well as tools for their customization. In practice, however, obtaining a lexical repository truly appropriate for a specific task is problematic: it is still very difficult to find a lexicon for the required language pair and/or topic area. As a result, much recent NLP research has focused on finding ways to (semi-)automatically induce lexical knowledge from text corpora.
The overall aim of the proposed thesis project is to investigate the acquisition of bilingual lexicons from non-parallel corpora. In recent years this problem has attracted substantial interest from the NLP community. This interest is motivated by the fact that parallel corpora are seldom available for the required languages and domains; for typologically distant languages (such as English and Chinese), parallel corpora with sufficient alignment resolution are especially hard to obtain. Large monolingual text collections, by contrast, are becoming ever more easily accessible, particularly with the increasing multilinguality of the Web. Finding ways to exploit the semantic potential hidden in these vast amounts of text for various multilingual tasks is an extremely appealing idea.
First, the research project will aim to gain insight into the overall direction of research on comparable corpora. Building on this groundwork, it will establish an extensive knowledge of the scientific background and develop an original approach with an acceptable success rate. The project thus aims to open new routes and to deliver a solution that can serve as a stable starting point for further research. In this context, a crucial task will be the development of a widely applicable method that delivers reliable results across a broad range of text types.
The ultimate aim of this work is to serve both the linguistic and the language business communities in their pursuit of comprehensive, reliable, automated and, above all, fast information retrieval from large amounts of text, in order to deliver highly customizable data that includes the very latest linguistic developments and is easily accessible to everyone.