2008
Viktor Pekar, Ruslan Mitkov,
Dimitar Blagoev and Andrea Mulloni. 2008. Finding Translations for
Low-Frequency Words in Comparable Corpora. Machine Translation,
Amsterdam, Springer. (digital copy available, hard copy to appear)
2007
Viktor Pekar, Diana Inkpen and Andrea Mulloni. 2007. Proceedings of the
1st International Workshop on Acquisition and Management of
Multilingual Lexicons, Borovets, Bulgaria.
Andrea Mulloni, Viktor Pekar, Ruslan Mitkov and Dimitar Blagoev. 2007.
Semantic Evidence for Automatic Identification of Cognates. Proceedings
of the 1st International Workshop on Acquisition and Management of
Multilingual Lexicons, Borovets, Bulgaria, 49-54.
Andrea Mulloni. 2007. Automatic Prediction of Cognate Orthography Using
Support Vector Machines. Proceedings of the Student Research Workshop, ACL 2007, Prague, Czech Republic, 25-30.
Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev and Andrea Mulloni. 2007.
Finding Translations for Low-Frequency Words in Comparable Corpora.
Proceedings of the CONTEXT-07 Workshop on "Contextual Information in
Semantic Space Models" (CoSMo-2007), Roskilde, Denmark.
2006
Andrea Mulloni and Viktor Pekar. 2006. Automatic Detection of
Orthographic Cues for Cognate Recognition. Proceedings of LREC 2006,
Genoa, Italy. (poster available)
Additional Reviewer @ CICLing '08
OC @ AMML '07
PC @ AMML '07
PC @ RANLP '07
PC @ CALP '07
PC @ ACL Student Research Workshop '07
PC @ RANLPMT '07
PC @ EACL Student Research Workshop '07
PC @ NLDB '07
PhD Research Topic
Present-day machine translation technologies crucially depend
on the size and quality of lexical
resources. Vendors of MT software typically supply a wide range of
specialized lexicons as well as tools for their customization. In
practice, however, having a lexical repository truly appropriate for a
specific task is problematic: it is still very difficult to find a
lexicon for the
required language pair and/or topic area. As a result, much recent NLP
research has focused on ways to (semi-)automatically induce lexical
knowledge from text corpora.
The overall aim of the proposed thesis project is to
investigate the problem of acquisition
of bilingual lexicons from non-parallel corpora. In recent years this
problem has attracted substantial interest from the NLP community,
motivated by the fact that parallel corpora are seldom available for the
required languages and domains;
for typologically distant languages (like English and Chinese),
parallel corpora with
sufficient alignment resolution are especially hard to obtain. Large
monolingual text collections, by contrast, are becoming ever more easily
accessible, particularly with the increasing multilinguality of the Web.
Finding ways to exploit the semantic potential hidden in these vast
amounts of text for various multilingual tasks is an extremely appealing
idea.
First, the research project will aim to gain insight into the overall
direction of the comparable corpora community. Building on this ground,
extensive knowledge of the scientific background will be acquired, and
an original approach with an acceptable success rate will be developed.
The project thus aims to open new routes and to deliver a solution that
will serve as a stable starting point for further research. In this
context, a crucial task will be the development of a widely applicable
method that delivers reliable results across a wide range of text types.
The ultimate aim of this work is to serve the linguistic as well as the
language business communities in their pursuit of comprehensive,
reliable, automated and, above all, fast information retrieval from
large amounts of text, in order to deliver highly customizable data that
include the very latest linguistic developments and are easily
accessible to all.