Search Solutions Tutorial on Natural Language Processing.

Dr Michael Oakes

Search Solutions is an annual event run by the Information Retrieval Specialist Group, the section of the British Computer Society which has a special interest in search engines. This year it took place on Wednesday 24th November, and was held online, with invited speakers from industry talking about their work in information retrieval. The British Computer Society has new offices at 25 Copthall Avenue in London, near the Bank of England. On the day before, a series of tutorials designed to introduce people to related topics was held, including one on search engine evaluation given by Ingo Frommholz from our own computing department.

I gave the tutorial on Natural Language Processing. Unlike the others, it was an all-day event and was held face-to-face. Having had experience of online teaching during the pandemic, I know that I prefer the closer interaction with students which comes with face-to-face teaching.

The contents of the tutorial were almost the same as the first three weeks of lectures that I give on the MA Computational Linguistics module in RIILP. I used the structure of the textbook by Jurafsky and Martin as a skeleton, but brought in other things such as the practical exercises from the Edinburgh Textbooks in Empirical Linguistics on stemming and automatic part-of-speech tagging. Stemming covers techniques for treating different grammatical forms of a word as related to each other, and part-of-speech tagging assigns a part-of-speech category (such as noun or verb) to each word in the input sentence. I used the first edition of Jurafsky and Martin to open the discussion with a short dialogue between Dave the astronaut and HAL the computer from the film “2001: A Space Odyssey”. What natural language techniques would HAL need to know to carry out this conversation?
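As a rough illustration of the stemming idea described above, here is a minimal Python sketch. It is a toy suffix stripper written for this post, not the Porter algorithm and not the exercise used in the tutorial; the suffix list, the minimum stem length and the function names `toy_stem` and `group_by_stem` are all assumptions made for the example.

```python
# Toy suffix-stripping stemmer (illustrative assumption only; real stemmers
# such as Porter's use many more rules and conditions).
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def group_by_stem(words):
    """Group word forms that share the same toy stem."""
    groups = {}
    for word in words:
        groups.setdefault(toy_stem(word), []).append(word)
    return groups


if __name__ == "__main__":
    print(group_by_stem(["walk", "walks", "walked", "walking"]))
```

With these rules the four forms of "walk" fall into a single group, which is the sense in which a stemmer regards them as related. A rule set this small will make mistakes on many English words, which is one reason practical stemmers are considerably more elaborate.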

At the event, I was pleased to see some old friends in the audience, including Ingo in the morning, before his own workshop began.  

More details are available at:


Winners of the Vice-Chancellor’s Awards for Staff Excellence

Earlier this week the Vice-Chancellor’s Awards for Staff Excellence took place. Our Admin Team – April, Suman, Kate, Amanda and Emma – were nominated and won their category of ‘Excellence in partnerships’. If you were unable to join the event, you can watch it back on YouTube. The staff awards brochure, which includes an overview of all the shortlisted nominees, can now also be viewed online.

Excellence in partnerships 

“An individual or a team that has demonstrated outstanding commitment and professionalism through partnership working, to a high-quality service to our students, staff or stakeholders”. 

Winners: RIILP Administrative Team: April Harper, Amanda Bloore, Suman Hira, Kate Wilson, Emma Franklin for supporting the institutional research trajectory by providing infrastructure, systems and processes as well as a positive attitude.

Congratulations, ladies!

Technologies for Translation and Interpreting: Challenges and Latest Developments

Dr Maria Kunilovskaya, University of Wolverhampton

05 November 2021

Title: Human Translation Quality Estimation and Translationese


In the first part of the talk I will present a fairly novel NLP task of human translation quality estimation (HTQE) and discuss problems associated with benchmarking human translation quality. How far do human assessors agree on (human) translation quality? What types of labels/scores can be used to reflect quality? What are the existing approaches to predicting these labels? If a professional jury in a translation contest manages to achieve agreement on the top-ranking and, especially, the bottom-ranking translations (with possible fine-grained disagreements about the exact ranks), what does it take to teach a machine to distinguish between good and bad translations? Such a model can be applied in educational and certification contexts for filtering out translations that are definitely below the expected standard, reducing the workload for human assessors. The second part of the talk will explore the concept of translationese and its potential for learning human translation quality. Do you expect good translations to read smoothly and naturally, as if originally written in the target language? Can we use the distance between translations and the expected target language norm to measure translation quality? I will largely draw on the findings reported in our latest publications:

  • Kunilovskaya, M., & Corpas Pastor, G. (2021). Translationese and register variation in English-to-Russian professional translation. In L. Lim, D. Li, & V. Wang (Eds.), New Perspectives on Corpus Translation Studies. Springer.
  • Kunilovskaya, M., Lapshinova-Koltunski, E., & Mitkov, R. (2021). Translationese in Russian Literary Texts. Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. EMNLP.


Maria Kunilovskaya has been engaged in translator education for more than 10 years in her role as an Associate Professor in Contrastive Linguistics and Translation Studies at the University of Tyumen, Russia. Lecturing in Translation Studies, Corpus Linguistics and Text Linguistics, she has also been involved in teaching practical translation classes. She is a strong believer in promoting practical corpus skills that can be immediately applied in the everyday activities of a language professional. Her research interests include the construction and exploitation of parallel corpora, and corpus-based research into translation competence and translationese, most recently with a strong pull towards computational research methods, especially in the area of human translation quality estimation.

Technologies for Translation and Interpreting: Challenges and Latest Developments

Dr Laura Mejías Climent, Jaume I University

29 October 2021

Title: A technological approach to audiovisual translation: How to localize a video game


New technologies have brought about the emergence of modern forms of audiovisual entertainment. In this current and technologized landscape, localization has become a key industry to ensure that all kinds of digital, multimedia and multimodal products reach markets different from the one where the product was originally developed. It is a complex process encompassing the adaptation of the product at different levels: not only the linguistic level but also the technical, legal and aesthetic ones. Localization is typically used to modify software products, video games and website content. These groups share aspects such as the digital and technological nature of the products and their added interactive dimension. The process of localization in each group is also similar to a certain extent. Nonetheless, some differences can be noticed when analyzing the processes thoroughly. In this context, this presentation aims to describe the particularities that localization entails when dealing with video games and their audiovisual assets. To do so, the concept of video games as multimodal and technological products will be reviewed, as well as some key aspects of the localization industry, focusing on the adaptation of audiovisual contents requiring some form of audiovisual translation (dubbing or subtitling).


Laura Mejías-Climent holds a PhD in Translation (Universitat Jaume I) and works as an Assistant Professor and researcher (group TRAMA) at the same university. She has taught at the Universidad Pablo de Olavide and ISTRAD (both in Sevilla), and teaches at the Universidad Europea (Valencia). She has worked as a translation project manager and a professional translator specialized in audiovisual translation and localization. She has also taught in the USA thanks to a Fulbright scholarship. In addition to her PhD, she holds a Master’s Degree in audiovisual translation and a Master’s Degree in translation and new technologies, and has completed the Master’s Degree in Secondary Education and Languages. Her lines of research focus on Descriptive Translation Studies (translation for dubbing and video game localization), and she is currently involved in a research project combining machine translation and dubbing.


Natural Language Processing

George Chrysostomou, The University of Sheffield

25 October 2021

Title: Improving Explanations for Model Predictions


Large neural models dominate benchmarks of natural language understanding tasks. Their achievements have led to increasing adoption in critical areas such as health and law. A significant drawback of these models is their highly parameterized architecture, which makes their predictions hard to interpret. Previous work has introduced approaches for generating rationales for model predictions (e.g. using feature attribution). However, how accurately these approaches explain the reasoning behind a model’s prediction has only recently been studied. This seminar will introduce three studies which aim to improve explanations for model predictions: (1) Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification (published at ACL 2021); (2) Towards Better Transformer-based Faithful Explanations with Word Salience (published at EMNLP 2021); (3) Instance-level Rationalization of Model Predictions (under review at AAAI 2021).


George Chrysostomou is a PhD student at the University of Sheffield, supervised by Dr. Nikolaos Aletras and Dr. Mauricio Alvarez. His research interests lie in improving explanations for model predictions in Natural Language Processing. Before pursuing his doctoral studies, he completed a master's degree in Data Analytics at the University of Sheffield.


Machine Learning/Deep Learning

Dr Yuval Pinter, Ben-Gurion University of the Negev, Israel

12 October 2021

Title: Challenging and Adapting NLP Models to Lexical Phenomena


Over the last few years, deep neural models have taken over the field of natural language processing (NLP), brandishing great improvements on many of its sequence-level tasks. But the end-to-end nature of these models makes it hard to figure out whether the way they represent individual words aligns with how language builds itself from the bottom up, or how lexical changes in register and domain can affect the untested aspects of such representations.

In this talk, I will present NYTWIT, a dataset created to challenge large language models at the lexical level, tasking them with identification of processes leading to the formation of novel English words, as well as with segmentation and recovery of the class of novel blends. I will then present XRayEmb, a method which alleviates the hardships of processing these novelties by fitting a character-level encoder to the existing models’ subword tokenizers; and conclude with a discussion of the drawbacks of current tokenizers’ vocabulary creation schemes.


Yuval Pinter is a Senior Lecturer in the Department of Computer Science at Ben-Gurion University of the Negev, focusing on NLP. Yuval got his PhD at the Georgia Institute of Technology School of Interactive Computing as a Bloomberg Data Science PhD Fellow. Before that, he worked as a Research Engineer at Yahoo Labs and as a Computational Linguist at Ginger Software, and obtained an MA in Linguistics and a BSc in CS and Mathematics, both from Tel Aviv University. Yuval blogs (in Hebrew) about language matters on Dagesh Kal.