An Automatic System to Build Resource Databases for Researchers

The goal of this project is to produce a system that mines the Internet for information about resources in different research domains and compiles it into a database accessible online to anyone who wishes to use it. The system is designed to help researchers, students, lecturers, project managers, and even industrial companies and funding agencies.

Motivation
Research questions
The system to be implemented

People

Prof. Ruslan Mitkov

Project Manager

Richard Evans

Research Fellow

Dr. Viktor Pekar

Research Fellow



Motivation

In recent years, the Internet has developed rapidly, leading to an explosion in the amount of information that one can obtain. However, having too much information is a problem in itself: it becomes very difficult to find a particular piece of information previously produced by another researcher. General-purpose search engines are often too crude, or require too much specialised knowledge, to locate the sought information effectively.

Given that general-purpose tools are not sufficient for researchers, specifically designed search tools should be used instead. The purpose of this project is to build a system that will enable researchers to easily locate the resources they require. Here, a resource may be a program which performs a certain task, a corpus with certain characteristics, the transcription of a speech, etc. Researchers who cannot locate a resource are often forced to recreate it, resulting in a significant waste of time and effort. This waste could be avoided if a comprehensive list of existing resources were available. Such lists can be built by hand from announcements on mailing lists, but this is a tedious process which often leads to incomplete lists.

The first implementation of the system will be tuned to the field of computational linguistics because we are most familiar with this area. We are confident that this will not restrict the generality of the system, and that it will be applicable to any domain with minimal changes.

Research questions

This project will investigate the extent to which methods from computational linguistics can be used to automatically compile lists of resources. In addition to the system to be developed, the project will provide new insights into the discourse structure of email messages and web pages. It will also create a corpus of emails and web pages containing information about resources, annotated for inter- and intra-document coreference and for important notions such as names.

Emerging fields of computational linguistics, such as cross-document coreference and multi-document summarisation, will be investigated. A new evaluation methodology will also be developed.

Templates will be used to encode information about resources. Templates are normally built by experts, which makes them expensive. A semi-automatic, corpus-based template acquisition process will be sought in this project.

All the modules to be developed in this project will be tuned to process web pages. This will be beneficial for other researchers processing similar texts.

The system to be implemented

First, the system will linguistically analyse messages from mailing lists. When a message is identified as announcing a resource, all the documents referenced in the message will be downloaded for processing. The resource will be classified into one or more domain-specific types by reference to the significant terms identified in the announcing document. In computational linguistics, these types may include programs, corpora and frequency lists.
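This first stage can be sketched roughly as follows. It is a minimal illustration and not the project's implementation: the cue phrases, the term lists and the function names (`is_announcement`, `extract_urls`, `classify_resource`) are all assumptions, and plain keyword matching stands in for the linguistic analysis the project proposes.

```python
import re

# Hypothetical cue phrases suggesting that a message announces a resource.
ANNOUNCEMENT_CUES = ("announce", "now available", "released", "can be downloaded")

# Hypothetical significant terms for each resource type in computational linguistics.
TYPE_TERMS = {
    "program": {"tagger", "parser", "software", "tool", "toolkit"},
    "corpus": {"corpus", "corpora", "treebank", "annotated"},
    "frequency list": {"frequency", "wordlist"},
}

URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+")


def is_announcement(message: str) -> bool:
    """Return True if the message contains any of the assumed cue phrases."""
    text = message.lower()
    return any(cue in text for cue in ANNOUNCEMENT_CUES)


def extract_urls(message: str) -> list:
    """Collect the URLs mentioned in a message, dropping trailing punctuation."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(message)]


def classify_resource(text: str) -> list:
    """Return every resource type whose assumed term list overlaps the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(rtype for rtype, terms in TYPE_TERMS.items() if words & terms)
```

A resource can receive more than one type here, which mirrors the "one or more domain-specific types" mentioned above. Downloading the documents behind the extracted URLs is left out of the sketch.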

The names of the resource, its authors, and other related resources must be identified. This process, called named entity recognition (NER), is a challenging task. An NER system specially tuned for web pages will be developed. It will identify common types of named entities as well as the names of different resources. Recognition of these latter entities is a little-investigated area.

The templates containing the details about a resource will have to be filled in with pieces of information from more than one web page. Multi-document summarisation will be useful in combining such complementary information. Cross-document coreference will make it possible to determine whether the same entity is referred to in more than one document. In this way it will be possible to link resources that have the same author, have been produced at the same institution, or require the use of the same external resource. Simple generation techniques will be used to produce the information presented about a resource.
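The idea of filling one template from several pages and then linking resources through shared entities can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `ResourceTemplate` slots, the helper names, and in particular `normalise`, a crude string normalisation that stands in for real cross-document coreference resolution.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceTemplate:
    """Hypothetical template; the real slots would be acquired from corpora."""
    name: str
    authors: set = field(default_factory=set)
    institution: Optional[str] = None
    requires: set = field(default_factory=set)


def normalise(name: str) -> str:
    """Crude stand-in for coreference: lower-case, drop punctuation and titles."""
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(t for t in tokens if t not in {"dr", "prof"})


def merge(partials):
    """Combine partial templates for one resource extracted from several pages."""
    merged = ResourceTemplate(name=partials[0].name)
    for p in partials:
        merged.authors |= {normalise(a) for a in p.authors}
        merged.requires |= p.requires
        # Keep the first institution found; later pages only fill a gap.
        merged.institution = merged.institution or p.institution
    return merged


def link_by_author(templates):
    """Map each author to the resources they share, keeping only real links."""
    index = defaultdict(set)
    for t in templates:
        for author in t.authors:
            index[normalise(author)].add(t.name)
    return {a: sorted(names) for a, names in index.items() if len(names) > 1}
```

The same indexing approach would apply to institutions or required external resources. Conflicting values for a slot, which a real system would have to resolve, are simply handled here by keeping the first one seen.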

The information gathered about a resource will be stored in a database which will be globally accessible. Standard inter-archive communication protocols will be implemented in order to make it possible to search for this information from other archives.
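As a rough illustration of the storage step, the sketch below keeps resource records in an SQLite database using Python's standard `sqlite3` module. The table layout and the function names (`create_store`, `add_resource`, `find_by_type`) are assumptions, since the project description does not specify a schema, and the inter-archive protocols are not modelled at all.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) the database and make sure the assumed table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resource ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, type TEXT NOT NULL, url TEXT)"
    )
    return conn


def add_resource(conn, name, rtype, url=None):
    """Insert one resource record and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO resource (name, type, url) VALUES (?, ?, ?)",
            (name, rtype, url),
        )
    return cur.lastrowid


def find_by_type(conn, rtype):
    """Return the names of all stored resources of the given type, sorted."""
    rows = conn.execute(
        "SELECT name FROM resource WHERE type = ? ORDER BY name", (rtype,)
    )
    return [name for (name,) in rows]
```

A single table keeps the sketch short. A resource with several types or authors would need additional tables in practice.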

Papers

Resources