An Automatic System to Build Resource Databases for Researchers
Description of the system

Firstly, the system will linguistically analyse messages from mailing lists. When a message is identified as announcing a resource, all the documents indicated in that message will be downloaded for processing. The resource will be classified into one or more domain-specific types by reference to the significant terms identified in the announcing document. In computational linguistics, these types may include programs, corpora and frequency lists.

The names of the resource, its authors, and other related resources must be identified. This process, called named entity recognition (NER), is a challenging task. A system for NER specially tuned for web pages will be developed. It will identify common types of named entities as well as the names of different resources. Recognition of these latter entities is a little-investigated area.

The templates containing the details about a resource will have to be filled in with pieces of information from more than one web page. Multi-document summarisation will be useful in combining such complementary information. Cross-document coreference will make it possible to know whether the same entity is referred to in more than one document. In this way it will be possible to link resources that have the same author, have been produced at the same institution, or require the use of the same external resource. Simple generation techniques will be used to generate the information about a resource.

The information gathered about a resource will be stored in a globally accessible database. Standard inter-archive communication protocols will be implemented so that this information can be searched from other archives.
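
As a rough illustration of the classification and template-filling steps described above, the following Python sketch may help. It is not the system's actual design: the cue-term lists, the template fields (`name`, `authors`, `types`) and all function names are hypothetical, and exact string matching on author names stands in for real cross-document coreference.

```python
from collections import defaultdict

# Hypothetical cue terms per resource type. The three types come from the
# description above; the term lists themselves are illustrative assumptions.
TYPE_TERMS = {
    "program": {"software", "tool", "parser", "tagger", "release"},
    "corpus": {"corpus", "corpora", "annotated", "treebank"},
    "frequency list": {"frequency", "wordlist", "counts"},
}


def classify_resource(text, min_hits=1):
    """Return every type whose cue terms occur at least min_hits times."""
    tokens = [t.strip(".,;:()!?\"'").lower() for t in text.split()]
    hits = {
        rtype: sum(tok in terms for tok in tokens)
        for rtype, terms in TYPE_TERMS.items()
    }
    return sorted(rtype for rtype, n in hits.items() if n >= min_hits)


def merge_templates(partials):
    """Combine partial templates (one per web page) into one record per name."""
    merged = {}
    for partial in partials:
        record = merged.setdefault(
            partial["name"],
            {"name": partial["name"], "authors": set(), "types": set()},
        )
        record["authors"].update(partial.get("authors", []))
        record["types"].update(partial.get("types", []))
    return merged


def link_by_author(records):
    """Map each author to the resources they share, if more than one.

    Assumption: identical strings denote the same person. A real system
    would need cross-document coreference to decide this.
    """
    by_author = defaultdict(set)
    for record in records.values():
        for author in record["authors"]:
            by_author[author].add(record["name"])
    return {a: sorted(names) for a, names in by_author.items() if len(names) > 1}
```

A resource may receive several types at once, which matches the "one or more domain-specific types" wording; merging by resource name is the simplest way to pool complementary details found on different pages.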
This page is maintained by Richard Evans.