Automated Archiving for an Institutional Repository
Overview
Manual population of institutional repositories with citation data is an extremely time- and resource-intensive process. These costs act as a bottleneck on the fast uptake of large repositories. The challenge has long been recognised, and a number of research projects have attempted to develop technology for unifying disjoint repositories and for managing and re-using them efficiently. Nonetheless, these technologies still fail to address the main problem in the construction and upkeep of bibliographic repositories: the discovery of new citation data in large text collections.
The JISC-funded project "Automated Archiving for an Institutional Repository" (AIR) aims to develop an information extraction system for the speedy discovery and extraction of bibliographical data from semi-structured text. The system is integrated with the WIRE ("Wolverhampton Intellectual Repository and E-theses") repository, but it is designed so that institutions using different data encoding standards can easily adopt the software. The project was carried out jointly by the Research Institute in Information and Language Processing (ILP) and the Learning Information Services (LIS) of the University of Wolverhampton.
The project started in September 2007 and finished in April 2009.
Aims and objectives
The project investigates the degree to which the population of institutional repositories can be automated, in order to maximise the speed of human-supervised compilation of the data while maintaining its high quality. Employing state-of-the-art methods from Natural Language Processing and Information Retrieval, the project designs a software architecture that helps a user to:
locate relevant documents on the institutional website,
extract bibliographical entries from them,
extract information from each entry and tag it with Dublin Core metadata tags such as Author, Title, and Year.
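As a rough illustration of the last step, the sketch below tags one bibliographical entry with Dublin Core elements. It is not the AIR system's extraction method: it assumes a single hypothetical citation layout ("Authors (Year). Title. Rest") and uses a plain regular expression. The function name `tag_entry` and the sample entry are invented for this example. Author, Title and Year are mapped to the Dublin Core elements `creator`, `title` and `date`.

```python
import re
from typing import Optional

# Assumed layout (hypothetical): "Authors (Year). Title. Rest of the entry"
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.?\s*(?P<title>[^.]+)\."
)

# Separators assumed between author names: ";", "and", "&"
AUTHOR_SPLIT_RE = re.compile(r"\s*(?:;|\band\b|&)\s*")


def tag_entry(entry: str) -> Optional[dict]:
    """Return Dublin Core style fields for one entry, or None if the
    entry does not follow the assumed layout."""
    match = CITATION_RE.match(entry.strip())
    if match is None:
        return None
    authors = [
        name.strip()
        for name in AUTHOR_SPLIT_RE.split(match.group("authors"))
        if name.strip()
    ]
    return {
        "dc:creator": authors,
        "dc:title": match.group("title").strip(),
        "dc:date": match.group("year"),
    }


# Made-up entry, used only to show the shape of the output.
sample = "Smith, J. and Jones, A. (2008). A made-up example title. Journal of Examples, 1(2), 3-4."
print(tag_entry(sample))
```

Entries that do not match the assumed layout return None, which in a human-supervised workflow would be the cases passed to the user for manual correction.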
The ILP research staff is responsible for the research and development activities on the project, which is concerned with methods to locate relevant web documents on an institutional server and extract and verify bibliographical metadata from them. These methods are implemented in four major services of the system: a web crawler, an information extraction component, a user interface and the WIRE interfacing component.
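To make the crawler's role more concrete, here is a minimal sketch of one task such a component might perform: collecting the links on a page that stay on the institutional host. This is an assumption about how a crawler could be restricted to one server, not a description of the AIR crawler itself; the class `LinkCollector`, the function `same_site_links` and the example URLs are hypothetical. Deciding which of the collected pages are relevant is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(html: str, base_url: str) -> list:
    """Return de-duplicated links that point to the same host as base_url."""
    host = urlparse(base_url).netloc
    collector = LinkCollector(base_url)
    collector.feed(html)
    kept = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in kept:
            kept.append(link)
    return kept


# Hypothetical page content and URLs, for illustration only.
page = (
    '<a href="/staff/pubs.html">Publications</a>'
    '<a href="http://other.example.org/x">External</a>'
    '<a href="/staff/pubs.html">Publications again</a>'
)
print(same_site_links(page, "http://www.example.ac.uk/index.html"))
```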
The LIS staff is responsible for specifying user needs during development of the web interfaces, overseeing trials within the University of Wolverhampton (by testing automated population of the existing WIRE repository), and supporting and promoting the AIR system to the research community.
raise the profile of the University's research by increasing the likelihood of citation,
provide opportunities for ILP researchers to gain experience in knowledge transfer,
free up LIS staff time by introducing a mediated deposition process,
develop the relationship between LIS and ILP, which may lead to further co-operation on advanced information access technologies.
The main concrete outcome is the software architecture, which is integrated with the WIRE repository but easily customisable for other repositories that use different data encoding standards. The source code and user documentation are located on the AIR development page.
Project team
Prof. Dr. Ruslan Mitkov, Professor of Computational Linguistics and Language Engineering, and Director of the Research Institute in Information and Language Processing
Dr. Viktor Pekar, University of Wolverhampton, ILP (at the beginning of the project)
Natalia Ponomareva, Research Fellow at the University of Wolverhampton, ILP
Dr. Jose Manuel Gomez, guest Research Fellow at the University of Wolverhampton, ILP (in the final stage of the project)
Javier Espinosa de los Monteros, part-time Research Assistant at the University of Wolverhampton, ILP
John Dowd, Hybrid Services Manager at the University of Wolverhampton, LIS
Frances Machell, Hybrid Collections Co-ordinator at the University of Wolverhampton, LIS
Alison Robinson, University of Wolverhampton, LIS
Roswita Burns, University of Wolverhampton, LIS (in the final stage of the project)
Contact details
Research Institute in Information and Language Processing
University of Wolverhampton
Wolverhampton, WV1 1SB
United Kingdom