For evaluation purposes, participants are expected to produce XML output as indicated for each task. In all cases the scores will be calculated using the IDs and start-end positions of the elements, not the string given in the value attribute. This means that in the evaluation of task 1 the results will be the same regardless of whether the output is
<pair id="p6">
<pronoun id="62" value=" it"/>
<antecedent id="4" value=" the Palestinian Authority"/>
</pair>
or
<pair id="p6">
<pronoun id="62"/>
<antecedent id="4"/>
</pair>
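The point that only IDs matter can be illustrated with a minimal sketch. It assumes the output is parsed with Python's standard xml.etree.ElementTree module; the helper name pair_ids is hypothetical and is not part of any official scorer.

```python
import xml.etree.ElementTree as ET


def pair_ids(xml_text):
    """Return (pair id, pronoun id, antecedent id) for one <pair> element.

    Hypothetical helper: the value attributes are never read, so their
    presence or absence cannot change the result.
    """
    pair = ET.fromstring(xml_text)
    return (
        pair.get("id"),
        pair.find("pronoun").get("id"),
        pair.find("antecedent").get("id"),
    )


with_values = (
    '<pair id="p6">'
    '<pronoun id="62" value=" it"/>'
    '<antecedent id="4" value=" the Palestinian Authority"/>'
    '</pair>'
)
without_values = '<pair id="p6"><pronoun id="62"/><antecedent id="4"/></pair>'

# Both forms reduce to the same ID triple.
same = pair_ids(with_values) == pair_ids(without_values)
```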
Task 1 will be evaluated using success rate. Because in task 1 the pronouns to be resolved and the candidates are known, success rate is calculated as the number of pronouns correctly resolved divided by the total number of pronouns to be resolved.
The anaphora resolution method can select any entity from the coreferential chain. For each pronoun to be resolved (i.e. from the input file) the following scores are given:
- 0: the pronoun is not correctly resolved to an entity from the coreferential chain
- 0.5: a pronoun from the coreferential chain is selected as antecedent, but the program failed to resolve that pronoun to a correct NP (this case acknowledges that the program selected a correct antecedent and therefore detected the identity of the mentioned entities, even though it has no knowledge of who they actually are)
- 1: the pronoun is correctly resolved to an entity from the chain. If a pronoun is selected as antecedent, this score is given only when there is at least one antecedent in the co-reference chain which is non-pronominal.
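The three scores above can be sketched as follows. This is one possible reading, not the official scorer. It makes two assumptions: when a pronoun is selected as antecedent, the program's own links are followed until a non-pronominal mention is reached, and the success rate is the sum of the per-pronoun scores divided by the number of pronouns. The function names and the IDs "30" and "9" are hypothetical.

```python
def score_pronoun(pronoun_id, system_links, gold_chain, pronoun_ids):
    """Score one pronoun as 0, 0.5 or 1 (hypothetical helper).

    system_links: mention ID -> antecedent ID chosen by the program
    gold_chain:   set of mention IDs in the pronoun's gold coreferential chain
    pronoun_ids:  set of mention IDs that are pronouns
    """
    selected = system_links.get(pronoun_id)
    if selected is None or selected not in gold_chain:
        return 0.0  # not resolved to an entity from the chain
    visited = {pronoun_id}
    current = selected
    # Assumption: follow the program's own links through pronouns
    # until a non-pronominal mention is reached.
    while current in pronoun_ids:
        if current in visited:
            return 0.5  # the links loop without reaching an NP
        visited.add(current)
        current = system_links.get(current)
        if current is None or current not in gold_chain:
            return 0.5  # the intermediate pronoun was not resolved correctly
    return 1.0


def success_rate(pronouns, system_links, gold_chains, pronoun_ids):
    """Average score over the pronouns to be resolved (assumed aggregation)."""
    scores = [
        score_pronoun(p, system_links, gold_chains[p], pronoun_ids)
        for p in pronouns
    ]
    return sum(scores) / len(scores) if scores else 0.0


chain = {"4", "30", "62"}
gold_chains = {"62": chain, "30": chain}
pronoun_ids = {"30", "62"}
all_correct = success_rate(["62", "30"], {"62": "30", "30": "4"}, gold_chains, pronoun_ids)
one_broken = success_rate(["62", "30"], {"62": "30", "30": "9"}, gold_chains, pronoun_ids)
```

In the second call, pronoun 62 receives 0.5 because its pronominal antecedent 30 is itself resolved outside the chain, and pronoun 30 receives 0.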
Task 2 will be evaluated using precision, recall and f-measure. These are calculated using the MUC scores as defined in (Vilain et al., 1995).
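As an illustration, the link-based MUC scores are commonly computed as in the sketch below. It assumes that chains are given as sets of mention IDs and that mentions missing from the other side count as singletons. The function names are hypothetical and this is not the official scorer.

```python
def muc_recall(key_chains, response_chains):
    """Link-based recall: sum(|K| - |p(K)|) / sum(|K| - 1) over key chains K,
    where p(K) is the partition of K induced by the response chains."""
    mention_to_chain = {
        mention: index
        for index, chain in enumerate(response_chains)
        for mention in chain
    }
    numerator = 0
    denominator = 0
    for chain in key_chains:
        touched = set()   # response chains that contain a mention of this chain
        unlinked = 0      # mentions absent from the response (singleton parts)
        for mention in chain:
            if mention in mention_to_chain:
                touched.add(mention_to_chain[mention])
            else:
                unlinked += 1
        numerator += len(chain) - (len(touched) + unlinked)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_scores(key_chains, response_chains):
    """Return (precision, recall, f-measure); precision swaps the two roles."""
    recall = muc_recall(key_chains, response_chains)
    precision = muc_recall(response_chains, key_chains)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Gold chain {A, B, C, D} split by the program into {A, B} and {C, D}.
precision, recall, f_measure = muc_scores(
    [{"A", "B", "C", "D"}],
    [{"A", "B"}, {"C", "D"}],
)
```

Here one link is missing from the response, so recall is 2/3, while both response links are correct, so precision is 1.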
Task 3 is evaluated using modified versions of precision and recall. In this task the pronouns to be resolved are not indicated in the input file, so non-referential pronouns need to be filtered out. This makes it necessary to use precision and recall. Because the candidates are not known, there may not be a perfect match between the entities in the gold standard and those identified by the program. For this reason we introduced the following overlap measure between two strings:
overlap(Str1, Str2) = length(overlap string) / max(length(Str1), length(Str2))
- length(overlap string) represents the length in words of the string resulting from the overlap of the two strings.
- max(length(Str1), length(Str2)) represents the length in words of the longer of the two strings
For example, the overlap between "the government of Zair" and "Zair's government" is 0, whereas the overlap between "the government of Zair" and "the government" is 0.5.
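The overlap measure can be sketched as below, reading it as the length of the overlap divided by the length of the longer string. Because the scores are computed from start-end positions, the sketch assumes each string is given as an inclusive (start, end) span of word positions. The function name and the word positions in the example are hypothetical, chosen only to reproduce the two values above.

```python
def overlap(span1, span2):
    """Overlap of two spans given as inclusive (start, end) word positions.

    Assumption: lengths are counted in words and the overlap string is the
    stretch of text covered by both spans.
    """
    start1, end1 = span1
    start2, end2 = span2
    common = max(0, min(end1, end2) - max(start1, start2) + 1)
    longest = max(end1 - start1 + 1, end2 - start2 + 1)
    return common / longest


# "the government of Zair" at words 10-13, "the government" at words 10-11,
# and "Zair's government" elsewhere in the text at words 20-21.
partial = overlap((10, 13), (10, 11))
disjoint = overlap((10, 13), (20, 21))
```

Under this reading, two strings that share a word but occupy different positions in the text have an overlap of 0.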
To calculate precision and recall the following formulae are used:
- str1 is a string present in the automatic output
- str2 is a string from the gold standard which maximises the overlap score
As in task 1, if a pronoun is resolved to another pronoun, the score is 1 if there is at least one antecedent in the co-reference chain which is non-pronominal, and 0.5 if there is no non-pronominal element in the chain or one of the pronouns in the chain is not correctly resolved.
Task 4 will be evaluated using precision, recall and f-measure. These are calculated using a modified version of the metrics proposed in (Vilain et al., 1995). Instead of counting the number of common pairs, the versions we use apply the overlap metric proposed for task 3. This means that when a pair is compared, the overlap between its elements is calculated.