Abstract [eng] |
When automatically extracting information from text, it is important to identify different mentions of the same entity or object and to aggregate the semantic information near these mentions. When different mentions in a text refer to the same object, they corefer. The process of determining these relationships (usually with the help of computer programs) is called coreference resolution. Most coreference resolution approaches have been developed for English and other major languages, while smaller languages, such as Lithuanian, have received little attention. This dissertation analyses the problem of coreference resolution for English, Lithuanian and languages related to Lithuanian. Coreference resolution for Lithuanian is approached comprehensively, with the aim of producing not only a working resolution method but also a foundation for creating and improving new solutions. A four-level annotation scheme has been created that specifies what should be resolved and how, while preserving more linguistic information. The first Lithuanian coreference corpus (LCC) was created. A rule-based method for resolving coreferences in Lithuanian was developed. The rules were tested on related languages and formalized using first-order predicate logic. The corpus and the resolution method both use this annotation scheme. The proposed new evaluation methodology builds on the advantages of the annotation scheme: when evaluating resolution approaches, it takes into account not only which coreferences were resolved but also the quality of the resolution. |