Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams

Jurgita Kapočiūtė-Dzikienė; Ligita Šarkutė; Andrius Utka

Title	Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams
Translation of Title	Corpus of transcribed parliamentary speeches for authorship attribution and author profiling tasks.
Authors	Kapočiūtė-Dzikienė, Jurgita ; Šarkutė, Ligita ; Utka, Andrius
Is Part of	Kalbotyra : mokslo darbai = Linguistics. Vilnius : Vilniaus universitetas. 2014, T. 66, p. 27-45. ISSN 1392-1517
Keywords [eng]	Lithuanian parliamentary speeches ; Author profiling task ; Authorship attribution ; Corpus
Abstract [eng]	This paper presents a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to different authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within the 7 parliamentary terms between March 10, 1990 and December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution tasks; they are also grouped by age, gender and political views, which makes them suitable for author profiling tasks. Because short texts make an author's speaking style harder to recognize and easier to confuse with the style of other authors, only texts of at least 100 words were included in the corpus. To make each category as comprehensive and representative as possible, only authors who delivered at least 200 speeches were included. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We also demonstrate that the corpus can be used effectively in authorship attribution and author profiling tasks with supervised machine learning methods. The corpus structure also permits its use with unsupervised machine learning methods, for the creation of rule-based methods, and in different linguistic analyses.
Published	Vilnius : Vilniaus universitetas
Type	Journal article
Language	Lithuanian
Publication date	2014
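
The two inclusion criteria named in the abstract (texts of at least 100 words, authors with at least 200 speeches) and the character n-gram tokenization can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `(author, text)` pair layout, whitespace word counting, the function names, the choice to apply the length filter before counting speeches per author, and the default n-gram size of 3 are all assumptions made for illustration; the record does not specify them.

```python
from collections import Counter

MIN_WORDS = 100      # minimum text length stated in the abstract
MIN_SPEECHES = 200   # minimum speeches per author stated in the abstract


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Keep long-enough speeches by sufficiently prolific authors.

    `speeches` is a list of (author, text) pairs. This layout, and the
    whitespace-based word count, are assumptions for illustration only.
    """
    # Step 1: drop texts shorter than the word threshold.
    long_enough = [(a, t) for a, t in speeches if len(t.split()) >= min_words]
    # Step 2 (assumed order): count remaining speeches per author and
    # drop authors who fall below the speech-count threshold.
    counts = Counter(a for a, _ in long_enough)
    return [(a, t) for a, t in long_enough if counts[a] >= min_speeches]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Under these assumptions, `char_ngrams("seimas")` yields the trigrams `sei`, `eim`, `ima`, `mas`, and `filter_corpus` can be called with smaller thresholds to try it on toy data.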
