Comparison of modern multilingual text embedding techniques for hate speech detection task

Evaldas Vaičiukynas; Paulius Danėnas; Linas Ablonskis; Algirdas Šukys; Edgaras Dambrauskas; Voldemaras Žitkus; Rita Butkienė; Rimantas Butleris

doi:10.3390/app16105099

Title	Comparison of modern multilingual text embedding techniques for hate speech detection task
Authors	Vaičiukynas, Evaldas ; Danėnas, Paulius ; Ablonskis, Linas ; Šukys, Algirdas ; Dambrauskas, Edgaras ; Žitkus, Voldemaras ; Butkienė, Rita ; Butleris, Rimantas
DOI	10.3390/app16105099
Is Part of	Applied sciences. Basel : MDPI. 2026, vol. 16, iss. 10, art. no. 5099, p. 1-24. ISSN 2076-3417
Keywords [eng]	anomaly detection ; dimensionality reduction ; hate speech detection ; machine learning ; sentence embeddings
Abstract [eng]	Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (gemma, qwen, bge, snow, jina, and e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding type, we train both a one-class histogram-based anomaly detector (HBOS) and a two-class gradient-boosted tree ensemble (CatBoost), with and without Principal Component Analysis (PCA) compression to 32-dimensional feature vectors. Across all datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection, with the best configurations achieving up to 78.8% accuracy (Kappa 0.58, AUC ROC 0.87) in Lithuanian (jina), 92.2% accuracy (Kappa 0.77, AUC ROC 0.97) in Russian (e5), and 76.9% accuracy (Kappa 0.48, AUC ROC 0.86) in English (e5). PCA compression reduces the discriminative power of CatBoost only slightly but has a much more negative impact on the HBOS model. These results demonstrate that modern multilingual sentence embeddings combined with gradient-boosted decision trees provide robust machine learning solutions for multilingual hate speech detection applications.
Published	Basel : MDPI
Type	Journal article
Language	English
Publication date	2026
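
The abstract reports three figures for each configuration: accuracy, Cohen's Kappa, and AUC ROC. As a reading aid, the sketch below computes all three from their textbook definitions in plain Python. It is not the authors' pipeline. The function names and the toy labels and scores are hypothetical, chosen only to illustrate how the metrics relate, and the 0.5 decision threshold is an assumption.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    p_e = sum(
        (list(y_true).count(c) / n) * (list(y_pred).count(c) / n) for c in labels
    )
    return (p_o - p_e) / (1 - p_e)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = hateful, 0 = neutral); not taken from the paper.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(f"accuracy={acc:.3f} kappa={kappa:.3f} auc_roc={auc:.4f}")
```

Accuracy and Kappa depend on the chosen threshold, while AUC ROC is computed from the raw scores, which is why a model can show a high AUC ROC alongside a more modest Kappa, as in the figures quoted in the abstract.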
