Title Comparison of modern multilingual text embedding techniques for hate speech detection task
Authors Vaičiukynas, Evaldas ; Danėnas, Paulius ; Ablonskis, Linas ; Šukys, Algirdas ; Dambrauskas, Edgaras ; Žitkus, Voldemaras ; Butkienė, Rita ; Butleris, Rimantas
DOI 10.3390/app16105099
Is Part of Applied sciences. Basel : MDPI. 2026, vol. 16, iss. 10, art. no. 5099, p. 1-24. ISSN 2076-3417
Keywords [eng] anomaly detection ; dimensionality reduction ; hate speech detection ; machine learning ; sentence embeddings
Abstract [eng] Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (gemma, qwen, bge, snow, jina, and e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding type, we train both a one-class histogram-based anomaly detector (HBOS) and a two-class gradient-boosted tree ensemble (CatBoost), with and without Principal Component Analysis (PCA) compression to 32-dimensional feature vectors. Across all datasets, two-class supervised models consistently and substantially outperform one-class anomaly detection, with the best configurations achieving up to 78.8% accuracy (Kappa 0.58, AUC ROC 0.87) in Lithuanian (jina), 92.2% accuracy (Kappa 0.77, AUC ROC 0.97) in Russian (e5), and 76.9% accuracy (Kappa 0.48, AUC ROC 0.86) in English (e5). PCA compression reduces the discriminative power of CatBoost only slightly but has a much more negative impact on the HBOS model. These results demonstrate that modern multilingual sentence embeddings combined with gradient-boosted decision trees provide robust machine learning solutions for multilingual hate speech detection applications.
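Illustrative sketch (not part of the publication): the one-class branch described in the abstract, PCA compression to 32 dimensions followed by a histogram-based outlier score, can be outlined roughly as below. This is a minimal NumPy-only approximation, not the authors' pipeline. The embeddings are synthetic random vectors standing in for real encoder output, the choice to fit PCA on the pooled training set is an assumption, and all function names, the bin count, and the 128-dimensional input size are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA via SVD; returns the mean and the top principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T

def fit_hbos(X, n_bins=10):
    """Static-bin HBOS: one density histogram per feature, fitted on normal data."""
    return [np.histogram(X[:, j], bins=n_bins, density=True)
            for j in range(X.shape[1])]

def score_hbos(X, models, eps=1e-9):
    """Sum of negative log densities over features; higher means more anomalous."""
    scores = np.zeros(X.shape[0])
    for j, (density, edges) in enumerate(models):
        idx = np.searchsorted(edges, X[:, j], side="right") - 1
        idx[X[:, j] == edges[-1]] = len(density) - 1  # right edge belongs to last bin
        inside = (idx >= 0) & (idx < len(density))
        d = np.where(inside, density[np.clip(idx, 0, len(density) - 1)], 0.0)
        scores += -np.log(d + eps)  # values outside the fitted range get density 0
    return scores

# Synthetic stand-ins for sentence embeddings (assumption: 128-dim, shifted class).
rng = np.random.default_rng(0)
normal_train = rng.normal(size=(400, 128))
toxic_train = rng.normal(size=(100, 128)) + 1.5
normal_test = rng.normal(size=(100, 128))
toxic_test = rng.normal(size=(100, 128)) + 1.5

# PCA is unsupervised, so here it is fitted on the pooled training embeddings.
mean, components = fit_pca(np.vstack([normal_train, toxic_train]), n_components=32)

# The one-class detector sees only the non-hateful class during fitting.
models = fit_hbos(apply_pca(normal_train, mean, components))

scores_normal = score_hbos(apply_pca(normal_test, mean, components), models)
scores_toxic = score_hbos(apply_pca(toxic_test, mean, components), models)

# AUC ROC as the fraction of (toxic, normal) pairs ranked correctly (ties ignored).
auc = float(np.mean(scores_toxic[:, None] > scores_normal[None, :]))
print(f"AUC ROC on synthetic data: {auc:.3f}")
```

The two-class branch would replace the HBOS step with a supervised classifier trained on both classes; it is omitted here to keep the sketch dependency-free. The synthetic score says nothing about the results reported in the abstract.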
Published Basel : MDPI
Type Journal article
Language English
Publication date 2026