| Abstract [eng] |
With the rapid advancement of artificial intelligence in natural language processing, chatbots are becoming more sophisticated. However, developing them for low-resource languages such as Lithuanian remains challenging because of limited data and high training costs. This study analyzes and improves Lithuanian dialogue models using large language models (LLMs) and parameter-efficient fine-tuning (PEFT) methods. The proposed strategy comprises synthetic data generation, integration with Lithuanian datasets, model adaptation with the LoRA method, comprehensive data filtering, and tokenization evaluation. LLaMA 3.2 and Gemma 3 models (1B to 4B parameters) were trained and evaluated on standardized Lithuanian benchmarks (MMLU-LT, ARC-LT, TruthfulQA-LT). The analysis shows that PEFT methods, especially LoRA, adapt LLMs to Lithuanian effectively while using significantly fewer resources than full retraining. Combining generated synthetic data with filtered public datasets substantially improved the models’ contextual understanding and response quality. Among the models, Gemma 3 4B (LoRA) achieved high results on TruthfulQA MC2 (54.0%) and ARC-LT (50.8%), while LLaMA 3.2 3B led on MMLU (38.8%) and BLEU (0.323). The PEFT models developed in this study achieved results competitive with or even superior to those of larger, fully retrained LLaMA models. The findings indicate that the strategic use of PEFT methods together with high-quality, diverse datasets makes it possible to develop and improve Lithuanian dialogue models with limited resources, contributing to the advancement of Lithuanian language technologies. |