Title Inference acceleration for transformer-based architectures
Translation of Title Prognozavimo spartinimas transformerių architektūroms.
Authors Grigaliūnas, Domas
Pages 50
Keywords [eng] inference ; temporal fusion transformer ; assisted generation ; large language model
Abstract [eng] This Master’s thesis focuses on improving inference performance in two transformer-based machine learning architectures. The first is the Temporal Fusion Transformer (TFT), a strong baseline architecture for multivariate, multi-horizon time series forecasting. The second is an encoder-decoder-based Large Language Model (LLM) such as Google’s Gemini. These models are usually trained on large-scale datasets, deployed at cloud scale, and served to millions of users concurrently, so inference performance (i.e. model serving speed) is critically important. For the TFT model, the goal is to identify potential improvements, apply them, and compare the results against the original implementation. Similarly, for the LLM, the work explores how assisted generation techniques can be applied to reduce inference latency, following the same cycle of identifying, implementing, and evaluating improvements. The document is structured into four chapters. The first chapter presents the theoretical foundations and related work. The second outlines the proposed improvements and the environments used for experimentation. The third details the experimental setups and results. The final chapter summarises the findings and concludes the study.
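The assisted generation approach mentioned in the abstract can be illustrated with the Hugging Face transformers library, which exposes it through the assistant_model argument of generate(). The snippet below is a minimal sketch under that assumption, not the implementation evaluated in the thesis; the OPT checkpoints are placeholder choices for a large target model and a small draft model that share a tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Large target model whose output distribution we want to preserve.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)

# Much smaller draft model that proposes candidate tokens cheaply.
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)

inputs = tokenizer("Inference acceleration matters because", return_tensors="pt").to(device)

# Passing assistant_model enables assisted generation: the draft model speculates
# several tokens ahead and the target model verifies them in a single forward pass,
# which reduces latency whenever the draft's proposals are frequently accepted.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The gain from this technique depends on how often the target model accepts the draft's tokens, so the choice of draft model is a trade-off between proposal quality and proposal cost.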
Dissertation Institution Kauno technologijos universitetas.
Type Master thesis
Language English
Publication date 2025