Title Inference acceleration for transformer-based architectures
Translation of Title Prognozavimo spartinimas transformerių architektūroms.
Authors Grigaliūnas, Domas
Pages 50
Keywords [eng] inference ; temporal fusion transformer ; assisted generation ; large language model
Abstract [eng] This Master’s thesis focuses on improving inference performance in two transformer-based machine learning architectures. The first is the Temporal Fusion Transformer (TFT), a strong baseline architecture for multivariate, multi-horizon time series forecasting. The second is an encoder-decoder-based Large Language Model (LLM) such as Google’s Gemini. These models are usually trained on large-scale datasets, deployed at cloud scale, and served to millions of users concurrently, so inference performance (i.e. model serving speed) is critically important. For the TFT model, the goal is to identify potential improvements, apply them, and compare the results against the original implementation. Similarly, for the LLM, the work explores how assisted generation techniques can be applied to reduce inference latency, following the same cycle of identifying, implementing, and evaluating improvements. The document is structured into four chapters. The first chapter presents the theoretical foundations and related work. The second outlines the proposed improvements and the environments used for experimentation. The third details the experimental setups and results. The final chapter summarises the findings and concludes the study.
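The assisted generation approach mentioned in the abstract can be illustrated with the Hugging Face transformers library, which exposes it through the assistant_model argument of generate(). The snippet below is a minimal sketch under that assumption, not the implementation evaluated in the thesis; the OPT checkpoints are placeholder choices for a large target model and a small draft model that share a tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Large target model whose output distribution we want to preserve.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)

# Much smaller draft model that proposes candidate tokens cheaply.
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)

inputs = tokenizer("Inference acceleration matters because", return_tensors="pt").to(device)

# Passing assistant_model enables assisted generation: the draft model speculates
# several tokens ahead and the target model verifies them in a single forward pass,
# which reduces latency whenever the draft's proposals are frequently accepted.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The gain from this technique depends on how often the target model accepts the draft's tokens, so the choice of draft model is a trade-off between proposal quality and proposal cost.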
Dissertation Institution Kauno technologijos universitetas.
Type Master thesis
Language English
Publication date 2025