Inference acceleration for Large Language Models using "stairs" assisted greedy generation

Title	Inference acceleration for Large Language Models using "stairs" assisted greedy generation
Authors	Grigaliūnas, Domas ; Lukoševičius, Mantas
DOI	10.15388/Proceedings.2024.44
Is Part of	IVUS2024: 29th international conference "Information society and university studies", Vilnius University, Kaunas Faculty, Kaunas, Lithuania, May 17th, 2024: abstracts. Vilnius : Vilniaus universiteto leidykla. 2024, p. 25
Abstract [eng]	Large Language Models (LLMs) with billions of trained parameters are known for their impressive predictive capabilities but suffer from slow inference due to their size. Smaller models, on the other hand, run faster but may sacrifice accuracy. In this paper, we propose an implementation of “stairs” assisted greedy generation. It is a modified assisted generation methodology that combines a smaller model’s fast generation, a large model’s batch prediction, and “stairs” validation to speed up prediction generation. Results show an inference time improvement of between 9.58 and 17.24 percent compared to standalone large LLM prediction in a text generation task, without a loss in accuracy.
Published	Vilnius : Vilniaus universiteto leidykla
Type	Conference paper
Language	English
Publication date	2024
CC license
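
The abstract names three ingredients: fast drafting by a small model, batch prediction by the large model, and "stairs" validation. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes that "stairs" means a batch of growing prefixes (each row one draft token longer than the previous one) that the large model scores in a single call. All function names and the toy integer "models" are hypothetical stand-ins for real LLMs.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]
BatchNextTokenFn = Callable[[List[List[Token]]], List[Token]]


def build_stairs(prefix: Sequence[Token], draft: Sequence[Token]) -> List[List[Token]]:
    """Row i is the prefix plus the first i draft tokens (assumed 'stairs' layout)."""
    return [list(prefix) + list(draft[:i]) for i in range(len(draft) + 1)]


def stairs_assisted_greedy(
    prompt: Sequence[Token],
    small_next: NextTokenFn,
    large_next_batch: BatchNextTokenFn,
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The small model drafts a few tokens greedily, one at a time.
        draft: List[Token] = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(small_next(tokens + draft))
        # 2. The large model predicts the next token for every stair in one batch.
        large_preds = large_next_batch(build_stairs(tokens, draft))
        # 3. Accept draft tokens for as long as they match the large model.
        accepted = 0
        while accepted < len(draft) and draft[accepted] == large_preds[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted])
        # 4. Append the large model's own token at the first mismatch
        #    (or after a fully accepted draft) if there is still room.
        if len(tokens) < target_len:
            tokens.append(large_preds[accepted])
    return tokens


# Toy stand-ins for the two models (purely illustrative).
def toy_large_next(seq: Sequence[Token]) -> Token:
    return (sum(seq) * 31 + len(seq)) % 50


def toy_small_next(seq: Sequence[Token]) -> Token:
    guess = toy_large_next(seq)
    # The small model disagrees with the large one on every third position.
    return guess if len(seq) % 3 else (guess + 1) % 50


def toy_large_next_batch(rows: List[List[Token]]) -> List[Token]:
    return [toy_large_next(row) for row in rows]


def plain_greedy(prompt: Sequence[Token], next_fn: NextTokenFn, n: int) -> List[Token]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens
```

Because a draft token is kept only when it equals the large model's greedy choice for the same prefix, the output in this sketch is identical to plain greedy decoding with the large model alone, which is consistent with the "without a loss in accuracy" claim. The toy models cannot reproduce the reported timing gains. With a real transformer, the stairs rows have different lengths and would need padding and an attention mask before being batched.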
