Title Inference acceleration for Large Language Models using "stairs" assisted greedy generation
Authors Grigaliūnas, Domas ; Lukoševičius, Mantas
DOI 10.15388/Proceedings.2024.44
Is Part of IVUS2024: 29th international conference "Information society and university studies", Vilnius University, Kaunas Faculty, Kaunas, Lithuania, May 17th, 2024: abstracts. Vilnius : Vilniaus universiteto leidykla. 2024, p. 25
Abstract [eng] Large Language Models (LLMs) with billions of trained parameters are known for their impressive predictive capabilities but suffer from slow inference due to their size. Smaller models, on the other hand, run faster but may sacrifice accuracy. In this paper, we propose an implementation of “stairs” assisted greedy generation. It is a modified assisted generation methodology that combines a smaller model’s fast generation, a large model’s batch prediction, and “stairs” validation to speed up prediction generation. Results show an inference time improvement of between 9.58 and 17.24 percent compared to standalone large LLM prediction in a text generation task, without a loss in accuracy.
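The abstract names three ingredients (small-model drafting, large-model batch prediction, and "stairs" validation) without spelling out the procedure. The following is a minimal sketch of how such a loop could look, not the authors' implementation. It assumes that "stairs" means a batch of progressively longer prefixes, each one draft token longer than the last; that reading, the toy integer-token models, and every name in the code (`stairs_assisted_greedy`, `small_next`, `large_next_batch`, `draft_len`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Token = int


def stairs_assisted_greedy(
    prompt: List[Token],
    small_next: Callable[[List[Token]], Token],
    large_next_batch: Callable[[List[List[Token]]], List[Token]],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy decoding where a small model drafts and a large model verifies."""
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        k = min(draft_len, target_len - len(out))
        # 1. Draft k tokens one by one with the small (fast) model.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(small_next(out + draft))
        # 2. Build the assumed "stairs": prefixes that grow by one draft token.
        stairs = [out + draft[:i] for i in range(k)]
        # 3. The large model predicts the next token for every step in one batch.
        verified = large_next_batch(stairs)
        # 4. Keep the large model's tokens up to and including the first mismatch.
        for drafted, checked in zip(draft, verified):
            out.append(checked)
            if drafted != checked:
                break
    return out


# Toy stand-ins for real models (assumptions for this sketch only).
def large_next(seq: List[Token]) -> Token:
    return (3 * seq[-1] + 1) % 7


def small_next(seq: List[Token]) -> Token:
    # Agrees with the large model except after token 5.
    guess = large_next(seq)
    return guess if seq[-1] != 5 else (guess + 1) % 7


def large_next_batch(batch: List[List[Token]]) -> List[Token]:
    return [large_next(seq) for seq in batch]


def plain_greedy(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(large_next(out))
    return out


if __name__ == "__main__":
    assisted = stairs_assisted_greedy([2], small_next, large_next_batch, 10)
    print(assisted == plain_greedy([2], 10))
```

Because every appended token comes from the large model's greedy prediction on a prefix it has already confirmed, the output in this sketch matches plain greedy decoding with the large model, which is consistent with the "without a loss in accuracy" claim. Any speedup in practice would depend on how often the draft is accepted and on how cheaply the large model processes a batch; the toy models here do not measure that.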
Published Vilnius : Vilniaus universiteto leidykla
Type Conference paper
Language English
Publication date 2024