Title Verslo procesų žaidybinimas ir imitacinis modeliavimas pritaikant skatinamąjį mokymąsi ir sintetinius duomenis
Translation of Title Gamification and simulation of business processes using reinforcement learning and synthetic data.
Authors Juozelskis, Justas
Pages 90
Keywords [eng] reinforcement learning ; business process simulation ; gamification ; synthetic data
Abstract [eng] The final project analyzes the application of reinforcement learning methods to business process simulation. The relevance of the study lies in the fact that applying reinforcement learning to business process simulation enables the modeling of complex and uncertain business scenarios using synthetic data. Synthetic data reflects the outcomes of experimental conditions and allows process analysis without real-world risks, while gamification helps incorporate elements of a dynamic environment typical of real business situations. The study describes two simulated business process environments. The first environment involves scenarios related to high-value client management, while the second introduces the servicing of risky orders under market constraints. In both environments, the reinforcement learning agent is trained within a Markov Decision Process structure, where states are represented by multidimensional vectors incorporating data on clients, obstacles, orders, and the company’s status in the environment. The action space is defined by four movement directions, and the reward function consists of five components: a base reward for servicing clients, a reward for efficient movement, and penalties for inefficient movement, collisions with obstacles, and the execution of risky orders. Gamification serves as the methodological foundation for the simulation environment logic, which is based on the principles of the classic “Snake” game enhanced with business logic. The DQN and PPO methods, implemented using the Stable-Baselines3 library, are applied to train the reinforcement learning agents. The training process is divided into three stages (training, exploration, and application), during which the numbers of regular and high-value clients served, total rewards, and error frequency are evaluated. In the high-value client service environment, the PPO method demonstrated better results, averaging 14 regular and 4.1 high-value clients served, with a maximum episodic reward of 2011.2.
In contrast, DQN averaged only 10.8 clients served, with a reward of 1076.4. The PPO method not only reduced the number of unserved high-value clients but also optimized movement logic, achieving a 75th percentile value 2.7 times higher than that of DQN. In the risky order service environment, PPO’s results remained significantly superior: its average reward during the application phase was 708.9, compared to DQN’s 288.9. PPO almost entirely avoided obstacle collisions in the application phase, whereas DQN encountered at least one obstacle in every seventh episode. The PPO method not only generates higher rewards but also maintains behavioral consistency, integrates high-value clients, and dynamically adapts to risky decision-making within its applied strategy. Meanwhile, DQN exhibits frequent local optimization and an inability to adapt effectively to the business logic.
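The abstract describes a Snake-inspired grid environment with a four-direction action space, a multidimensional state vector, and a five-component reward, trained with DQN and PPO via Stable-Baselines3. The sketch below illustrates what such an environment could look like; it is a minimal assumption-based example, not the thesis's actual implementation. The class name `ClientServiceEnv`, the grid size, and all reward weights are hypothetical.

```python
import numpy as np

class ClientServiceEnv:
    """Illustrative Snake-style business-process environment (hypothetical).

    State: flat vector of agent, regular-client, high-value-client, and
    obstacle positions plus a company-status scalar.
    Actions: four movement directions, as described in the abstract.
    Reward: five components (base service reward, movement shaping bonus,
    inefficient-movement penalty, collision penalty, risky-order penalty).
    """
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up/down/left/right

    def __init__(self, size=10):
        self.size = size
        self.n_actions = 4

    def _obs(self):
        # Normalize positions and status into a single 9-dimensional vector.
        return (np.concatenate([self.agent, self.client, self.vip,
                                self.obstacle, [self.status]])
                / self.size).astype(np.float32)

    def reset(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.agent, self.client, self.vip, self.obstacle = (
            self.rng.integers(0, self.size, 2) for _ in range(4))
        self.status = 0.0
        return self._obs(), {}

    def step(self, action):
        prev = np.abs(self.agent - self.client).sum()
        self.agent = np.clip(self.agent + self.MOVES[int(action)],
                             0, self.size - 1)
        reward, terminated = 0.0, False
        if np.array_equal(self.agent, self.client):    # 1) base service reward
            reward += 10.0
            self.client = self.rng.integers(0, self.size, 2)
        if np.array_equal(self.agent, self.vip):       # higher reward for VIP client
            reward += 50.0
            self.vip = self.rng.integers(0, self.size, 2)
        now = np.abs(self.agent - self.client).sum()
        reward += 0.5 if now < prev else -0.5          # 2)/3) movement shaping
        if np.array_equal(self.agent, self.obstacle):  # 4) collision penalty
            reward -= 20.0
            terminated = True
        self.status = min(1.0, self.status + 0.01)     # company status drifts
        reward -= 0.1 * self.status                    # 5) risky-order penalty (proxy)
        return self._obs(), reward, terminated, False, {}
```

Wrapped as a `gymnasium.Env` subclass with `Discrete`/`Box` action and observation spaces, an environment of this shape can be passed directly to Stable-Baselines3, e.g. `PPO("MlpPolicy", env).learn(total_timesteps=...)`, matching the toolchain named in the abstract.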
Dissertation Institution Kauno technologijos universitetas.
Type Master thesis
Language Lithuanian
Publication date 2025