Abstract [eng] |
As information technologies develop and the volume of data that businesses create and store grows, the need to digitize companies and automate their processes is becoming evident. One of the most important and influential information technology systems in business is the database. The choice of database affects subsequent business decisions on how to store, analyze and manage the available data. Relational databases, one of the most popular types of database in business, are usually chosen for their structured data schema, stability and convenience in daily operations. To extract business-relevant information from the available data and support decision-making, the information stored in relational databases can be used in machine learning models; however, additional transformations are required. Data for machine learning models is usually presented as a single table, so the relational structure must be transformed and prepared during the feature engineering stage before the model can be trained. This project analyzes the possibilities of automating feature engineering in relational databases. The aim is to examine systems and algorithms that can automate data preparation for model training. Two tools are examined: the Python package Featuretools, which uses the Deep Feature Synthesis (DFS) algorithm for feature creation, and the getML system, which can automate feature engineering with the FastProp, Relboost, Multirel and RelMT algorithms. The analysis is performed on 6 relational databases. Additional features are automatically generated from the available database tables, and the resulting feature table is used to train an XGBoost model. The evaluation results show that the models trained after automated feature engineering achieve good accuracy for both classification and regression tasks.
The algorithms considered can create between 17 and 238 additional features within a few minutes using mathematical methods. Across the models trained on the analyzed databases, the best accuracy was obtained with the FastProp algorithm of the getML system. |
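To illustrate the kind of transformation the abstract describes, the following is a minimal sketch of flattening a one-to-many relation into a single feature table by aggregation. It is not the Featuretools or getML API and does not reproduce the DFS or FastProp algorithms; the tables `customers` and `orders`, their columns, and the helper `aggregate_child` are hypothetical names chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parent table: one row per customer (the entity to predict for).
customers = [
    {"customer_id": 1, "country": "LT"},
    {"customer_id": 2, "country": "DE"},
]

# Hypothetical child table: many orders per customer.
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]


def aggregate_child(parent_rows, child_rows, key, column):
    """Flatten a one-to-many relation into per-parent aggregate features."""
    # Group the child values by the foreign key that points to the parent.
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[key]].append(row[column])

    features = []
    for parent in parent_rows:
        values = grouped.get(parent[key], [])
        flat = dict(parent)  # keep the parent's own columns
        flat[f"COUNT({column})"] = len(values)
        flat[f"SUM({column})"] = sum(values)
        flat[f"MEAN({column})"] = mean(values) if values else None
        flat[f"MAX({column})"] = max(values) if values else None
        features.append(flat)
    return features


# One row per customer, ready to be passed to a tabular model.
feature_table = aggregate_child(customers, orders, "customer_id", "amount")
```

Each aggregate becomes one additional column of the parent table. Automated tools apply many such aggregations across all related tables, which is how feature counts in the range reported above can plausibly arise.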