Abstract
In this research project, the task of detecting vulnerabilities in software code using natural language processing and random forest methods is investigated. The literature review presents existing research in the field of vulnerability detection: the methods used, the models developed, the datasets used to build them, and their shortcomings. The research section notes that static analyzers suffer from a high rate of false positive predictions, and the potential of machine learning methods to address this problem is analyzed. The DiverseVul dataset of C code files was used for the experiments. The data was preprocessed, and the source code was converted into vector representations using CodeLlama large language models. In the experiment, four random forest classifiers were evaluated and compared with the static code analysis tool Flawfinder. Detection performance was measured with the accuracy, precision, recall, specificity, and Cohen's kappa metrics. The quantitative performance of the created models was investigated experimentally, and the best configuration was found to be the 7-billion-parameter CodeLlama model combined with a random forest in which the number of features considered at each split was set by the square-root rule, applying a classification threshold of 0.48 (determined by the equal error rate criterion). This configuration outperformed the static analyzer Flawfinder, correctly detecting more than three times as many vulnerable files. It was also found that, by selecting a classification threshold at which the proportion of correct target-class predictions matched that achieved by Flawfinder, the number of false positive cases was reduced by 64% compared to that static analysis tool. Based on the conducted empirical research, a demonstration tool for detecting code vulnerabilities was designed and implemented, and it was tested manually according to predefined scenarios.
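The pipeline summarized above can be illustrated with a short Python sketch. Everything in it is an illustrative assumption rather than the thesis code: the Hugging Face model name codellama/CodeLlama-7b-hf, the mean pooling of the last hidden states, and the toy training snippets are placeholders, since the abstract does not specify how the vectors were extracted; only the square-root feature rule and the 0.48 threshold come from the abstract itself.

    # Illustrative sketch only -- not the thesis implementation.
    # Assumed: codellama/CodeLlama-7b-hf as the embedding model and
    # mean pooling over the last hidden states.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.ensemble import RandomForestClassifier

    tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
    model = AutoModel.from_pretrained("codellama/CodeLlama-7b-hf")

    def embed(source: str) -> list[float]:
        """Convert a C source string into a fixed-size vector."""
        inputs = tokenizer(source, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0).tolist()    # mean pooling

    # Toy labeled examples (1 = vulnerable, 0 = benign); the thesis
    # used files from the DiverseVul dataset instead.
    sources = [
        'int main(void) { char b[8]; gets(b); return 0; }',   # unsafe gets()
        'int main(void) { puts("hello"); return 0; }',
    ]
    labels = [1, 0]

    # Random forest using the square-root rule for the number of
    # features considered at each split, as in the best configuration.
    clf = RandomForestClassifier(max_features="sqrt", random_state=0)
    clf.fit([embed(s) for s in sources], labels)

    # Apply the 0.48 classification threshold (equal error rate point).
    prob_vulnerable = clf.predict_proba([embed(sources[0])])[0][1]
    print("vulnerable" if prob_vulnerable >= 0.48 else "benign")

Note that the 7-billion-parameter model requires several gigabytes of memory to load; for quick experimentation, a smaller CodeLlama variant can be substituted without changing the rest of the sketch.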