Abstract [eng] |
The main goal of this thesis is to detect anomalies in the “Little Data” data set. E-commerce is highly competitive, so it is important to know what determines a customer's decision to buy or not. Anomaly detection may help to find new determinants. Anomalies are large, sudden, unexplained changes in the data. Knowing the causes of an anomaly makes it possible to control it. The data set contains 9 variables, but only 7 of them are used in the analysis. It covers two years of daily data, from 2014-06-30 to 2016-07-01. The 732 rows of data contain 26 anomalies. The raw data is non-stationary and has a seasonal effect, so the data set was differenced and seasonally adjusted. Three methods were used for anomaly detection: an Autoregressive Integrated Moving Average model with Exogenous Input (ARIMA), a Vector Autoregression model with Exogenous Input (VAR), and the R package “AnomalyDetection”. Each method was applied at three different confidence levels (90 %, 95 %, 99 %). In every analysis the data was divided by year and by variable, and at the end the results from all variables were summed. The models were evaluated with a confusion matrix and its metrics (specificity and precision). The ARIMA model detected 14 anomalies with 99 % confidence intervals, 16 with 95 % confidence intervals and 18 with 90 % confidence intervals. The VAR model detected 15 anomalies with 99 % confidence intervals, 17 with 95 % confidence intervals and 20 with 90 % confidence intervals. The R package “AnomalyDetection” gave the worst result: 13 anomalies with 99 % confidence intervals, 14 with 95 % confidence intervals and 14 with 90 % confidence intervals. Overall, all three methods showed only average performance in anomaly detection. As a recommendation, a longer time series and more information about the anomalies would help to improve model accuracy. |
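The workflow summarised above (differencing, flagging points outside a confidence interval, scoring with specificity and precision) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `difference`, `flag_anomalies` and `confusion_metrics` are hypothetical, a plain normal interval around the mean stands in for the ARIMA/VAR prediction intervals, and the data is a toy example rather than the “Little Data” set.

```python
from statistics import NormalDist


def difference(series, lag=1):
    """Return the lag-differenced series (lag=7 would remove a weekly pattern)."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def flag_anomalies(residuals, confidence=0.95):
    """Flag values outside a two-sided normal interval around the mean.

    Assumption: `residuals` stands in for model residuals; the thesis uses
    model-based intervals, which this sketch does not reproduce.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95 %
    return [abs(r - mean) > z * sd for r in residuals]


def confusion_metrics(actual, predicted):
    """Compute confusion-matrix counts plus specificity and precision."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        # Specificity: share of normal days correctly left unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        # Precision: share of flagged days that are true anomalies.
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }


# Toy residuals with one obvious spike at the end (illustrative only).
toy_residuals = [0.1, -0.1] * 10 + [5.0]
flags_by_level = {c: flag_anomalies(toy_residuals, c) for c in (0.90, 0.95, 0.99)}

# Toy labels: 10 days, 2 true anomalies, 2 flagged days, 1 of them correct.
toy_actual = [False] * 8 + [True, True]
toy_predicted = [False] * 7 + [True, True, False]
toy_metrics = confusion_metrics(toy_actual, toy_predicted)
```

A wider interval (99 %) can only flag the same or fewer points than a narrower one (90 %), which matches the pattern in the reported counts, where each method detects fewer anomalies as the confidence level rises.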