A Density-Based Method for Scalable Outlier Detection in Large Datasets
thesis
posted on 2020-08-01, 00:00 authored by Matteo Corain

DBSCAN is one of the best-known algorithms in the field of density-based clustering, although its applicability to large datasets is generally disputed because of its high computational complexity. The aim of this work is to propose a new, parallel, Spark-based procedure for the sole purpose of anomaly detection, in a way that is consistent with the DBSCAN definition and suitable for the big data context. From a theoretical standpoint, the algorithm is characterized by a worst-case performance bound that is linear in the size of the dataset; in practical tests, it outperforms available solutions in terms of both result quality and overall scalability as the data grow large.
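
For readers unfamiliar with the DBSCAN definition of an outlier, the sketch below may help. It is not the Spark-based procedure proposed in the thesis, only a brute-force, quadratic-time illustration of the standard definition: a point is noise if it is not a core point and no core point lies within distance eps of it. The function name `dbscan_outliers`, the parameter names `eps` and `min_pts`, and the sample points are assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan_outliers(points, eps, min_pts):
    """Return the indices of points that the DBSCAN definition labels as noise.

    Brute-force O(n^2) illustration of the definition only; this is NOT the
    linear-bound, parallel procedure described in the thesis.
    """
    n = len(points)
    # eps-neighborhood of each point (a point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its eps-neighborhood.
    is_core = [len(nb) >= min_pts for nb in neighbors]
    # Noise: not a core point and not within eps of any core point.
    return [
        i
        for i in range(n)
        if not is_core[i] and not any(is_core[j] for j in neighbors[i])
    ]


# Hypothetical sample data: a small dense square, one border point, one far point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (1.8, 1.0), (10.0, 10.0)]
outliers = dbscan_outliers(points, eps=1.0, min_pts=3)
print(outliers)
```

With these sample values, the point at index 4 is a border point (not core, but within eps of the core point at index 3), so only the isolated point at index 5 is reported. The all-pairs distance computation is exactly the cost that makes a naive approach impractical on large datasets.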