Optimizing the Performance of a Grid-based DBSCAN Algorithm - The 21st National Industrial Engineering & Management Conference

DBSCAN has been one of the most investigated clustering algorithms since its initial introduction in 1996. Its ability to discover non-symmetric and abnormal patterns in data, together with its ability to detect outliers in data sets with different densities made it a good candidate for multiple researches. The main concern about DBSCAN was its scalability which is O(2). When it comes to processing a large dataset it may become an obstacle. Many suggestions have been made over the years to overcome this shortcoming. Some of them included linearization of nearest neighbor search around each core point. Other included more modern methods, such as using MapReduce technology. In most cases, however, these methods included a process which makes the algorithm much more complicated.

The current paper describes a different approach to the problem of DBSCAN performance improvement. According to this method, a SQL query is used to build a grid-based table. Each row in this table is equivalent to a cell in the multi-dimensional space. Each cell is equivalent to the neighboring area, as constructed by the DBSCAN. Using a distance method, which relates the traditional parameters of the DBSCAN (radius and number of points) into the size of the cells, for each cell the algorithm calculates an attribute which is similar to the Core, Border or Noise attributes, as in the original DBSCAN algorithm. Once this allocation has been completed, it is possible to differentiate between dense, non-dense and empty zones in the multi-space of the original variables, much like using the traditional DBSCAN. The differences between the traditional method and the suggested method are the following:

The new method is faster, but less accurate.
The new method enables the classification of new data, even with missing values.
The new method can handle both continuous and discrete data.

Given the above characteristics and assuming the trade-off between speed and accuracy is reasonable; the suggested algorithm may be considered a good alternative for large dataset processing.

The paper includes the following sections: Section 1, describes literature related work. Section two describes the algorithm and explains the link between the algorithm and the DBSCAN; Section three provides a numerical example based on a SQL script, which is detailed in the Appendix; and the last section, which concludes the paper.

A special Appendix includes a detailed noted Java code, which describes the algorithm.