Abstract:
The density peak clustering (DPC) algorithm has garnered significant attention in the research community because of its simple, intuitive, and efficient framework for identifying cluster centers in datasets. The core strength of the algorithm lies in its ability to discover clusters of arbitrary shapes by making the fundamental assumption that cluster centers are characterized by a higher local density than their neighbors and are relatively distant from points with higher densities. Despite its effectiveness, the canonical DPC algorithm has several critical limitations that constrain its performance and applicability, particularly for complex and large-scale data. First, its computational complexity is prohibitively high, scaling quadratically with the number of samples, because it requires the calculation of a full pairwise Euclidean distance matrix. Second, its definition of local density fails to account for inter-cluster density variations, often leading to erroneous selection of cluster centers in datasets with heterogeneous density distributions. Third, its sequential, chain-like assignment strategy for non-center points is susceptible to a "domino effect," where a single incorrect assignment can trigger a cascade of subsequent errors, severely compromising the final clustering accuracy. To overcome these deficiencies, this study proposes a novel and robust variant of the DPC algorithm, called the density peak clustering algorithm with circle-division sampling and reverse nearest neighbor optimization (CDPC-RNN). The proposed algorithm systematically enhances each stage of the DPC process. First, to address the computational bottleneck, we introduce a circle-division sampling method. This strategy effectively generates a set of representative points that preserves the underlying data distribution while substantially reducing the number of samples required for distance calculations, thereby significantly decreasing the time overhead of the algorithm.
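The density-distance criterion at the heart of canonical DPC can be sketched as follows. This is a minimal illustration using a simple cutoff-kernel density, not the paper's exact formulation; the quadratic cost of the full pairwise distance matrix, identified above as the main bottleneck, is visible in the nested loops.

```python
import math

def dpc_scores(points, d_c):
    """Compute the two quantities canonical DPC assigns to each point:
    rho   -- local density: number of neighbors within the cutoff d_c
    delta -- distance to the nearest point of strictly higher density
    Cluster centers are points where both rho and delta are large."""
    n = len(points)
    # Full pairwise Euclidean distance matrix: the O(n^2) bottleneck.
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Cutoff-kernel local density.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Distance to nearest higher-density point; the global density peak
    # gets the maximum distance by convention.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

On a toy set of two tight blobs plus one isolated point, the blob members receive high rho, while the isolated point receives rho of zero and a delta equal to its distance to the nearest dense region.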
To mitigate any potential loss of clustering precision resulting from this sampling, an approximate K-nearest neighbor strategy is employed to fortify the topological link between the representative points and the original samples. Second, we refine the cluster center identification process by optimizing the definition of local density. By leveraging the concept of reverse nearest neighbors, our approach reevaluates the local density of a point based on its surrounding environment. This adaptive density calculation allows the algorithm to accurately distinguish true density peaks from spurious ones, particularly in complex datasets where clusters exhibit significant variations in density and scale. Finally, we replace the fragile chain-like assignment policy with a more robust mechanism. By calculating the similarity between representative points based on their shared reverse nearest neighbors, we construct a global similarity matrix. The final assignment of points to clusters is guided by this matrix, which circumvents the cascading errors inherent in the sequential approach of the original DPC algorithm. The performance of the proposed algorithm was rigorously evaluated through a series of comparative experiments conducted on diverse datasets, including synthetic datasets with complex morphological structures, real-world benchmark datasets, and large-scale datasets. The experimental results unequivocally demonstrated that our algorithm has a significant advantage over both the original DPC and several other state-of-the-art DPC-based improvements. The proposed CDPC-RNN method achieved substantial enhancements in both clustering accuracy and computational efficiency, establishing it as a powerful and reliable solution for a wide range of clustering tasks.
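The reverse-nearest-neighbor machinery underlying both the adaptive density and the similarity matrix can be illustrated with a rough sketch. The set overlap used here (a Jaccard score over shared reverse neighbors) is an assumed stand-in for the paper's precise similarity definition:

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of point i (excluding i itself)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return set(order[:k])

def reverse_nearest_neighbors(points, k):
    """RNN(i) = the set of points that count i among their own k nearest
    neighbors. A point with many reverse neighbors sits in a genuinely
    dense region; an outlier typically has an empty RNN set."""
    knn = [k_nearest(points, i, k) for i in range(len(points))]
    return [{j for j in range(len(points)) if i in knn[j]}
            for i in range(len(points))]

def shared_rnn_similarity(rnn, i, j):
    """Assumed similarity: Jaccard overlap of the two reverse-neighbor sets.
    Points embedded in the same dense neighborhood score high."""
    union = rnn[i] | rnn[j]
    return len(rnn[i] & rnn[j]) / len(union) if union else 0.0
```

Because every pairwise similarity is available at once, assignment decisions can be made from this global view rather than chained sequentially from point to point, which is how a matrix-guided strategy avoids the domino effect described above.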