圆形网格抽样和逆近邻优化的密度峰值聚类算法

Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

  • 摘要: 密度峰值聚类(DPC)算法是一种简单高效的聚类算法,因其可直观和快速发现数据集中的类簇而得到广泛关注. 但DPC算法需计算所有样本间的欧氏距离,算法的时间复杂度较高;局部密度定义未考虑类簇间密度差异影响,易误选类簇中心;使用链式分配策略,易产生错误连带效应. 因此,本文提出一种圆形网格抽样和逆近邻优化的密度峰值聚类算法. 该算法采用圆形网格抽样得到代表以减少需要计算的样本数,降低算法计算的时间开销,并引入近似K近邻策略加强代表和初始样本的联系,减少抽样导致的聚类精度丢失;利用逆近邻优化局部密度定义策略,根据样本所处环境调节其局部密度的大小,准确找到密度峰值;通过共享逆近邻计算相似性,由相似性矩阵分配代表,避免样本分配策略产生的错误连带效应. 设置了复杂形态合成数据集、真实数据集和较大规模数据集进行分组实验. 实验结果表明,本文算法在复杂形态、真实及较大规模数据集上聚类优势显著,精度与效率较DPC算法及其他基于DPC的改进算法均有较大提升.

     

    Abstract: The density peak clustering (DPC) algorithm has garnered significant attention in the research community because of its simple, intuitive, and efficient framework for identifying cluster centers in datasets. The core strength of the algorithm lies in its ability to discover clusters of arbitrary shapes by making a fundamental assumption that cluster centers are characterized by a higher local density than their neighbors and are relatively distant from points with higher densities. Despite its effectiveness, the canonical DPC algorithm has several critical limitations that limit its performance and applicability, particularly for complex and large-scale data. First, its computational complexity is prohibitively high, scaling quadratically with the number of samples, because it requires the calculation of a full pairwise Euclidean distance matrix. Second, its definition of local density fails to account for inter-cluster density variations, often leading to erroneous selection of cluster centers in datasets with heterogeneous density distributions. Third, its sequential, chain-like assignment strategy for non-center points is susceptible to a “domino effect,” where a single incorrect assignment can trigger a cascade of subsequent errors, severely compromising the final clustering accuracy. To overcome these deficiencies, this study proposes a novel and robust variant of the DPC algorithm, called the density peak clustering algorithm with circle-division sampling and reverse nearest neighbor optimization (CDPC-RNN). The proposed algorithm systematically enhanced each stage of the DPC process. First, to address the computational bottleneck, we introduced a circle-division sampling method. This strategy effectively generated a set of representative points that preserved the underlying data distribution while substantially reducing the number of samples required for distance calculations, thereby significantly decreasing the time overhead of the algorithm. To mitigate any potential loss of clustering precision resulting from this sampling, an approximate K-nearest neighbor strategy was employed to fortify the topological link between the representative points and original samples. Second, we refined the cluster center identification process by optimizing the definition of local density. By leveraging the concept of reverse nearest neighbors, our approach reevaluated the local density of a point based on its surrounding environment. This adaptive density calculation allowed the algorithm to accurately distinguish true density peaks from spurious ones, particularly in complex datasets where clusters exhibited significant variations in density and scale. Finally, we replaced the fragile chain-like assignment policy with a more robust mechanism. By calculating the similarity between representative points based on their shared reverse nearest neighbors, we constructed a global similarity matrix. The final assignment of points to clusters was guided by this matrix, which circumvent the cascading errors inherent in the sequential approach of the original DPC algorithm. The performance of the proposed algorithm was rigorously evaluated through a series of comparative experiments conducted on diverse datasets, including synthetic datasets with complex morphological structures, real-world benchmark datasets, and large-scale datasets. The experimental results unequivocally demonstrated that our algorithm has a significant advantage over both the original DPC and several other state-of-the-art DPC-based improvements. The proposed CDPC-RNN method achieved substantial enhancements in both clustering accuracy and computational efficiency, establishing it as a powerful and reliable solution for a wide range of clustering tasks.

     

/

返回文章
返回