Citation: WU Sen, WANG Yu-zhi, GAO Xiao-nan. Clustering algorithm for imbalanced data based on nearest neighbor[J]. Chinese Journal of Engineering, 2020, 42(9): 1209-1219. DOI: 10.13374/j.issn2095-9389.2019.10.09.003

Clustering algorithm for imbalanced data based on nearest neighbor

Abstract: Clustering is an important task in data mining. Most clustering algorithms handle balanced datasets effectively but perform poorly on imbalanced ones. For example, K–means, a classical partition clustering algorithm, tends to produce a “uniform effect” on imbalanced datasets: it often yields clusters of relatively uniform size, with small clusters “swallowing” part of the data objects that belong to large clusters, so that the number and density of data objects in different clusters tend to be the same. To address this “uniform effect”, a clustering algorithm for imbalanced data based on nearest neighbor (CABON) is proposed. First, an initial clustering of the data objects is performed, and an undetermined-cluster set is defined to collect the data objects whose cluster membership in the initial result needs further checking. Then, proceeding from the edge of this set to its center, the nearest-neighbor idea is used to reassign each data object in the undetermined-cluster set to the cluster of its nearest neighbor, while the set is dynamically adjusted. This yields the final clustering result and avoids the influence of the “uniform effect”. The proposed algorithm is compared with K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results show that CABON effectively reduces the “uniform effect” that K–means produces on imbalanced data, and that its clustering results are clearly superior to those of the K–means, MC_IK, and CVCN algorithms.
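The reassignment step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how the undetermined-cluster set is constructed or how the edge-to-center order is computed, so the sketch makes two assumptions: the set is passed in by the caller, and "edge to center" means the pending object closest to any already-settled object is processed first. The function name `reassign_undetermined` and the toy data are hypothetical.

```python
import math


def reassign_undetermined(points, labels, undetermined):
    """Reassign undetermined objects to the cluster of their nearest neighbor.

    Assumption (not from the paper): objects are processed "edge first",
    i.e. the pending object closest to any settled object is handled next,
    and once handled it becomes a settled neighbor for the remaining ones.
    """
    labels = list(labels)          # work on a copy; the input is not modified
    pending = set(undetermined)    # the undetermined-cluster set, shrinking as we go
    settled = [i for i in range(len(points)) if i not in pending]
    if not settled:
        raise ValueError("at least one object must have a settled label")

    while pending:
        # Closest (pending, settled) pair; ties broken by index for determinism.
        _, p, s = min(
            (math.dist(points[p], points[s]), p, s)
            for p in pending
            for s in settled
        )
        labels[p] = labels[s]      # adopt the nearest neighbor's cluster
        pending.remove(p)          # dynamic adjustment of the pending set
        settled.append(p)
    return labels


# Toy imbalanced data: a large cluster at x = 0..9 and a small one at x = 12, 13.
points = [(float(x), 0.0) for x in list(range(10)) + [12, 13]]
# Hypothetical initial result showing a "uniform effect": the small cluster
# (label 1) has swallowed the objects at x = 7, 8, 9 from the large cluster.
initial = [0] * 7 + [1] * 5
final = reassign_undetermined(points, initial, undetermined=[7, 8, 9])
print(final)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

In this toy case the three swallowed objects are returned to the large cluster one at a time, starting with the one adjacent to the settled part of that cluster. The brute-force pair search costs O(n²) per step, which is acceptable for illustration only.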

     
