Resampling algorithm of imbalanced data based on neighbor relationship
Abstract: The classification of imbalanced data has become a critical research issue in many data-intensive applications. To improve classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbour relationship of the sample space (RSNR) is proposed. The method first evaluates the safety level of each minority sample from its spatial neighbour relationship and uses that safety level to guide SMOTE oversampling. It then computes the local density of each majority sample from its spatial neighbour relationship and under-samples the majority class in densely populated regions. Together, these two steps balance the data set and control its size to prevent overfitting, so that the two classes are classified more evenly. Training and test sets were generated by 5×10-fold cross-validation. After the training set was resampled, an Extreme Learning Machine (ELM) was trained as the classifier and evaluated on the test set. Experimental results on UCI imbalanced data sets and on measured circuit fault diagnosis data show that the proposed method outperforms the other resampling algorithms overall.
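The two-stage procedure described in the abstract can be sketched in Python. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the safety level is taken as the fraction of minority samples among a point's k nearest neighbours, the local density as the inverse mean distance to the k nearest majority samples, and the two classes are moved halfway towards each other. The function name `resample_sketch` and its parameters are hypothetical and are not the authors' RSNR implementation.

```python
import numpy as np

def resample_sketch(X, y, k=5, minority=1, seed=0):
    """Illustrative neighbour-based resampling (assumed definitions, not RSNR itself).

    Expects at least two minority samples and at least two majority samples.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances; the diagonal is masked so that a sample
    # is never counted as its own neighbour.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    # Assumption: add n_move synthetic minority samples and drop n_move
    # majority samples, so the classes meet halfway.
    n_move = (len(maj_idx) - len(min_idx)) // 2

    # Step 1: safety level of each minority sample (assumed definition):
    # fraction of minority samples among its k nearest neighbours.
    nn_all = np.argsort(D[min_idx], axis=1)[:, :k]
    safety = (y[nn_all] == minority).mean(axis=1)

    # Step 2: SMOTE guided by the safety level. Seeds are drawn with
    # probability proportional to safety, so isolated points (safety 0)
    # are not used to generate synthetic samples.
    if safety.sum() > 0:
        p = safety / safety.sum()
    else:
        p = np.full(len(min_idx), 1.0 / len(min_idx))
    seeds = rng.choice(len(min_idx), size=n_move, p=p)
    D_min = D[np.ix_(min_idx, min_idx)]
    k_min = min(k, len(min_idx) - 1)
    nn_min = np.argsort(D_min, axis=1)[:, :k_min]
    partners = nn_min[seeds, rng.integers(0, k_min, size=n_move)]
    X_min = X[min_idx]
    gap = rng.random((n_move, 1))
    synthetic = X_min[seeds] + gap * (X_min[partners] - X_min[seeds])

    # Step 3: local density of each majority sample (assumed definition):
    # inverse of the mean distance to its k nearest majority neighbours.
    D_maj = D[np.ix_(maj_idx, maj_idx)]
    k_maj = min(k, len(maj_idx) - 1)
    mean_dist = np.sort(D_maj, axis=1)[:, :k_maj].mean(axis=1)
    density = 1.0 / (mean_dist + 1e-12)

    # Step 4: under-sample by dropping the n_move densest majority samples.
    keep = maj_idx[np.argsort(density)[: len(maj_idx) - n_move]]

    X_out = np.vstack([X[keep], X_min, synthetic])
    y_out = np.concatenate(
        [y[keep], y[min_idx], np.full(n_move, minority, dtype=y.dtype)]
    )
    return X_out, y_out
```

In a cross-validation setting such as the one the abstract describes, a function like this would be applied to each training fold only, leaving the test fold untouched. Dropping the densest majority samples in a single pass is a simplification; an iterative variant that recomputes density after each removal would thin dense regions more evenly.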
Keywords:
- imbalanced data
- neighbour relationship
- resample
- local density
- classification