WU Sen, FENG Xiao-dong, YANG Jie, ZHANG Xiao-nan. Parallel clustering of very large document datasets with MapReduce[J]. Chinese Journal of Engineering, 2014, 36(10): 1411-1419. DOI: 10.13374/j.issn1001-053x.2014.10.019

Parallel clustering of very large document datasets with MapReduce

  • Abstract: Developing fast and effective clustering methods for large-scale document data is an active topic in data mining research and applications. To preserve clustering quality while improving clustering efficiency, a document clustering algorithm is proposed that is based on searching for a "mutually minimum-similarity document pair" (two documents that are each other's least similar document), together with a distributed parallel computing model. First, a document similarity measure is defined using the vector space model (VSM). Second, the two initial cluster centers of each bisection are selected by searching for a mutually minimum-similarity document pair, which yields a bisecting K-means algorithm that optimizes the cluster centroids with a single partition. Finally, a parallel clustering model for large-scale document data is designed on the MapReduce framework for cloud computing applications. Experiments on the Hadoop platform with real document datasets show that, compared with the original bisecting K-means, the proposed algorithm achieves comparable clustering quality with a clear efficiency advantage, and that the parallel clustering model scales well across different data sizes and numbers of computing nodes.
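The abstract does not spell out the search procedure, so the following is only one plausible reading of the "mutually minimum-similarity document pair" idea, not the authors' published method. The sketch assumes cosine similarity over VSM term-weight vectors and a simple hopping search: move from a document to its least similar document until two documents point at each other. All function names (`cosine_similarity`, `least_similar`, `mutual_least_similar_pair`, `bisect`) and the `max_steps` guard are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length term-weight vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty document as dissimilar to everything
    return dot / (norm_a * norm_b)


def least_similar(docs, i):
    """Index of the document least similar to docs[i]; ties go to the lowest index."""
    candidates = [j for j in range(len(docs)) if j != i]
    return min(candidates, key=lambda j: cosine_similarity(docs[i], docs[j]))


def mutual_least_similar_pair(docs, start=0, max_steps=100):
    """Hop to the least similar document until two documents are each other's
    least similar document. max_steps guards against cycles caused by ties."""
    i = start
    j = least_similar(docs, i)
    for _ in range(max_steps):
        k = least_similar(docs, j)
        if k == i:
            return i, j
        i, j = j, k
    return i, j  # fallback: last pair visited, not guaranteed to be mutual


def bisect(docs, seeds):
    """Split the documents into two clusters in a single pass by assigning
    each document to the more similar of the two seed documents."""
    a, b = seeds
    left, right = [], []
    for idx, doc in enumerate(docs):
        if cosine_similarity(doc, docs[a]) >= cosine_similarity(doc, docs[b]):
            left.append(idx)
        else:
            right.append(idx)
    return left, right
```

In a bisecting scheme, `bisect` would be applied repeatedly to the cluster chosen for splitting until K clusters exist. In a MapReduce setting the per-document similarity computations could plausibly be distributed across mappers, but that mapping is likewise an assumption here.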

     
