Experiment of Document Clustering by Triple-pass Leader-follower Algorithm without Any Information on Threshold of Similarity (Data Engineering)
Most clustering algorithms require the number of clusters to be defined a priori, but this number is usually unknown in the situations where document clustering is applied. It would therefore be convenient if a clustering algorithm could be executed without any information on the number of clusters. This article attempts to develop such an algorithm by extending the leader-follower clustering algorithm, which is well suited to clustering large-scale datasets. Specifically, the threshold value required by the leader-follower clustering algorithm is estimated automatically from some pairs of documents by scanning the document file once before the standard leader-follower algorithm is executed. In particular, a triple-pass algorithm is proposed in which cluster vectors are generated in the second scan and each document is allocated to its most similar cluster in the third scan. The experimental results suggest that the triple-pass leader-follower clustering algorithm is sufficiently effective, and comparable with both the hierarchical Dirichlet process (HDP) mixture model and the spherical k-means algorithm with the number of clusters estimated automatically from the cover coefficient. The algorithm requires fewer computational iterations than the other two methods and is thus cost-effective.
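The three scans described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which document pairs are sampled or how the threshold is derived from them, so the rule used here (the mean cosine similarity of consecutive documents) is an assumption. The function names `cosine`, `estimate_threshold`, `nearest`, and `triple_pass_cluster` are hypothetical, as are the dense list-of-floats document vectors and the use of running sums as cluster vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_threshold(docs):
    """Pass 1: estimate the similarity threshold from some document pairs.

    ASSUMPTION: the pairs are consecutive documents and the estimate is
    their mean similarity; the abstract specifies neither choice.
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents")
    sims = [cosine(docs[i], docs[i + 1]) for i in range(len(docs) - 1)]
    return sum(sims) / len(sims)


def nearest(doc, cluster_vectors):
    """Return (index, similarity) of the most similar cluster vector.

    Returns (-1, -1.0) when no cluster exists yet.
    """
    best_idx, best_sim = -1, -1.0
    for idx, vec in enumerate(cluster_vectors):
        sim = cosine(doc, vec)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim


def triple_pass_cluster(docs):
    """Run the three scans and return (labels, cluster_vectors, threshold)."""
    # Pass 1: estimate the threshold automatically.
    threshold = estimate_threshold(docs)

    # Pass 2: standard leader-follower scan that builds the cluster vectors.
    # A document joins its most similar cluster if the similarity reaches
    # the threshold; otherwise it becomes the leader of a new cluster.
    # ASSUMPTION: a cluster vector is the sum of its members' vectors,
    # which is enough because cosine similarity ignores vector length.
    cluster_vectors = []
    for doc in docs:
        idx, sim = nearest(doc, cluster_vectors)
        if idx >= 0 and sim >= threshold:
            cluster_vectors[idx] = [c + x for c, x in zip(cluster_vectors[idx], doc)]
        else:
            cluster_vectors.append(list(doc))

    # Pass 3: allocate every document to its most similar final cluster.
    labels = [nearest(doc, cluster_vectors)[0] for doc in docs]
    return labels, cluster_vectors, threshold
```

For example, calling `triple_pass_cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])` yields two clusters with labels `[0, 0, 1, 1]`. The third scan matters because cluster vectors keep changing during the second scan, so an early document may end up closer to a cluster that was created or updated after it was first placed.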
- IEICE technical report. Data engineering
IEICE technical report. Data engineering 113(150), 127-132, 2013-07-22
The Institute of Electronics, Information and Communication Engineers