IJISA Vol. 9, No. 10, 8 Oct. 2017
Keywords: k-means, k-means*, kernel density, seed value, external and internal validity measures
Data clustering is a basic technique for revealing the structure of a data set. K-means is a widely used clustering method that follows a partitional approach, dividing the given data set into non-overlapping groups. Unfortunately, it has the pitfall of choosing the initial cluster centers randomly; owing to its gradient-descent nature, the algorithm is highly sensitive to this initial seed value. In this paper, we propose a kernel density-based method to compute an initial seed value for the k-means algorithm. The idea is to select initial points from denser regions, because such points truly reflect the properties of the overall data set; consequently, we avoid selecting outliers as initial seed values. We have verified the proposed method on real data sets using different internal and external validity measures. The experimental analysis shows that the proposed method outperforms k-means, k-means++, and other recent initialization methods.
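The core idea of the abstract — estimate density with a kernel, then seed k-means from dense regions while skipping outliers — can be sketched as follows. This is a minimal illustration of that idea, not the paper's exact procedure; the function name `kde_seeds`, the Gaussian kernel, the fixed bandwidth, and the density-times-distance scoring rule for later seeds are all assumptions made here for the sketch.

```python
import numpy as np

def kde_seeds(X, k, bandwidth=1.0):
    """Pick k initial k-means centers from dense regions using a
    Gaussian kernel density estimate (illustrative sketch only;
    the paper's exact method may differ)."""
    # pairwise squared distances between all points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian KDE evaluated at every data point (up to a constant)
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).sum(axis=1)
    # first seed: the densest point, never an outlier
    seeds = [int(np.argmax(density))]
    for _ in range(k - 1):
        # distance of each point to its nearest already-chosen seed
        dist = d2[:, seeds].min(axis=1)
        # favor dense points far from existing seeds; low-density
        # outliers score poorly and are not chosen (assumed rule)
        seeds.append(int(np.argmax(density * dist)))
    return X[seeds]

# toy data: two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])
centers = kde_seeds(X, k=2)
```

On this toy data the two returned seeds land near the two blob centers, so a subsequent k-means run starts from representative, non-outlying points.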
Ajay Kumar, Shishir Kumar, "Density Based Initialization Method for K-Means Clustering Algorithm", International Journal of Intelligent Systems and Applications(IJISA), Vol.9, No.10, pp.40-48, 2017. DOI:10.5815/ijisa.2017.10.05