International Journal of Education and Management Engineering (IJEME)

ISSN: 2305-3623 (Print), ISSN: 2305-8463 (Online)

Published By: MECS Press

IJEME Vol.8, No.4, Jul. 2018

An Intelligent Distributed K-means Algorithm over Cloudera/Hadoop

PP. 61-70


Tawseef Ayoub Shaikh, Umar Badr Shafeeque, Maksud Ahamad

Index Terms

Big data; Healthcare informatics; MRI (Magnetic Resonance Imaging); Clustering


The 21st century has brought a tsunami of data generation that has added terms such as “Big Data” to the vocabulary. Digitization has overtaken almost all major sectors and plays a dominant role in the virtual digital world, which has made “Big Data” one of the most debated terms of the present decade. The sheer size of this data and the speed at which it is created have left traditional relational database management systems (RDBMS) handicapped. The need to manage and process this gigantic, complex, heterogeneous data has followed the age-old rule that “necessity is the mother of invention” and led to Hadoop MapReduce. This work applies the K-means clustering algorithm, using WEKA, to a benchmark MRI dataset from the OASIS database in order to cluster the data by visual similarity. The approach worked up to a threshold data size, beyond which WEKA reported an “out of memory” error. To overcome this problem, a Map/Reduce version of K-means is implemented on top of Hadoop using R. The algorithm is evaluated using the speedup, scaleup, and sizeup metrics, and it performs better as the size of the input data increases.
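As a rough illustration of the Map/Reduce formulation of K-means mentioned above, the following is a minimal single-process sketch in Python. It is not the authors' R/Hadoop implementation: the function names (`map_phase`, `combine`, `reduce_phase`, `kmeans`), the in-memory "splits", and the stopping tolerance are all assumptions made for illustration. The three metric helpers use the definitions commonly found in the parallel clustering literature, which the abstract itself does not spell out.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[Point, int]]  # cluster id -> (vector sum, count)


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(split: Iterable[Point], centroids: Sequence[Point]
              ) -> Iterator[Tuple[int, Tuple[Point, int]]]:
    """Mapper: emit (cluster id, (point, 1)) for every point in one input split."""
    for point in split:
        yield nearest(point, centroids), (point, 1)


def combine(pairs: Iterable[Tuple[int, Tuple[Point, int]]]) -> Partial:
    """Combiner: add up vectors and counts per cluster id."""
    sums: Partial = {}
    for cid, (vec, n) in pairs:
        if cid in sums:
            acc, m = sums[cid]
            sums[cid] = (tuple(a + v for a, v in zip(acc, vec)), m + n)
        else:
            sums[cid] = (tuple(vec), n)
    return sums


def reduce_phase(partials: Iterable[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Reducer: merge partial sums from all splits and compute the new means."""
    merged = combine((cid, val) for part in partials for cid, val in part.items())
    updated = list(centroids)  # an empty cluster keeps its previous centroid
    for cid, (acc, n) in merged.items():
        updated[cid] = tuple(a / n for a in acc)
    return updated


def kmeans(splits: Sequence[Sequence[Point]], centroids: Sequence[Point],
           max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Repeat map -> combine -> reduce until no centroid moves more than `tol`."""
    current = list(centroids)
    for _ in range(max_iter):
        partials = [combine(map_phase(split, current)) for split in splits]
        updated = reduce_phase(partials, current)
        shift = max(sum((a - b) ** 2 for a, b in zip(old, new))
                    for old, new in zip(current, updated))
        current = updated
        if shift <= tol:
            break
    return current


# Evaluation metrics (conventional definitions, assumed here; all inputs are run times).
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same data set, 1 node versus m nodes."""
    return t_one_node / t_m_nodes


def scaleup(t_one_node_base_data: float, t_m_nodes_m_times_data: float) -> float:
    """Data grows m-fold together with the node count; 1.0 is ideal."""
    return t_one_node_base_data / t_m_nodes_m_times_data


def sizeup(t_base_data: float, t_m_times_data: float) -> float:
    """Fixed node count, data grows m-fold; lower growth is better."""
    return t_m_times_data / t_base_data
```

In a Hadoop job, each split would be processed by a separate mapper and only the per-cluster sums and counts would cross the network, which is why the combiner step matters for large inputs. Here the splits are ordinary Python lists, so the sketch shows the data flow only, not the distributed execution.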

Cite This Paper

Tawseef Ayoub Shaikh, Umar Badr Shafeeque, Maksud Ahamad, "An Intelligent Distributed K-means Algorithm over Cloudera/Hadoop", International Journal of Education and Management Engineering (IJEME), Vol.8, No.4, pp.61-70, 2018. DOI: 10.5815/ijeme.2018.04.06

