RSKD Ensemble Classifier with Stable Ensemble Feature Selection for High Dimensional Low Sample Size Cancer Datasets


Author(s)

Archana Suhas Vaidya 1,*, Dipak V. Patil 1

1. GES’s R. H. Sapat College of Engineering, Management Studies and Research, Nashik 422005, affiliated to Savitribai Phule Pune University, Pune, Maharashtra, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2025.02.05

Received: 5 Oct. 2024 / Revised: 17 Dec. 2024 / Accepted: 5 Feb. 2025 / Published: 8 Apr. 2025

Index Terms

High Dimensional, Low Sample Dataset, Ensemble Feature Selection, Stability Analysis, Ensemble Classifier, Heterogeneous Ensemble

Abstract

This study presents the RSKD ensemble classifier, developed with ensemble feature selection techniques, to address high-dimensional, low-sample-size cancer datasets. Ensemble classifiers are advantageous in such scenarios: by combining multiple models, they offer better classification accuracy than traditional methods and enhance predictive performance on high-dimensional data. However, stability, a key factor for consistent performance on unseen data, often involves a tradeoff with accuracy. Owing to their generalization capabilities, ensemble methods exhibit higher stability; in this study, feature selection stability, measured with a consistency index, averages 65–70%.
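The consistency index mentioned above quantifies how much the feature subsets selected on different data samples overlap. One widely used formulation is Kuncheva's consistency index, which corrects the raw overlap for chance agreement. Assuming the abstract's consistency index is of this general form (the exact variant is not specified here), a minimal sketch:

```python
from itertools import combinations

def kuncheva_index(a, b, n):
    """Chance-corrected consistency between two equal-size feature subsets
    a and b, each of size k, drawn from n total features.
    Returns 1.0 for identical subsets; negative values are possible
    when the overlap is below what chance alone would produce."""
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k and 0 < k < n
    r = len(a & b)  # size of the intersection
    return (r * n - k * k) / (k * (n - k))

def average_stability(subsets, n):
    """Average pairwise consistency over all subsets selected across
    resampled runs -- the usual way a single stability score is reported."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_index(a, b, n) for a, b in pairs) / len(pairs)
```

For example, three runs selecting `[0, 1]`, `[0, 1]`, and `[0, 2]` out of 5 features give an average stability of about 0.44: one fully consistent pair and two partially consistent ones.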
The RSKD classifier integrates the ensemble feature selection methods SU-R and ChS-R, which enhance both feature selection stability and classification accuracy. Its performance was evaluated on seven high-dimensional, low-sample-size datasets and compared against state-of-the-art classifiers, including AdaBoost, GradientBoost, REPTree, asBagging_FSS, SRKNN, MF-GE, and eAdaBoost with DSC. The RSKD ensemble classifier achieved an accuracy improvement of 7.69% to 12.35% over these methods. Of the two feature selection approaches, SU-R combined with RSKD outperformed ChS-R, demonstrating superior results in cancer prediction tasks.
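Ensemble feature selectors such as SU-R and ChS-R combine the judgments of more than one filter before choosing a final subset. As an illustration only (the paper's exact aggregation scheme is not reproduced here, and the feature names below are hypothetical), one common scheme ranks features under each filter and keeps the top-k by mean rank:

```python
def rank(scores):
    """Convert a feature -> relevance-score dict to ranks (1 = most relevant)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {f: i + 1 for i, f in enumerate(ordered)}

def aggregate_top_k(score_lists, k):
    """Mean-rank aggregation of several filter rankings over the same
    feature set; returns the top-k features, sorted by name."""
    ranks = [rank(s) for s in score_lists]
    mean_rank = {f: sum(r[f] for r in ranks) / len(ranks) for f in ranks[0]}
    return sorted(sorted(mean_rank, key=mean_rank.get)[:k])

# Hypothetical scores from two filters (e.g. symmetric uncertainty and chi-square)
su  = {'g1': 0.9, 'g2': 0.1, 'g3': 0.5, 'g4': 0.4}
chs = {'g1': 0.3, 'g2': 0.2, 'g3': 0.9, 'g4': 0.1}
selected = aggregate_top_k([su, chs], 2)  # features ranked highly by both filters
```

Aggregating across filters in this way is what drives the stability gain: a feature that scores well under only one criterion is less likely to survive than one that is consistently relevant.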
The findings of this study underscore the potential of RSKD for achieving generalized, robust performance on challenging datasets. By leveraging ensemble classifiers and ensemble feature selection techniques, researchers can address the inherent difficulties of high-dimensional, low-sample-size datasets, enhancing both accuracy and stability. This work provides a valuable foundation for developing diverse, heterogeneous ensemble approaches for cancer prediction and similar applications.

Cite This Paper

Archana Suhas Vaidya, Dipak V. Patil, "RSKD Ensemble Classifier with Stable Ensemble Feature Selection for High Dimensional Low Sample Size Cancer Datasets", International Journal of Information Technology and Computer Science (IJITCS), Vol. 17, No. 2, pp. 49-59, 2025. DOI: 10.5815/ijitcs.2025.02.05
