Work place: University of Engineering & Management, Kolkata, West Bengal, India
E-mail: anirban-das@live.com
By Avijit Kumar Chaudhuri, Arkadip Ray, Dilip K. Banerjee, Anirban Das
DOI: https://doi.org/10.5815/ijisa.2021.05.05, Pub. Date: 8 Oct. 2021
Cervical cancer is the fourth most prevalent cancer in women; it claimed 341,831 lives and accounted for 604,127 new cases worldwide in 2020. Early detection of the disease is essential to reduce this vast mortality. A fast, accurate, and interpretable machine learning model remains an open research goal, and fewer features reduce the computational effort and improve interpretability. A 3-Stage Hybrid feature selection approach and a Stacked Classification model are evaluated on the cervical cancer dataset from the UCI Machine Learning Repository, which has 35 features and one outcome variable. Stage-1 uses a Genetic Algorithm and Logistic Regression architecture for feature selection and selects twelve features that are well correlated with the class but not with one another. Stage-2 applies the same Genetic Algorithm and Logistic Regression architecture to narrow the set to five features. In Stage-3, Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), Extra Trees (ET), Random Forest (RF), and Gradient Boosting (GDB) use the five features to identify patients with or without cancer. Data splitting, several metrics, and statistical tests, along with 10-fold cross-validation, support a comparative analysis. LR, NB, SVM, ET, RF, and GDB all improve across performance measures when the number of features is reduced to five. In the 66-34 split, every method except NB recorded 97% accuracy with five features. The Stacked model also produced better than 96% accuracy with five features in the 66-34 and 80-20 splits and in 10-fold cross-validation. Various performance aggregators show improved results with reduced features compared to previous studies. Finally, with approximately 100% performance in classification results, the suggested ensemble model showed its promise.
The output results were compared to those of other studies on the same dataset, and the proposed classifiers were found to be the most effective across all performance dimensions.
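The Stage-3 stacking step described above can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the authors' implementation: synthetic data stands in for the five selected cervical cancer features, and all hyper-parameters are assumed defaults.

```python
# Hedged sketch of a stacked ensemble over the six base learners named in the
# abstract (LR, NB, SVM, ET, RF, GDB). Synthetic data replaces the UCI dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for the five features produced by Stages 1-2 of the paper.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34,
                                          random_state=42)  # the 66-34 split

base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("svm", SVC(probability=True)),
    ("et", ExtraTreesClassifier(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("gdb", GradientBoostingClassifier(random_state=42)),
]
# A logistic-regression meta-learner combines the base predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.2f}")
```

The meta-learner and internal cross-validation folds are scikit-learn defaults; the paper's exact stacking configuration may differ.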
By Avijit Kumar Chaudhuri, Dilip K. Banerjee, Anirban Das
DOI: https://doi.org/10.5815/ijisa.2021.04.03, Pub. Date: 8 Aug. 2021
The World Health Organisation has declared breast cancer (BC) the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Its accurate prediction is of utmost significance, as it not only prevents deaths but also stops mistreatment. The conventional way of diagnosis includes estimating tumor size as a sign of plausible cancer. Machine learning (ML) techniques have shown effectiveness in predicting disease; however, ML methods have been method-centric rather than dataset-centric. In this paper, the authors introduce a dataset-centric approach (DCA) that deploys a genetic algorithm (GA) to identify the features and a learning ensemble classifier to predict using the right features. AdaBoost is such an approach: it trains the model by assigning weights to individual records rather than experimenting on dataset splits alone, and it performs hyper-parameter optimization. The authors simulate the results by varying base classifiers, i.e., logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), with 10-fold cross-validation and different training/testing splits of the dataset. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% performance in the classification results, which no prior research had reported. The DCA satisfies the underlying principles of data mining: parsimony, inclusion, discrimination, and optimality. The DCA is a democratic and unbiased ensemble approach, as it allows all features and methods to compete at the start but filters out the most reliable chain of steps and combinations that gives the highest accuracy. With fewer characteristics, splits of 50-50 and 66-34, and 10-fold cross-validation, the Stacked model achieves 97% accuracy.
These values and the reduction of features improve upon prior research works.
Further, the proposed classifier is compared with some state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support-vector machine with radial basis function kernel, and decision tree. For testing the classifiers, different performance metrics have been employed – accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, area under the curve, and some statistical tests such as the Wilcoxon signed-rank test and kappa statistics – to check the strength of the proposed DCA classifier. Various splits of training and testing data – namely 50–50%, 66–34%, 80–20%, and 10-fold cross-validation – have been incorporated in this research to test the credibility of the classification models in handling the unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The output results have also been compared with other research on the same dataset, where the proposed classifiers were found to be the best across all the performance dimensions.
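The evaluation metrics listed above can be computed with standard library calls. The sketch below uses small hypothetical prediction vectors and made-up per-fold accuracies purely to show the mechanics; none of the numbers come from the papers.

```python
# Hedged sketch of the abstract's evaluation metrics on hypothetical data.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

# Hypothetical ground truth and one classifier's predictions (illustrative).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # detection rate / recall
specificity = tn / (tn + fp)
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance
auc = roc_auc_score(y_true, y_pred)         # area under the ROC curve

# Wilcoxon signed-rank test on paired per-fold accuracies of two models
# (made-up values; tests whether the difference is statistically significant).
acc_model_a = [0.95, 0.96, 0.97, 0.94, 0.96, 0.95, 0.97, 0.96, 0.95, 0.96]
acc_model_b = [0.91, 0.92, 0.93, 0.90, 0.92, 0.91, 0.93, 0.92, 0.91, 0.92]
stat, p_value = wilcoxon(acc_model_a, acc_model_b)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"kappa={kappa:.2f} AUC={auc:.2f} Wilcoxon p={p_value:.4f}")
```

Per-fold accuracy pairs make the Wilcoxon test a paired comparison of the two models over the same cross-validation folds, which matches how such tests are typically applied in classifier-comparison studies.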