Enhanced Phishing URLs Detection using Feature Selection and Machine Learning Approaches

Author(s)

Dharmaraj R. Patil 1,*, Rajnikant B. Wagh 1, Vipul D. Punjabi 1, Shailendra M. Pardeshi 1

1. Department of Computer Engineering, R.C.Patel Institute of Technology, Shirpur, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijwmt.2024.06.04

Received: 10 Jun. 2024 / Revised: 15 Aug. 2024 / Accepted: 17 Oct. 2024 / Published: 8 Dec. 2024

Index Terms

Phishing Website Detection, Web Security, Feature Selection, Machine Learning, Cyber Security

Abstract

Phishing attacks continue to compromise online security by using deceptive URLs to lure users into disclosing sensitive information. This paper presents a method for detecting phishing URLs that uses feature selection to improve both the accuracy and the efficiency of the detection system. The approach identifies the most relevant features from a comprehensive set and then applies a range of machine learning algorithms: Decision Trees, XGBoost, Random Forest, Extra Trees, Logistic Regression, AdaBoost, and K-Nearest Neighbors. Features are ranked and selected using information gain, information gain ratio, and chi-square (χ²). The evaluation results are promising and suggest the approach has the potential to surpass existing methods. The Extra Trees classifier combined with chi-square feature selection achieved 98.23% on each of accuracy, precision, recall, and F-measure while using only 28 of the 48 available features. Selecting an optimal feature subset therefore reduces computational demands while also improving the effectiveness of phishing URL detection.
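As a rough illustration of the chi-square selection step described in the abstract, the sketch below scores each feature by the Pearson chi-square statistic of its contingency table against the class label and keeps the top-ranked ones. It is a minimal standard-library sketch, not the authors' implementation: the feature names, the toy data, and the helper names `chi_square_score` and `select_top_k` are assumptions made for illustration, and the paper's own tooling may compute the statistic differently.

```python
from collections import Counter
from typing import Dict, List, Sequence


def chi_square_score(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Pearson chi-square statistic between a categorical feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))   # joint counts per (value, class)
    feature_totals = Counter(feature)          # marginal counts per feature value
    label_totals = Counter(labels)             # marginal counts per class
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_count * l_count / n
            diff = observed[(f_value, l_value)] - expected
            score += diff * diff / expected
    return score


def select_top_k(columns: Dict[str, Sequence[int]],
                 labels: Sequence[int], k: int) -> List[str]:
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "has_ip_address": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the label
    "has_at_symbol":  [1, 1, 1, 0, 0, 0, 0, 1],  # partly aligned with the label
    "uses_https":     [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

selected = select_top_k(columns, labels, k=2)
print(selected)
```

In the setting the abstract describes, the same ranking would be applied to all 48 features with `k=28`, and the retained columns would then be passed to a classifier such as Extra Trees.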

Cite This Paper

Dharmaraj R. Patil, Rajnikant B. Wagh, Vipul D. Punjabi, Shailendra M. Pardeshi, "Enhanced Phishing URLs Detection using Feature Selection and Machine Learning Approaches", International Journal of Wireless and Microwave Technologies (IJWMT), Vol. 14, No. 6, pp. 48-67, 2024. DOI: 10.5815/ijwmt.2024.06.04
