Agile Intelligent Software Solution for Textual Content Authorship Identification Based on NLP, Artificial Intelligence and Machine Learning

PDF (3532KB), PP.15-66

Author(s)

Zhengbing Hu 1 Victoria Vysotska 2,3 Lyubomyr Chyrun 4 Roman Romanchuk 2 Yuriy Ushenko 5,6,* Dmytro Uhryn 6 Cennuo Hu 7

1. School of Computer Science, Hubei University of Technology, Wuhan, China

2. Department of Information Systems and Networks, Lviv Polytechnic National University, Lviv, 79013, Ukraine

3. Osnabrück University, Osnabrück, 49076, Germany

4. Ivan Franko National University of Lviv, Lviv, 79000, Ukraine

5. Department of Physics, Shaoxing University, Shaoxing, Zhejiang Province 312000, China

6. Department of Computer Science, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, 58012, Ukraine

7. Department of Computer Science, College of Science, Purdue University, West Lafayette, IN 47907, USA

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2025.02.02

Received: 16 Jan. 2025 / Revised: 22 Feb. 2025 / Accepted: 18 Mar. 2025 / Published: 8 Apr. 2025

Index Terms

Author's Style, Machine Learning, Authorship Identification, Stylometry, NLP, Artificial Intelligence, Information Technology

Abstract

The main goal of this work is to create an intelligent system that uses NLP methods and machine learning algorithms to analyse and classify the authorship of textual content. The following machine learning models were trained and tested on the dataset for English and Ukrainian publications: Support Vector Classifier, Random Forest, Naive Bayes, Logistic Regression and Neural Networks. For English, the accuracy of the models was higher because more text data was available. The results for English fiction publications show that the Neural Network classifier outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.97), recall (0.96), F1 score (0.98), and precision (0.96). This indicates that Neural Networks are particularly effective at capturing the distinctive features of the writing styles of different English authors in scientific and technical texts. For Ukrainian, accuracy drops by 5-10% because fewer text corpora are available for training. The results for Ukrainian scientific and technical publications show that the Random Forest classifier outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.88), recall (0.87), F1 score (0.87), and precision (0.87). This indicates that Random Forest is particularly effective at capturing the distinctive features of the writing styles of different Ukrainian authors in scientific and technical texts. The other models were considerably less accurate: Support Vector Classifier (77%), Logistic Regression (73%) and Naive Bayes (70%). The results for Ukrainian fiction publications show that the Random Forest classifier again outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.85), recall (0.84), F1 score (0.84), and precision (0.84). Here too the other models were considerably less accurate: Support Vector Classifier (77%), Logistic Regression (73%) and Naive Bayes (70%).
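The four metrics reported in the abstract can be illustrated with a minimal sketch. It assumes macro-averaging over authors, which the abstract does not specify, and the function name `macro_scores` and the toy author labels are hypothetical, not taken from the paper. It uses only the Python standard library.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Assumption: every author (class) is weighted equally; the paper's
    actual averaging scheme is not stated in the abstract.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Per-author counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_author, predicted_author in zip(y_true, y_pred):
        if true_author == predicted_author:
            tp[true_author] += 1
        else:
            fp[predicted_author] += 1
            fn[true_author] += 1

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        # F1 is the harmonic mean of precision and recall for this author.
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two hypothetical authors.
scores = macro_scores(
    ["author_a", "author_a", "author_b", "author_b"],
    ["author_a", "author_b", "author_b", "author_b"],
)
```

For each author, the F1 score is the harmonic mean of precision and recall, so it always lies between those two values. Under macro-averaging the averaged F1 need not lie between the averaged precision and recall.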

Cite This Paper

Zhengbing Hu, Victoria Vysotska, Lyubomyr Chyrun, Roman Romanchuk, Yuriy Ushenko, Dmytro Uhryn, Cennuo Hu, "Agile Intelligent Software Solution for Textual Content Authorship Identification Based on NLP, Artificial Intelligence and Machine Learning", International Journal of Modern Education and Computer Science (IJMECS), Vol. 17, No. 2, pp. 15-66, 2025. DOI: 10.5815/ijmecs.2025.02.02
