Oleh Prokipchuk; Victoria Vysotska; Petro Pukach; Vasyl Lytvyn; Dmytro Uhryn; Yuriy Ushenko; Zhengbing Hu

Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology

Pages: 70-93

Author(s)

Oleh Prokipchuk ¹,*, Victoria Vysotska ¹, Petro Pukach ¹, Vasyl Lytvyn ¹, Dmytro Uhryn ², Yuriy Ushenko ², Zhengbing Hu ³

1. Lviv Polytechnic National University, Lviv, 79013, Ukraine

2. Yuriy Fedkovych Chernivtsi National University, Chernivtsi, 58012, Ukraine

3. School of Computer Science, Hubei University of Technology, Wuhan, China

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2023.03.06

Received: 13 Jan. 2023 / Revised: 16 Feb. 2023 / Accepted: 19 Mar. 2023 / Published: 8 Jun. 2023

Index Terms

Public Opinion, trend, clustering, stemming, similarity of clusters, Bag of Words, TF-IDF, BERT, K-Means, Agglomerative Hierarchical Clustering

Abstract

The article develops a technology for detecting tweet trends based on clustering. The technology produces a data stream of short cluster representations, together with their popularity, for further public opinion research. The accuracy of the results depends on the natural-language features of the tweet stream. An effective approach to tweet collection, filtering, cleaning and pre-processing is described, based on a comparative analysis of the Bag of Words, TF-IDF and BERT algorithms. The impact of stemming and lemmatization on the quality of the obtained clusters was determined: they reduce the input vocabulary of Ukrainian words by 40.21% and 32.52% respectively. Optimal combinations of clustering methods (K-Means, Agglomerative Hierarchical Clustering and HDBSCAN) and tweet vectorization methods were found by analysing 27 clusterings of one data sample. A method for presenting clusters of tweets in a short format was selected. Algorithms that use the Levenshtein distance, i.e. fuzz sort, fuzz set and Levenshtein, showed the best results. These algorithms run quickly and show a larger difference between the similarity scores of similar and different clusters, which makes it possible to set the similarity threshold more accurately. According to the clustering results, the HDBSCAN clustering algorithm combined with BERT vectorization gives the most accurate results, while K-Means combined with TF-IDF gives the best speed with acceptable quality. Stemming can be used to reduce execution time. The best options for comparing cluster fingerprints were found experimentally among the following similarity search methods: Fuzz Sort, Fuzz Set, Levenshtein, Jaro Winkler, Jaccard, Sorensen, Cosine and Sift4. For some algorithms, the average fingerprint similarity exceeds 70%.
Three tools proved effective for comparing fingerprint similarity, since each shows a sufficient difference (> 20%) between comparisons of similar and different clusters.
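The "difference in similarities" criterion can be illustrated with a minimal sketch. It assumes the difference is the mean similarity of fingerprint pairs from the same trend minus the mean similarity of pairs from different trends; the abstract does not give the exact formula. Jaccard similarity is used only because it is short to write in plain Python, and the function names and toy English keywords are hypothetical.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the keyword sets of two fingerprints."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_gap(groups, similarity=jaccard) -> float:
    """Mean same-trend similarity minus mean different-trend similarity.

    `groups` is a list of lists; each inner list holds fingerprints that
    are assumed (hypothetical labelling) to describe the same trend.
    """
    labelled = [(g, fp) for g, fps in enumerate(groups) for fp in fps]
    same, different = [], []
    for (g1, fp1), (g2, fp2) in combinations(labelled, 2):
        (same if g1 == g2 else different).append(similarity(fp1, fp2))
    return mean(same) - mean(different)


# Toy fingerprints, invented for illustration only.
groups = [
    ["tank leopard germany", "leopard tank delivery"],
    ["johnson visit kyiv", "boris johnson kyiv"],
]
gap = similarity_gap(groups)
print(f"difference in similarities: {gap:.2%}")
```

Under this reading, a method is useful for trend matching when the gap is comfortably above the 20% level mentioned above, because a threshold can then be placed between the two groups of scores.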
The experimental testing was based on the analysis of 90,000 tweets collected over 7 days for 5 different weekly topics: President Volodymyr Zelenskyi, Leopard tanks, Boris Johnson, Europe, and the bright memory of the deceased. The following combinations of clustering and vectorization methods were used: K-Means with TF-IDF, Agglomerative Hierarchical Clustering with TF-IDF, and HDBSCAN with BERT. Additionally, fuzz sort was implemented for comparing cluster fingerprints with a similarity threshold of 55%. The best methods for comparing fingerprints were fuzz sort, fuzz set and Levenshtein. Among these three, Levenshtein was the fastest; the other two were about three times slower, but still nearly 13 times faster than Sift4. The fastest method overall is Jaro Winkler, but its difference in similarities is only 19.51%. The method with the largest difference in similarities is fuzz set (60.29%), followed by fuzz sort (32.28%) and Levenshtein (28.43%). All three rely on the Levenshtein distance, which indicates that this approach works well for comparing sets of keywords. The other algorithms fail to show significant differences between different fingerprints, suggesting that they are not suited to this type of task.
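Fingerprint matching with a 55% threshold can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the tokens of each fingerprint are sorted alphabetically and compared with a length-normalized Levenshtein distance, which mirrors the idea behind fuzz sort but is not the formula used by any particular fuzzy-matching library. The function names and example keywords are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after sorting tokens, so word order is ignored."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    longest = max(len(sorted_a), len(sorted_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sorted_a, sorted_b) / longest


def same_trend(a: str, b: str, threshold: float = 0.55) -> bool:
    """Treat two fingerprints as one trend if similarity reaches the threshold."""
    return token_sort_similarity(a, b) >= threshold


# Toy fingerprints, invented for illustration only.
print(same_trend("leopard tank germany", "germany tank leopard"))
print(same_trend("leopard tank", "boris johnson"))
```

Sorting the tokens first matters because a fingerprint is a set of keywords rather than a sentence, so two fingerprints with the same words in a different order should count as identical.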

Cite This Paper

Oleh Prokipchuk, Victoria Vysotska, Petro Pukach, Vasyl Lytvyn, Dmytro Uhryn, Yuriy Ushenko, Zhengbing Hu, "Intelligent Analysis of Ukrainian-language Tweets for Public Opinion Research based on NLP Methods and Machine Learning Technology", International Journal of Modern Education and Computer Science (IJMECS), Vol.15, No.3, pp. 70-93, 2023. DOI: 10.5815/ijmecs.2023.03.06

International Journal of Modern Education and Computer Science (IJMECS)