Modelling Misinformation in Swahili-English Code-switched Texts

PDF (837KB), PP.67-80

Author(s)

Cynthia Amol 1,*, Lilian Wanzare 1, James Obuhuma 1

1. Maseno University, Department of Computer Science, Maseno, Kenya

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2025.01.05

Received: 22 Jul. 2024 / Revised: 8 Oct. 2024 / Accepted: 21 Dec. 2024 / Published: 8 Feb. 2025

Index Terms

Code-switching, Swahili, Low-resource Languages, Misinformation, Text Classification

Abstract

Code-switching, the mixing of words or phrases from multiple, grammatically distinct languages, introduces semantic and syntactic complexities that complicate automated text classification. Although code-switching is common in informal text-based communication among most bilingual or multilingual users of digital spaces, its use to spread misinformation remains relatively unexplored. In Kenya, for instance, code-switched Swahili-English is prevalent on social media. Our main objective in this paper was to systematically review code-switching, particularly the use of Swahili-English code-switching to spread misinformation on social media in the Kenyan context. Additionally, we aimed to pre-process a Swahili-English code-switched dataset and develop a misinformation classification model trained on this dataset. We discuss the process we followed to develop the code-switched Swahili-English misinformation classification model. The model was trained and tested using the PolitiKweli dataset, which is the first Swahili-English code-switched dataset curated for misinformation classification. The dataset was collected from the Twitter (now X) social media platform, focusing on text posted during the electioneering period of the 2022 general elections in Kenya. The study experimented with two types of word embeddings, GloVe and FastText. FastText uses character n-gram representations that help generate meaningful vectors for rare and unseen words in the code-switched dataset. We experimented with both classical machine learning and deep learning algorithms. The Bidirectional Long Short-Term Memory (BiLSTM) network showed the best performance, with an F-score of 0.89. The model was able to classify code-switched Swahili-English political misinformation text as fake, fact or neutral. This study contributes to recent research efforts in developing language models for low-resource languages.
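
The role of character n-grams for rare and unseen words can be illustrated with a minimal sketch. It is not the authors' code: the function names `char_ngrams` and `shared_ngrams` are hypothetical, and the words "vote" and "nimevote" (an English stem carrying Swahili prefixes) are illustrative examples, not items taken from the PolitiKweli dataset. The sketch only extracts FastText-style n-grams with `<` and `>` boundary markers; the actual FastText model additionally keeps the whole word as its own unit, hashes the n-grams, and sums their learned vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal character sequences.
    The 3..6 range follows the FastText paper's default setting.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def shared_ngrams(a, b, n_min=3, n_max=6):
    """Return the sorted n-grams that two words have in common."""
    grams_a = set(char_ngrams(a, n_min, n_max))
    grams_b = set(char_ngrams(b, n_min, n_max))
    return sorted(grams_a & grams_b)


# Illustrative (assumed) words: an English stem and a code-switched form
# in which the same stem carries Swahili prefixes.
known_word = "vote"
unseen_word = "nimevote"

# The overlap is what lets a subword model build a vector for the unseen form.
print(shared_ngrams(known_word, unseen_word, n_min=3, n_max=4))
```

Because the unseen form shares several n-grams with the known stem, a subword model can place it near related words even when the full token never occurred in training, which a purely word-level embedding such as GloVe cannot do.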

Cite This Paper

Cynthia Amol, Lilian Wanzare, James Obuhuma, "Modelling Misinformation in Swahili-English Code-switched Texts", International Journal of Information Technology and Computer Science (IJITCS), Vol.17, No.1, pp.67-81, 2025. DOI: 10.5815/ijitcs.2025.01.05
