Workplace: Maseno University, Department of Computer Science, Maseno, Kenya
E-mail: cynthia@maseno.ac.ke
Website:
Research Interests:
Biography
Cynthia Jayne Amol is a final-year Master's student in Computer Science at Maseno University, Kenya, and holds a Bachelor of Science in Computer Science from Maseno University. An AI enthusiast, she is a Natural Language Processing (NLP) researcher working on low-resource languages and has contributed to dataset curation and language modelling for African languages. She is also an experienced systems and data analyst working at Maseno University's ICT Directorate.
By Cynthia Amol, Lilian Wanzare, James Obuhuma
DOI: https://doi.org/10.5815/ijitcs.2025.01.05, Pub. Date: 8 Feb. 2025
Code-switching, the mixing of words or phrases from multiple, grammatically distinct languages, introduces semantic and syntactic complexities that complicate automated text classification. Although code-switching is common in informal text-based communication among bilingual and multilingual users of digital spaces, its use to spread misinformation remains relatively unexplored. In Kenya, for instance, code-switched Swahili-English is prevalent on social media. Our main objective in this paper was to systematically review code-switching, particularly the use of Swahili-English code-switching to spread misinformation on social media in the Kenyan context. Additionally, we aimed to pre-process a Swahili-English code-switched dataset and develop a misinformation classification model trained on this dataset. We discuss the process we followed to develop the code-switched Swahili-English misinformation classification model. The model was trained and tested using the PolitiKweli dataset, which is the first Swahili-English code-switched dataset curated for misinformation classification. The dataset was collected from the Twitter (now X) social media platform, focusing on text posted during the electioneering period of the 2022 general elections in Kenya. The study experimented with two types of word embeddings: GloVe and FastText. FastText uses character n-gram representations that help generate meaningful vectors for rare and unseen words in the code-switched dataset. We experimented with both classical machine learning algorithms and deep learning algorithms. The Bidirectional Long Short-Term Memory (BiLSTM) algorithm showed the best performance with an F-score of 0.89. The model was able to classify code-switched Swahili-English political misinformation text as fake, fact or neutral. This study contributes to recent research efforts in developing language models for low-resource languages.
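To illustrate the kind of model the abstract describes, below is a minimal sketch of a BiLSTM text classifier built on pretrained word embeddings. It is written in PyTorch as an assumption for illustration only; the layer sizes, class names, preprocessing, and training details are hypothetical and do not reproduce the authors' actual implementation or the PolitiKweli pipeline.

```python
# Illustrative sketch only: a BiLSTM classifier over pretrained word embeddings
# (e.g. FastText-style vectors, whose subword n-grams help cover rare
# code-switched tokens), predicting three labels: fake, fact, neutral.
# All hyperparameters here are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=128, num_classes=3, dropout=0.3):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        # Initialize the embedding layer from a pretrained matrix; fine-tuning is allowed.
        self.embedding = nn.Embedding.from_pretrained(
            embedding_matrix, freeze=False, padding_idx=0
        )
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices into the vocabulary
        embedded = self.embedding(token_ids)              # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.bilstm(embedded)            # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat((hidden[0], hidden[1]), dim=1)  # (batch, 2 * hidden_dim)
        return self.fc(self.dropout(final))               # logits over {fake, fact, neutral}

if __name__ == "__main__":
    # Toy random "pretrained" embeddings stand in for real GloVe/FastText vectors.
    vocab_size, embed_dim = 5000, 100
    embedding_matrix = torch.randn(vocab_size, embed_dim)
    model = BiLSTMClassifier(embedding_matrix)
    batch = torch.randint(1, vocab_size, (4, 20))  # 4 sequences of 20 token ids
    logits = model(batch)
    print(logits.shape)  # torch.Size([4, 3])
```

In a setup like this, swapping GloVe for FastText amounts to changing the pretrained matrix passed to the model; the classifier itself is unchanged, which is one reason embedding choice can be compared cleanly in such experiments.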