Psychosocial Features for Hate Speech Detection in Code-switched Texts

Full Text (PDF, 595KB), PP.29-47


Author(s)

Edward Ombui 1,* Lawrence Muchemi 2 Peter Wagacha 2

1. School of Science and Technology, Africa Nazarene University, Nairobi, Kenya

2. School of Computing and Informatics, University of Nairobi, Nairobi, Kenya

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2021.06.03

Received: 31 May 2021 / Revised: 24 Jul. 2021 / Accepted: 9 Aug. 2021 / Published: 8 Dec. 2021

Index Terms

Hate Speech, Classification, Code-switching, Feature selection, Machine learning

Abstract

This study examines the problem of hate speech identification in code-switched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study takes a novel, hierarchical approach to this challenge: it employs Latent Dirichlet Allocation to generate topic models, which in turn support a high-level psychosocial feature set that we abbreviate as PDC. PDC groups words of similar meaning into word families, which is significant in capturing code-switching during the preprocessing stage for supervised learning models. The high-level PDC features are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Frequency-based models using the PDC features, applied to a dataset of tweets generated during the 2012 and 2017 presidential elections in Kenya, achieve an F-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in two respects. First, it publicly shares a unique code-switched hate speech dataset that is valuable for comparative studies. Second, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in code-switched data, which conventional methods could not adequately identify.
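The hierarchical pipeline the abstract describes (a topic model over token counts feeding high-level features into a frequency-based classifier, evaluated by precision/recall/F-score) can be sketched with scikit-learn, which the paper itself uses [62]. This is an illustrative sketch only, not the authors' code: the toy code-switched texts, the number of topics, and the logistic-regression classifier are all assumptions standing in for the paper's PDC construction and nine trained models.

```python
# Illustrative sketch (not the authors' pipeline): (1) fit an LDA topic model
# over token counts so that related words -- ideally including code-switched
# near-synonyms -- cluster into the same topic ("word family"), (2) use the
# per-document topic proportions as high-level features for a frequency-based
# classifier, and (3) report precision, recall, and F1.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical code-switched examples, for illustration only.
texts = [
    "hao watu ni wezi kabisa",          # derogatory generalization
    "tupeleke maendeleo kwa wananchi",  # neutral political talk
    "those people are thieves kabisa",
    "development for all citizens",
] * 10
labels = [1, 0, 1, 0] * 10  # 1 = hate, 0 = not hate

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("topics", LatentDirichletAllocation(n_components=5, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)
pred = pipeline.predict(texts)
p, r, f1, _ = precision_recall_fscore_support(labels, pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

As a sanity check on the reported figures, F1 = 2PR/(P + R) = 2(0.81)(0.85)/(0.81 + 0.85) ≈ 0.83, consistent with the abstract's 83%.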

Cite This Paper

Edward Ombui, Lawrence Muchemi, Peter Wagacha, "Psychosocial Features for Hate Speech Detection in Code-switched Texts", International Journal of Information Technology and Computer Science (IJITCS), Vol.13, No.6, pp.29-47, 2021. DOI:10.5815/ijitcs.2021.06.03

References

[1]E. Ombui, L. Muchemi, and M. Karani, “Annotation Framework for Hate Speech Identification in Tweets: Case Study of Tweets during Kenyan Elections.”
[2]R. Sternberg and K. Sternberg, “The Duplex Theory of Hate I: The Triangular Theory of the Structure of Hate. In The Nature of Hate,” Cambridge Univ. Press, pp. 51–77, 2008.
[3]A. Des Forges, “Leave None To Tell The Story: Genocide in Rwanda,” New York Hum. Rights Watch, 1999.
[4]S. Benesch, “Dangerous Speech: A Proposal to Prevent Group Violence,” 2012.
[5]L. Silva, M. Mondal, D. Correa, F. Benevenuto, and I. Weber, “Analyzing the Targets of Hate in Online Social Media,” in Tenth International AAAI Conference on Web and Social Media, 2016, pp. 687–690.
[6]R. Hatzipanagos, “How online hate turns into real-life violence,” The Washington Post, Washington, 30-Nov-2018.
[7]R. Ajulu, “Politicised Ethnicity, Competitive Politics and Conflict in Kenya: A Historical Perspective,” Afr. Stud., vol. 61, no. 2, pp. 251–268, 2002.
[8]P. Makori, “Whatsapp admins face jail in crackdown to curb hate-speech,” Business Today, 17-Jul-2017.
[9]S. Madonsela, “A critical analysis of the use of code-switching in Nhlapho’s novel Imbali YemaNgcamane,” South African J. African Lang., vol. 34, no. 2, pp. 167–174, 2014.
[10]E. Ombui and L. Muchemi, “Wiring Kenyan Languages for the Global Virtual Age: An audit of the Human Language Technology Resources,” Int. J. Sci. Res. Innov. Technol., vol. 2, no. 2, pp. 35–42, 2015.
[11]M. Karani, E. Ombui, and A. Gichamba, “The Design and Development of a Custom Text Annotator,” in IEEE Africon, 2019.
[12]A. I. Ansaldo, K. Marcotte, L. Scherer, and G. Raboyeau, “Language therapy and bilingual aphasia: Clinical implications of psycholinguistic and neuroimaging research,” J. Neurolinguistics, vol. 21, pp. 539–55, 2018.
[13]L. Muaka, “Language Perceptions and Identity among Kenyan Speakers,” in Proceedings of the 40th Annual Conference on African Linguistics, 2011.
[14]Z. Waseem and D. Hovy, “Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter,” in Proceedings of NAACL-HLT, 2016, pp. 88–93.
[15]Priya Gupta, Aditi Kamra, Richa Thakral, Mayank Aggarwal, Sohail Bhatti, Vishal Jain, "A Proposed Framework to Analyze Abusive Tweets on the Social Networks", International Journal of Modern Education and Computer Science, Vol.10, No.1, pp. 46-56, 2018.
[16]W. Warner and J. Hirschberg, “Detecting hate speech on the world wide web,” in Language in Social Media (LSM 2012), 2012.
[17]I. Kwok and Y. Wang, “Locate the hate: Detecting tweets against blacks,” AAAI, 2013.
[18]D. N. Gitari, Z. Zuping, H. Damien, and J. Long, “A lexicon-based approach for hate speech detection,” J. Multimed. Ubiquitous Eng., vol. 10, no. 4, pp. 215–230, 2015.
[19]E. Spertus, “Smokey: Automatic recognition of hostile Messages,” in IAAI, 1997.
[20]Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to protect adolescent online safety,” in The fourth ASE/IEEE international conference on social computing (SocialCom 2012), 2012.
[21]K. Dinakar, B. Jones, C. Havasi, H. Lieberman, and R. Picard, “Common sense reasoning for detection, prevention, and mitigation of cyberbullying,” ACM Trans. Interact. Intell. Syst., vol. 3, no. 2, 2012.
[22]C. Van Hee and G. De Pauw, “Automatic Detection and Prevention of Cyberbullying,” in The First International Conference on Human and Social Analytics, 2015.
[23]S. Agarwal and A. Sureka, “Using KNN and SVM Based One-Class Classifier for Detecting Online Radicalization on Twitter,” in The 11th International Conference on Distributed Computing and Internet Technology, 2015, pp. 431–442.
[24]M. Last, A. Markov, and A. Kandel, “Multi-lingual Detection of Terrorist Content on the Web,” in International Workshop on Intelligence and Security Informatics, 2006.
[25]E. Lozano, J. Cedeno, G. Castillo, F. Layedra, H. Lasso, and C. Vaca, “Requiem for online harassers: Identifying racism from political tweets,” in Fourth International Conference on eDemocracy & eGovernment (ICEDEG), 2017.
[26]S. Tulkens, L. Hilte, E. Lodewyckx, B. Verhoeven, and W. Daelemans, “The Automated Detection of Racist Discourse in Dutch Social Media,” CoRR, abs/1608.08738, 2016.
[27]T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated Hate Speech Detection and the Problem of Offensive Language,” in ICWSM, 2017.
[28]P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep Learning for Hate Speech Detection in Tweets,” in 2017 International World Wide Web Conference Committee, 2017.
[29]P. Fortuna, “Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes,” University of Porto, 2017.
[30]P. Burnap and M. L. Williams, “Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making,” Policy & Internet, vol. 7, no. 2, pp. 223–242, 2015.
[31]C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang, “Abusive Language Detection in Online User Content,” in 25th International Conference on World Wide Web, 2016, pp. 145–153.
[32]P. O’Sullivan and A. Flanagin, “Reconceptualizing ‘flaming’ and other problematic messages,” New Media Soc., vol. 5, pp. 69–94, 2003.
[33]A. Mahmud, K. Ahmed, and M. Khan, “Detecting Flames and Insults in Text,” in Proceedings of the 6th International Conference on Natural Language Processing, 2008.
[34]A. Razavi, D. Inkpen, S. Uritsky, and S. Matwin, “Offensive Language Detection Using Multi-level Classification,” Springer, p. 1627, 2010.
[35]I. Chaudhry, “Hashtagging hate: Using twitter to track racism online,” First Monday, vol. 20, no. 2, 2015.
[36]S. Liu and T. Forss, “New classification models for detecting Hate and Violence web content,” in 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), 2015, pp. 487–495.
[37]A. Gaydhani, V. Doma, S. Kendre, and L. Bhagwat, “Detecting Hate Speech and Offensive Language on Twitter using Machine Learning: An N-gram and TFIDF based Approach,” 2018.
[38]M. Hasanuzzaman, G. Dias, and A. Way, “Demographic Word Embeddings for Racism Detection on Twitter,” in Proceedings of the 8th International Joint Conference on Natural Language Processing, 2017, pp. 926–936.
[39]N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, and N. Bhamidipati, “Hate speech detection with comment embeddings,” in Proceedings of the 24th International Conference on World Wide Web (WWW 2015), 2015, pp. 29–30.
[40]J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” 2014. [Online]. Available: https://nlp.stanford.edu/pubs/glove.pdf. [Accessed: 19-Sep-2019].
[41]Z. Mossie and J.-H. Wang, “Social Network Hate Speech Detection for Amharic Language,” in COMIT, 2018, pp. 41–55.
[42]A. Al-Hassan and H. Al-Dosari, “Detection of Hate Speech in Social Networks: A Survey on Multilingual Corpus,” in 6th International Conference on Computer Science and Information Technology, 2019.
[43]D. Gamal, M. Alfonse, M. E.-H. El-Sayed, and A.-B. M. Salem, “Twitter Benchmark Dataset for Arabic Sentiment Analysis,” Int. J. Mod. Educ. Comput. Sci., vol. 11, no. 1, pp. 33–38, 2019.
[44]W. Alorainy, P. Burnap, H. Liu, and M. L. Williams, “‘The Enemy Among Us’: Detecting Cyber Hate Speech with Threats-based Othering Language Embeddings,” ACM, 2019.
[45]Kenya National Bureau of Statistics, “2019 Kenya Population and Housing Census Volume I: Population by County and Sub-County,” 2019.
[46]H. Kim, S. M. Jang, S.-H. Kim, and A. Wan, “Evaluating Sampling Methods for Content Analysis of Twitter Data,” Sage, 2018.
[47]A. E. Kim, H. M. Hansen, J. Murphy, A. K. Richards, J. Duke, and J. A. Allen, “Methodological Considerations in analyzing Twitter data,” J. Natl. Cancer Inst., vol. 47, pp. 140–146, 2013.
[48]P. Cavazos-Rehg et al., “A content analysis of depression-related tweets,” Comput. Hum. Behav., vol. 54, pp. 351–357, 2016.
[49]C. Shearer, The CRISP-DM model: the new blueprint for data mining. 2000.
[50]F. Provost and T. Fawcett, Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, 1st ed. O’Reilly Media, Inc., 2013.
[51]Z. Waseem, “Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter,” in EMNLP Workshop on NLP and CSS, 2016, pp. 138–142.
[52]W. Warner and J. Hirschberg, “Detecting Hate Speech on the World Wide Web,” in Language in Social Media (LSM 2012), 2012.
[53]P. Burnap and M. L. Williams, “Us and them: identifying cyber hate on twitter across multiple protected characteristics.,” EPJ Data Sci., 2016.
[54]R. King and G. M. Sutton, “High Times for Hate Crime: Explaining the Temporal Clustering of Hate Motivated Offending,” Criminology, vol. 51, no. 4, pp. 71–94, 2013.
[55]J. Brownlee, Deep Learning for Natural Language Processing, V1.2. 2018.
[56]D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[57]M. Elsherief, V. Kulkarni, D. Nguyen, W. Wang, and E. Belding, “Hate lingo: A target-based linguistic analysis of hate speech in social media,” in 12th International AAAI Conference on Web and Social Media, 2018, pp. 42–51.
[58]N. Coupland, “‘Other’ representation, Society and Language.” John Benjamins Publishing, 2010.
[59]G. R. Semin, “Linguistic Markers of Social Distance and Proximity.” 2009.
[60]M. Cikara, M. M. Botvinick, and S. T. Fiske, “Us versus them: Social identity shapes neural responses to intergroup competition and harm,” Psychol. Sci., vol. 22, no. 3, pp. 306–313, 2011.
[61]N. Haslam, “Dehumanization: An integrative review,” Personal. Soc. Psychol. Rev., vol. 10, pp. 252–64, 2006.
[62]F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[63]K. Krippendorff, “Computing Krippendorff’s Alpha-Reliability,” University of Pennsylvania ScholarlyCommons, 2011. [Online]. Available: http://repository.upenn.edu/asc_papers/43.
[64]W. Clyne, S. Pezaro, K. Deeny, and R. Kneafsey, “Using Social Media to Generate and Collect Primary Data: The #ShowsWorkplaceCompassion Twitter Research Campaign,” JMIR Public Health Surveill., vol. 4, no. 2, p. e41, 2018.
[65]T. A. van Dijk, “Discourse and racism,” in The Blackwell Companion to Racial and Ethnic Studies, pp. 145–159, 2002.
[66]S. Sood, J. Antin, and E. Churchill, “Profanity use in online communities,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1481–1490.
[67]Edward Ombui, Lawrence Muchemi, Peter Wagacha, "Building and Annotating a Codeswitched Hate Speech Corpora", International Journal of Information Technology and Computer Science, Vol.13, No.3, pp.33-52, 2021.
[68]A. Schmidt and M. Wiegand, “A Survey on Hate Speech Detection using Natural Language Processing,” SocialNLP@EACL, 2017.
[69]P. Fortuna, J. Rocha da Silva, J. Soler-Company, L. Wanner, and S. Nunes, “A Hierarchically-Labeled Portuguese Hate Speech Dataset,” in Proceedings of the Third Workshop on Abusive Language Online, 2019, pp. 94–104.
[70]B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, and M. Wojatzki, “Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis,” arxiv:1701.08118, vol. 1, 2017.
[71]Y. R. Tausczik and J. W. Pennebaker, “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods,” J. Lang. Soc. Psychol., vol. 1, no. 29, 2010.
[72]E. Alpaydin, Introduction to Machine Learning, 2nd ed. London: The MIT Press, 2010.
[73]L. Chen, “Support Vector Machine — Simply Explained,” Towards Data Science, 2019. [Online]. Available: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496. [Accessed: 02-Apr-2020].
[74]J. Brownlee, Master Machine Learning Algorithms: Discover How They Work and Implement Them From Scratch. 2016.
[75]F. Peng, D. Schuurmans, V. Keselj, and S. Wang, “Language independent authorship attribution with character level n-grams,” in 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003, pp. 267–274.
[76]J. Kruczek, P. Kruczek, and M. Kuta, “Are n-gram Categories Helpful in Text Classification?,” in International Conference on Computational Science, 2020, pp. 524–537.