Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data

Afifah, Mohd Shamsuddin and Sarah Flora, Samson Juan and Stephanie, Chua and Arif, Bramantoro (2024) Semi-Automatic Sentiment Identification for Malay-English Code-Switched Data. Journal of Advanced Research Design, 123 (1). pp. 198-212. ISSN 2289-7984

PDF
5640-Article Text-28397-1-10-20250109.pdf
Download (941kB)

Official URL: https://www.akademiabaru.com/submit/index.php/ard/...

Abstract

Informal online communication on social media platforms like Twitter, Facebook, and YouTube often involves code-switching between languages, notably Malay and English, in Malaysia due to its diverse society. This phenomenon challenges sentiment analysis tasks, as the intermixing of languages within sentences or phrasesincreases the likelihood of inaccurately classified sentiment. Sentiment analysers built with models trained with monolingual data will cause misclassification due to out-of-vocabulary issues, thus limiting the efficacy of these analysers on code-switcheddata. Plus, obtaining a code-switched corpus annotated with sentiment labels is scarce, and investigations on sentiment analysis on code-switched social media data are lacking, particularly on Malaysian social media posts. We proposed MESocSentiment, a Malay-English social media corpus with sentiment labels constructed via a semi-automatic sentiment identification method to address this challenge. The framework leveraged existing language identifiers and sentiment analysers to annotate code-switched data. We collected 229,566 tweets containing #Malaysia between August 12, 2022, and May 15, 2023. Using our strategy, we identified 19714 code-switched posts containing Malay and English words from the collection. Our analysis showed that 78.23% of the corpus had neutral sentiments, while 16.32% were positive and 5.44% negative. Furthermore, the descriptive analysis of tweet length revealed a range spanning 43 words, with a mean of 8.92 and a standard deviation of 5.56 words. This comprehensive framework contributes to a deeper understanding of sentiment expression within code-switched social media data, particularly in the context of Malaysia's linguistic and cultural diversity. Additionally, the MESocSentiment corpus is published on GitHub for future research.

Item Type:	Article
Uncontrolled Keywords:	Sentiment analysis; code-switching; sentiment classification; natural language processing.
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology Faculties, Institutes, Centres > Faculty of Computer Science and Information Technology Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology
Depositing User:	Samson Juan
Date Deposited:	22 Jan 2025 04:02
Last Modified:	22 Jan 2025 04:02
URI:	http://ir.unimas.my/id/eprint/47390

Actions (For repository members only: login required)

View Item