Mohammad Firdaus, Johari and Chiew, Kang Leng and Abdul Razak, Hosen and Kelvin S., C.Yong and Adnan Shahid, Khan and Irshad Ahmed, Abassi and Daniel, Grzonka (2025) Key insights into recommended SMS spam detection datasets. Scientific Reports, 15 (8162). pp. 1-24. ISSN 2045-2322
Abstract
Short Message Service (SMS) spam poses significant risks, including financial scams and phishing attempts. Although numerous datasets from online repositories have been used to address this issue, little attention has been given to evaluating their effectiveness and their impact on SMS spam detection models. This study fills this gap by assessing the performance of ten SMS spam detection datasets using Decision Tree and Multinomial Naïve Bayes models. Datasets were evaluated based on accuracy and on qualitative factors such as authenticity, class imbalance, feature diversity, metadata availability, and preprocessing needs. Because the datasets are multilingual, experiments were conducted with two stopword removal groups: one in English and another in the respective non-English languages. Based on these findings, the study recommends Dataset 5 for future SMS spam detection research. The recommendation rests on two pieces of evidence: the dataset's high qualitative assessment score of 3.8 out of 5.0, attributed to its high feature diversity, real-world complexity, and balanced class distribution, and the low detection rate of 86.10% obtained on it with Multinomial Naïve Bayes. Recommending a dataset that makes high model performance hard to achieve fosters the development of more robust and adaptable spam detection models capable of handling diverse forms of noise and ambiguity. Furthermore, selecting the dataset with the highest qualitative score enhances research quality, improves model generalizability, and mitigates risks related to bias and inconsistencies.
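The evaluation loop described in the abstract (stopword removal, a Multinomial Naïve Bayes classifier, accuracy as the metric) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the stopword list, the toy messages, and the names `tokenize`, `MultinomialNB`, and `accuracy` are all hypothetical, and the Decision Tree model and the non-English stopword group are not covered.

```python
import math
from collections import Counter

# Hypothetical, tiny English stopword list (assumption); a real experiment
# would use a full list for each language present in the dataset.
ENGLISH_STOPWORDS = {"a", "an", "the", "to", "you", "is", "at", "for", "your", "are", "we"}


def tokenize(text, stopwords):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (tok.strip(".,!?:;\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in stopwords]


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels, stopwords):
        self.stopwords = stopwords
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text, stopwords))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for tok in tokenize(text, self.stopwords):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy, made-up messages (assumption); not drawn from any of the ten datasets.
train_texts = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Are we meeting for lunch today",
    "Call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free prize waiting", "lunch at home"]
test_labels = ["spam", "ham"]

model = MultinomialNB().fit(train_texts, train_labels, ENGLISH_STOPWORDS)
test_accuracy = accuracy(model, test_texts, test_labels)
```

Swapping `ENGLISH_STOPWORDS` for a list in another language and refitting would mimic, in miniature, the second stopword removal group mentioned in the abstract.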
Item Type: | Article |
---|---|
Uncontrolled Keywords: | SMS spam detection, Dataset evaluation, Stopwords removal, Machine learning, Dataset recommendation. |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Divisions: | Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology; Faculties, Institutes, Centres > Faculty of Computer Science and Information Technology |
Depositing User: | Gani |
Date Deposited: | 14 Mar 2025 00:48 |
Last Modified: | 14 Mar 2025 00:48 |
URI: | http://ir.unimas.my/id/eprint/47774 |