Building Standard Offline Anti-phishing Dataset for Benchmarking

Chiew, Kang Leng and Ee, Hung Chang and Choon, Lin Tan and Johari, Bin Abdullah and Sheng, Kelvin Chek Yong (2018) Building Standard Offline Anti-phishing Dataset for Benchmarking. International Journal of Engineering & Technology, 7 (4.31). pp. 7-14. ISSN 2227-524X

Official URL: https://www.sciencepubco.com/index.php/ijet

Abstract

Anti-phishing research is an active field in information security. Because no publicly accessible standard test dataset exists, most researchers run their experiments on their own datasets, which makes benchmarking across different anti-phishing techniques challenging and inefficient. In this paper, we propose and construct a large-scale standard offline dataset that is downloadable, universal and comprehensive. In designing the dataset creation approach, major anti-phishing techniques from the literature were thoroughly considered to identify their unique requirements. This requirement study identified several factors that influence dataset quality: the type of raw elements, the source of the samples, the sample size, the website category, the category distribution, the language of the website and the support for feature extraction. These factors are the core of the proposed dataset construction approach, which produced a collection of 30,000 samples of phishing and legitimate webpages, with 50 percent of each type. The dataset is therefore useful and compatible with a wide range of anti-phishing research for benchmarking, and it also helps researchers conduct rapid proof-of-concept experiments. With the rapid development of anti-phishing research to counter the fast evolution of phishing attacks, the need for such a dataset cannot be overemphasised. The complete dataset is available for download at http://www.fcsit.unimas.my/research/legit-phish-set.

Item Type: E-Article
Uncontrolled Keywords: Anti-phishing; Dataset for benchmarking; Features; Legitimate and phishing webpages; unimas; university; universiti; Borneo; Malaysia; Sarawak; Kuching; Samarahan; ipta; education; research; Universiti Malaysia Sarawak.
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology
Depositing User: Gani
Date Deposited: 03 Jan 2019 08:15
Last Modified: 10 Jul 2019 08:19
URI: http://ir.unimas.my/id/eprint/22983
