Using wordnet to enhance feature selection in automated text categorization

Chua, Stephanie Hui Li (2004) Using wordnet to enhance feature selection in automated text categorization. Masters thesis, Universiti Malaysia Sarawak, (UNIMAS).

Abstract

In the field of automated text categorization, the large dimensionality of the feature space is a major problem because it involves extensive computation. Feature selection is one approach to reducing the dimensionality of the feature space. This research explores the use of WordNet (Miller et al., 1990), a lexical database, for performing feature selection in an automated text categorization system. The WordNet-based approach employs lexical and semantic information for feature selection. WordNet allows the selection of terms that are lexically and semantically representative of a category of documents, as opposed to the statistical approaches traditionally used for feature selection.

We proposed three WordNet-based approaches for feature selection. The first, the WordNet nouns approach, selects all nouns in WordNet that occur in each category as features. The second is based on lexical semantics and selects synonymous terms that co-occur in a category, while the third combines the lexical semantics approach with statistical feature selection methods. The lexical semantics approach performed better than the WordNet nouns approach, with a reduction of more than 40% in the feature space, in experiments using the Reuters-21578 dataset. The lexical semantics approach also outperformed popular statistical feature selection methods, namely Chi-Square (Chi2) and Information Gain (IG). The combined approach improved the performance of the statistical methods. WordNet has successfully been used to enhance feature selection, highlighting the possibility of determining semantic features automatically. The limitations of the lexical semantics approach are also highlighted, and an improved framework and an extension are proposed to overcome them.
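As an illustration only, the idea of selecting synonymous terms that co-occur in a category might be sketched as follows. The abstract does not give the thesis's actual procedure, so this is one possible reading rather than the author's method. The synonym sets are a hypothetical hard-coded stand-in for WordNet synsets, and the function name, tokenization by whitespace, and the "at least two members present" rule are all assumptions.

```python
# Hypothetical toy synonym sets standing in for WordNet synsets.
# Assumption: a real system would look these up in WordNet instead.
TOY_SYNSETS = [
    {"profit", "earnings", "gain"},
    {"oil", "crude", "petroleum"},
    {"ship", "vessel"},
]


def select_synonym_features(category_docs, synsets=TOY_SYNSETS):
    """For each category, keep terms that co-occur with at least one synonym.

    category_docs maps a category name to a list of document strings.
    A term is kept when two or more members of the same synonym set
    appear somewhere in that category's documents (an assumed rule).
    """
    selected = {}
    for category, docs in category_docs.items():
        # Build the category vocabulary with naive whitespace tokenization.
        vocab = set()
        for doc in docs:
            vocab.update(doc.lower().split())

        features = set()
        for synset in synsets:
            present = synset & vocab
            # Keep the terms only if synonyms actually co-occur.
            if len(present) >= 2:
                features |= present
        selected[category] = features
    return selected


docs = {
    "earn": ["Profit rose as earnings beat forecasts", "Quarterly gain reported"],
    "crude": ["Crude oil prices fell", "The vessel carried petroleum"],
}
features = select_synonym_features(docs)
```

In this toy run, "vessel" is not selected for the "crude" category because its synonym "ship" never appears there, whereas "oil", "crude" and "petroleum" are kept because they co-occur.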

Item Type: Thesis (Masters)
Additional Information: Thesis (M.Sc.) -- Universiti Malaysia Sarawak, 2004.
Uncontrolled Keywords: Text processing, Computer science, Text editors, Computer programs, software, unimas, university, universiti, Borneo, Malaysia, Sarawak, Kuching, Samarahan, ipta, education, Postgraduate, research, Universiti Malaysia Sarawak
Subjects: T Technology > T Technology (General)
Divisions: Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology
Faculties, Institutes, Centres > Faculty of Computer Science and Information Technology
Depositing User: Karen Kornalius
Date Deposited: 05 Jul 2016 02:03
Last Modified: 10 May 2023 06:57
URI: http://ir.unimas.my/id/eprint/12604
