Generic named-entity recognition for indigenous languages of Sarawak (Nersil)

Yong, Soo Fong (2013) Generic named-entity recognition for indigenous languages of Sarawak (Nersil). Masters thesis, Universiti Malaysia Sarawak (UNIMAS).

The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in the identification and classification of named entities (NEs). The NEs considered in this research are Person, Location, Organisation, Date, Time, Monetary and Percentage. All of these NEs carry important information about the text itself and are therefore targets for extraction. NER approaches can be broadly categorised as rule-based, machine learning-based, and hybrid. The rule-based approach relies on hand-crafted linguistic grammars. The machine learning-based approach needs a large amount of annotated training data, which is unavailable for SILs. The hybrid approach combines the rule-based and machine learning-based approaches. NERSIL requires special attention because the existing NER approaches cannot be applied directly to SILs. This thesis presents an NER system built by extending and modifying the existing NER approaches. There are three main processes: the non-modified ANNIE (A Nearly-New IE system) NER, the adaptation of ANNIE to SILs, and finally the context investigation. First, the input texts are submitted to an English NER system, in this case ANNIE, on the assumption that some NEs that appear in English texts will also occur in SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules for recognised NEs. Next, new rules for the unrecognised NEs are written and new gazetteers for SILs are built in order to identify more NEs. However, the first two processes are not enough to provide good accuracy in recognising all NEs, so context investigation is needed. Context investigation includes frequency analysis, trigger-word filtering, and concordance analysis; the context of an NE (the words to its left or right) is investigated. Finally, an NER system designed for SILs will be an advancement of world knowledge. In addition, the design can be improved by incorporating machine translation and WordNet, and by adding more noise filtering (e.g. context filtering and morphological filtering). With further research, this NER system is expected to reach a level of performance comparable to that of English NER systems.
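
The gazetteer lookup and context investigation described above can be illustrated with a minimal sketch. It is not the thesis implementation and does not use ANNIE or any GATE API; the gazetteer entries, the function names, and the English sample sentence (standing in for SIL text) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical gazetteer: illustrative entries only, not taken from the thesis.
GAZETTEER = {
    "Kuching": "Location",
    "Samarahan": "Location",
    "UNIMAS": "Organisation",
}


def find_entities(tokens):
    """Return (index, token, label) for every token listed in the gazetteer."""
    return [(i, tok, GAZETTEER[tok]) for i, tok in enumerate(tokens) if tok in GAZETTEER]


def concordance(tokens, index, window=2):
    """Return the left and right context of the token at `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right


def context_frequencies(tokens, label, window=1):
    """Count the words that surround entities carrying the given label."""
    counts = Counter()
    for i, _, found_label in find_entities(tokens):
        if found_label == label:
            left, right = concordance(tokens, i, window)
            counts.update(word.lower() for word in left + right)
    return counts


# English placeholder sentence; a real run would use tokenised SIL text.
tokens = "He travelled from Kuching to Samarahan and then to UNIMAS".split()
entities = find_entities(tokens)
location_context = context_frequencies(tokens, "Location")
print(entities)
print(location_context.most_common(1))
```

Words that occur frequently next to known entities, such as the preposition counted here, are candidates for trigger words that could help flag NEs missing from the gazetteer.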

Item Type: Thesis (Masters)
Additional Information: Thesis (M.Sc.) -- Universiti Malaysia Sarawak, 2013.
Uncontrolled Keywords: Named Entity Recognition (NER) system, Computer software, Software architecture, unimas, university, universiti, Borneo, Malaysia, Sarawak, Kuching, Samarahan, ipta, education, postgraduate, research, Universiti Malaysia Sarawak
Subjects: T Technology > T Technology (General)
Divisions: Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology
Depositing User: Karen Kornalius
Date Deposited: 20 Jul 2015 01:50
Last Modified: 19 May 2020 04:23