Generating Pashto Clitics

Aziz, Ud Din (2017) Generating Pashto Clitics. PhD thesis, Universiti Malaysia Sarawak (UNIMAS).

PDF (Please get the password by email to repository@unimas.my, or call ext: 3914 / 3942 / 3933)
Aziz Ud Din ft.pdf
Restricted to Registered users only

Abstract

Automatically generating natural text that is perceived as grammatically correct remains a challenging task. The generated text must at least be coherent, accurate, and understandable. This research concerns the automatic generation of clitics in Pashto texts, since native Pashto speakers use clitics extensively in everyday conversation and writing. A clitic is a word or particle that cannot bear accent or stress and leans phonetically on an adjacent accented word. Pashto, spoken in Pakistan and Afghanistan, is one of several languages featuring clitics. There are two main types of clitics in Pashto: second-position (2P) clitics and endoclitics. The linguistic behaviours of these clitics are studied and formalised into rules. The design of the Pashto clitic generation system is approached in two ways. In the first approach, the system generates cliticised sentences from the semantic representation of the sentences. This system has been implemented using Combinatory Categorial Grammar (CCG). The second approach operates on the surface representation of sentences. It uses syntactic pattern-matching rules to identify and generate clitics at the sentence level. In this system, text can be generated separately, so that clitic generation rules are applied to the sentences afterwards as a post-processing step. This system has been implemented in Python. The main advantage of this method is the separation of the clitic generation task from the text generation task. The evaluation of the proposed solutions has been constrained mainly by the lack of a morphosyntactically annotated corpus and of language processing tools for Pashto. Nevertheless, two independent corpora were developed. The first corpus contained semantic representations for generating 12 sentences based on the Pashto CCG grammar. The second corpus consisted of 256 syntactically annotated sentences to evaluate the Python-based clitic generation system. The system is capable of generating all Pashto clitics, including endoclitics, the most challenging type because of the many constraints on their generation. All of the target sentences are successfully realised by the CCG grammar. The Python-based Pashto clitic generator achieves an accuracy of 89.62% on the test corpus. The sentences generated incorrectly by the Python-based generator have been fed to the CCG generator to evaluate the agreement between the two systems. The accuracy achieved in this case is 87.5%.
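
The abstract does not reproduce the thesis's actual rules, so the following is only a minimal sketch of what the second approach (a post-processing pass over already generated, syntactically annotated sentences) might look like. It assumes a deliberately simplified placement rule, namely that a second-position clitic follows the first stress-bearing constituent of the clause. The `Chunk` type, the `stressed` flag, the function name, and the placeholder tokens (`W1`, `CL`, and so on) are all hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One annotated constituent (hypothetical annotation scheme)."""
    tokens: List[str]
    label: str              # e.g. "NP" or "VP"; illustrative labels only
    stressed: bool = True   # assumption: only stressed chunks can host a clitic


def place_second_position_clitic(chunks: List[Chunk], clitic: str) -> List[str]:
    """Return the token sequence with `clitic` inserted after the first
    stress-bearing chunk (a simplified, assumed 2P placement rule)."""
    out: List[str] = []
    placed = False
    for chunk in chunks:
        out.extend(chunk.tokens)
        if not placed and chunk.stressed:
            out.append(clitic)
            placed = True
    if not placed:
        raise ValueError("no stress-bearing host found for the clitic")
    return out


# Placeholder tokens stand in for Pashto words.
sentence = [Chunk(["W1", "W2"], "NP"), Chunk(["W3"], "VP")]
print(place_second_position_clitic(sentence, "CL"))
# -> ['W1', 'W2', 'CL', 'W3']
```

Because the function only consumes annotated chunks, it can run after any text generator, which mirrors the separation of clitic generation from text generation that the abstract names as the main advantage of this approach. Endoclitics, which the abstract describes as subject to many constraints, are not covered by this sketch.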

Item Type: Thesis (PhD)
Additional Information: Thesis (PhD.) - Universiti Malaysia Sarawak, 2017.
Uncontrolled Keywords: Clitic, Pashto, clitic generation rules, combinatory categorial grammar, unimas, university, universiti, Borneo, Malaysia, Sarawak, Kuching, Samarahan, ipta, education, Postgraduate, research, Universiti Malaysia Sarawak.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Academic Faculties, Institutes and Centres > Faculty of Computer Science and Information Technology
Faculties, Institutes, Centres > Faculty of Computer Science and Information Technology
Depositing User: Gani
Date Deposited: 11 Sep 2020 07:31
Last Modified: 08 Mar 2023 04:59
URI: http://ir.unimas.my/id/eprint/31749