FBE - Computer Engineering Graduate Program - Doctorate
Building of Turkish PropBank and Semantic Role Labeling of Turkish (Institute of Science and Technology, 2018-01-16) Şahin, Gözde Gül; Adalı, Eşref; 504122519; Computer Engineering; Bilgisayar Mühendisliği

Understanding human language has long been a dream of humankind. Although early science-fiction films predicted that this dream would have come true by now, it has not. The reasons are varied: ambiguity, the need for context, the need for common-sense knowledge, and the variety of word and sentence structures, among others. There have been attempts to disambiguate word meanings, analyze language structures, and model common-sense knowledge toward this goal; however, it remains ongoing research with many subfields. In this thesis, we are interested in one of these subfields: shallow semantic parsing, or semantic role labeling (SRL). SRL decomposes the understanding problem into identifying action/event-bearing units and their participants. In this way, the same representation can be produced independently of sentence structure (e.g., "Economy grew by 5%" and "The growth of the economy was 5%", or "The window broke" and "Stone broke the window"). The output representations of this task can benefit other natural language understanding tasks such as information retrieval, sentiment analysis, question answering, and textual entailment. To perform this task, a resource containing the meanings of action/event-bearing units (in our case, verbs) and their frequent participants, called a Proposition Bank (PropBank), must be created to guide machine learning techniques. Unfortunately, creating such a resource requires a large amount of time, budget, and linguistic expertise, and it has therefore not been feasible for low-resource languages like Turkish. In this thesis, we aim to address this issue by incorporating crowd intelligence into the construction workflow.
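The normalization that SRL provides can be sketched as follows. This is an illustrative example only, not the thesis system; the predicate sense label `break.01` and role names follow the general PropBank convention (Arg0 ≈ agent, Arg1 ≈ affected entity).

```python
# Illustrative sketch: SRL maps different surface forms of the same
# event onto one shared predicate-argument representation.

def srl_frame(predicate, arguments):
    """Build a simple predicate-argument representation."""
    return {"predicate": predicate, "arguments": arguments}

# "The window broke" and "Stone broke the window" differ in syntax,
# but the thing that breaks fills the same role (Arg1) in both:
f1 = srl_frame("break.01", {"Arg1": "the window"})
f2 = srl_frame("break.01", {"Arg0": "stone", "Arg1": "the window"})

assert f1["arguments"]["Arg1"] == f2["arguments"]["Arg1"]
```

Because downstream tasks consume these frames rather than raw word order, both sentences answer the question "what broke?" identically.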
We design a novel workflow that requires a minimum number of experts with linguistic knowledge. Experts are employed (1) for the first, crucial step, in which semantic frames are created manually; (2) to supply a quality-control mechanism by labeling a small number of questions; and (3) to double-check the answers of crowd taskers when the taskers cannot agree on an answer. Further challenges in creating such a resource are posed by the rich morphology of Turkish. To address its extremely productive word formation, which yields a theoretically infinite number of action-bearing units, we propose to exploit the semantic knowledge acquired when root verbs compose with regular morphosemantic features such as case markers. We evaluate our overall approach to building the Turkish PropBank with various inter-annotator agreement metrics and show that the resource is of high quality. Although creating a resource is crucial, it is not sufficient for automatic labeling of semantic roles. The second part of this thesis therefore focuses on building automatic methods suitable for Turkish. For that purpose, we adopt a statistical machine learning system based on linguistic features designed mostly for high-resource, morphologically poor languages. The Turkish language, however, poses the following challenges: (1) a significant number of out-of-vocabulary words (words not seen during training), (2) a small number of training instances, and (3) high syntactic variance between predicates and their arguments. These issues cause very sparse features that complicate the learning process of the statistical system. We address these challenges by (1) designing better features that exploit the regularity of morphosemantics and are thus less sparse than previous ones, and (2) taking advantage of pretraining on unlabeled data, in other words, exploiting prior knowledge of Turkish words learned through word embeddings.
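The sparsity argument behind morphosemantic features can be illustrated with a small sketch. The case-to-feature mapping below is simplified and invented for illustration (it is not the thesis feature set): the point is that a Turkish case label such as dative or ablative recurs regularly across the vocabulary, so it generalizes where a fully inflected word form would be seen only once.

```python
# Hypothetical sketch of a morphosemantic feature. Turkish case markers
# correlate regularly with semantic roles (e.g. ablative often marks a
# source), so the case label is a far less sparse feature than the full
# inflected word form.

CASE_FEATURE = {
    "Acc": "case=Acc",  # accusative: often the affected entity
    "Dat": "case=Dat",  # dative: often a goal/recipient
    "Abl": "case=Abl",  # ablative: often a source
    "Nom": "case=Nom",
}

def features(token, case):
    # One sparse lexical feature plus one dense, regular morphosemantic one.
    return [f"word={token}", CASE_FEATURE.get(case, "case=Other")]

# The inflected form "pencereyi" ("the window", accusative) may be rare,
# but "case=Acc" fires for every accusative-marked word:
print(features("pencereyi", "Acc"))  # ['word=pencereyi', 'case=Acc']
```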
We show that our approach yields the first robust Turkish SRL system, with an F1 score of 79.84. Our experiments with training-data size and features show that (1) morphosemantic features are vital for Turkish SRL; (2) a reasonable SRL system can be trained with the proposed features on 60% of the available data; (3) performance degrades greatly in the absence of high-level syntactic features; and (4) continuous features model complex interactions between information levels and further improve the scores. Although the statistical SRL system has been shown to be successful in the presence of gold tags, it suffers from the accumulating errors of the external NLP tools required for feature extraction. To address this problem, we introduce a neural SRL system that employs bidirectional long short-term memory (bi-LSTM) units operating on subword units, requiring no (or only minimal) syntactic preprocessing. Unlike previous techniques that use pretrained word embeddings, the proposed model generates a word embedding by composing its subword units. Existing subword composition techniques make no distinction between morphology types. To distinguish derivational from inflectional morphology, we propose a linguistically motivated composition technique and systematically analyze the effect of subword and composition types. We show that (1) character-based models with bi-LSTM composition perform similarly to models that use morphological information for morphologically poor languages, whereas a drop of at least 3 percentage points in F1 is observed for morphologically rich languages; and (2) the linguistically motivated composition method surpasses other techniques for Turkish SRL. We also evaluate various techniques for combining multiple subword units, to test whether subwords learn complementary features for argument labeling.
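The idea of building word embeddings from subword units can be sketched as follows. This is a simplified stand-in, not the thesis model: the thesis composes subword vectors with bi-LSTMs, whereas mean pooling is used here to keep the example short, and the embedding dimension and segmentations are illustrative.

```python
import numpy as np

# Simplified sketch of subword composition. A word's embedding is built
# from its subword units instead of looked up in a fixed table, so even
# out-of-vocabulary inflected forms receive a representation.

rng = np.random.default_rng(0)
DIM = 8
emb = {}  # subword unit -> vector, created on demand

def vec(unit):
    if unit not in emb:
        emb[unit] = rng.standard_normal(DIM)
    return emb[unit]

def chars(word):
    # Character segmentation: one unit per character.
    return list(word)

def char_trigrams(word):
    # Character trigrams with boundary markers, e.g. "#ev#" -> ["#ev", "ev#"].
    w = f"#{word}#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

def compose(units):
    # Stand-in for bi-LSTM composition: mean-pool the unit vectors.
    return np.mean([vec(u) for u in units], axis=0)

# An unseen inflected form still gets an embedding from its subwords:
w = compose(chars("pencereyi"))
t = compose(char_trigrams("pencereyi"))
print(w.shape, t.shape)  # (8,) (8,)
```

Combining multiple subword types, as evaluated in the thesis, would amount to composing each segmentation separately and merging the resulting vectors (e.g., by concatenation).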
We show that combining characters with character trigrams improves the scores in all cases, whereas combining characters with morphology does not help most morphologically rich languages, suggesting that characters do not capture any information that is not already embedded in the morphological models. Finally, all resources are made accessible to encourage researchers to work on the Turkish language.