Abstract meaning representation of Turkish

Date
2022
Authors
Oral, Kadriye Elif
Publisher
Graduate School
Abstract
In this thesis, we focus on the Abstract Meaning Representation (AMR) for Turkish. AMR is a sentence-level representation that summarises the semantic aspects of a sentence. Its goal is to abstract away from syntactic features, so that sentences with the same meaning receive the same semantic representation regardless of how they are phrased. It is also easily readable by humans, which is convenient for researchers working in this area. AMR was designed for English, but it can be adapted to other languages by taking language-specific issues into account. Doing so requires an AMR guideline that defines language-specific annotation rules. In this thesis, we present Turkish AMR representations by creating an AMR annotation guideline for Turkish. Turkish is a morphologically rich, pro-drop and agglutinative language, which causes its representations to deviate from English AMR. In creating the Turkish guideline, we examine Turkish phenomena in detail and propose AMR representations for these points of deviation. In addition, we present the first AMR corpus for Turkish, which contains 700 AMR-annotated sentences. Creating such resources is not an easy task: it requires linguistic training, a large amount of time, and a systematic annotation strategy. We adapt the model-annotate-model-annotate strategy to our annotation task, i.e., instead of dealing with all phenomena at once, we follow a stepwise path. In the first iteration, we follow a data-driven approach and handle the Turkish-specific structures that are present in the data. In the second iteration, we use knowledge bases such as Turkish dictionaries and grammar books to cover the remaining linguistic phenomena. This strategy allows us to build the corpus at the same time as the guideline.
Instead of annotating the sentences from scratch, we use a semi-automatic annotation approach in which a parser first processes the sentences and outputs AMR graphs, which are then corrected or re-annotated by two annotators who are native speakers. We implement a rule-based parser inspired by methods used in the literature. Our rule-based parser is very similar to transition-based parsers, but its actions are driven by a rule list rather than an oracle. We design the parser this way because our goal is to develop an unsupervised parser that makes use of the available resources. We evaluate our proposed solutions and the rule-based parser using the semantic match score (Smatch). This score reflects the quality of our corpus and the accuracy of our parser. The inter-annotator agreement between our annotators is a Smatch score of 0.89, and the rule-based parser achieves a Smatch score of 0.60, which is a strong baseline for the Turkish AMR parsing task. The final part of this thesis deals with the development of a data-driven AMR parser. We formalize this parser as a two-step pipeline of multiple classifiers, each with a different function. The first step of the data-driven parser is to identify the concepts to be used in the AMR graphs. Nine separate classifiers are trained for this task.
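As a rough illustration of what a Smatch-style score measures, the sketch below computes an F1 score over the triples of two AMR graphs. It is a simplification and not the thesis's evaluation code: real Smatch also searches for the best mapping between the variables of the two graphs, whereas this sketch assumes both graphs already use the same variable names. The function name `triple_f1`, the triple layout, and the English example graph (roughly "The boy wants to go", with illustrative frame labels) are all assumptions made for this example.

```python
from typing import Set, Tuple

# A triple is (relation, source, target); this layout is an assumption of the sketch.
Triple = Tuple[str, str, str]


def triple_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """F1 over exactly matching triples, assuming variables are already aligned."""
    if not predicted or not gold:
        return 0.0
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold graph, written as triples:
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
gold = {
    ("instance", "w", "want-01"),
    ("instance", "b", "boy"),
    ("instance", "g", "go-02"),
    ("ARG0", "w", "b"),
    ("ARG1", "w", "g"),
    ("ARG0", "g", "b"),
}

# Hypothetical parser output that misses the re-entrant ARG0 edge of "go".
predicted = gold - {("ARG0", "g", "b")}

score = triple_f1(predicted, gold)
print(f"triple F1 = {score:.3f}")
```

In this example the predicted graph has five of the six gold triples and no spurious ones, so precision is 1 and recall is 5/6. A score such as the reported 0.89 inter-annotator agreement is read the same way, as an overlap between two graphs for the same sentence.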
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
Keywords
Abstract Meaning Representation