Yeni Bir Sözdizimsel İşaretleme Yönteminin Kullanımıyla Türkçe'nin İstatistiksel Ayrıştırma Başarımının Artırılması

Date
2015-06-26
Authors
Sulubacak, Umut
Publisher
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Abstract
In this work, we present a critical analysis of the dependency grammar used in the METU-Sabancı Turkish Treebank (MST), the only existing Turkish dependency treebank, and then propose a new and improved dependency annotation framework for Turkish. The new framework emphasizes minimalism and ease of manual annotation; with only 16 dependency types in place of the 26 dependency labels of the original framework, it is clearer and more intuitive without any loss of expressive power. As the first applications of the annotation framework, two new treebanks are introduced: 1) a new version of the MST named the ITU-METU-Sabancı Treebank (IMST), annotated with high-performing morphological tags and the new dependency types, and 2) the ITU Web Treebank (IWT), the first syntactically annotated Turkish corpus compiled from non-canonical user-input sentences on the web. Both new treebanks are annotated with deep dependencies so that the resources can also serve future semantic role labeling studies. In order to measure the performance of the proposed annotation schemes relative to the original annotation framework, detailed investigations were carried out on the new treebanks and are reported in this work. The experimental results show that the performance gain is related not merely to the reduction of the dependency label set, but rather to the annotation method prescribed by the new dependency grammar, which is more intuitive and coherent. The multi-headed dependency representation introduced in this work enables a new evaluation metric under which a prediction is accepted if it matches any one of the heads annotated in the gold standard.
According to the results of the parsing experiments, the best model introduced for the IMST attains a labeled attachment score of 75.1% in single-headed evaluation and 75.7% in multi-headed evaluation, surpassing by a large margin the score of 65.9%, the best labeled attachment score obtained for the MST to date. In addition, cross-validation of the IWT yields highly promising labeled attachment scores of 78.7% in single-headed evaluation and 80.1% in multi-headed evaluation.
In this work, we present a critical analysis of the dependency grammar that has come to be the de facto standard for Turkish language processing studies. Although widely recognized and used in several Turkish corpora, including the well-known METU-Sabancı Treebank (MST), the only major syntactically annotated Turkish corpus to date, the grammar is partly outdated and leaves room for improvement and extension. Moreover, the METU-Sabancı Treebank itself is often criticized for its inconsistent annotation and difficulty of parsing. Many recent studies on the syntactic parsing of Turkish have focused on fine-tuning specific aspects of their parsing frameworks and have failed to achieve pivotal overall progress in parsing performance. We take a detour from specific case studies that would only yield local performance improvements, and delve into the entire structure of the annotation framework. We investigate the current Turkish annotation conventions in detail, identify flaws and deficiencies with respect to both manual annotation and automatic parsing, and then propose measures that might be taken to alleviate these issues. Furthermore, as web data become increasingly available for study and the ability to efficiently parse non-canonical sentences gains importance, we place special emphasis on making dependency annotation as lenient on non-canonical texts as possible. The extent to which the colloquial language employed by social web users differs from well-edited formal language is indeed very large, and orthographically normalizing non-canonical sentences in a pre-processing routine is often not enough to render them as successfully parsable as edited formal texts. As part of this work, we also attempt to parametrize the ways in which the language of the web differs, and likewise suggest morphosyntactic reforms that would likely improve parsing performance. In accordance with our findings, we also propose a new, improved dependency annotation framework for Turkish.
The proposed framework additionally focuses on minimalism and ease of manual annotation, featuring only 16 dependency types that are decidedly more coherent and intuitive than the 26 labels of the original framework. We justify every change we propose to the original dependency grammar either by showing conformity with the design principles we explain or by demonstrating overlap with universally recognized conventions that have long been established. As the first implementations of the proposed annotation framework, we introduce two new treebanks: 1) a new version of the METU-Sabancı Treebank keeping the same token structure and morphosyntactic features but reannotated with the new dependency types, for which we propose the name ITU-METU-Sabancı Treebank (IMST) in recognition of the considerable previous effort on the original treebank as well as our contribution, and 2) the ITU Web Treebank (IWT), the first Turkish corpus composed of non-canonical user-input sentences extracted from the web, annotated from the ground up in the normalization, morphology and syntax layers. Both of our new corpora are marked for deep dependencies in order to support future semantic role labeling studies. We do not establish any hierarchy between deep and surface dependencies, and instead employ a basic approach that simply supports multiple heads for a single constituent. Although this notation makes our syntactically annotated sentences incompatible with most syntactic parsers in common use, it is straightforward to remap the multi-headed raw sentences to single-headed projections whenever necessary, so the notation boosts the expressiveness of our syntactic annotation without incurring any loss in applicability. For the parsing tests discussed in later sections, we use two elementary single-head choosing methods as a precursor to smarter head choosing routines that may be developed in future work.
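The remapping from multi-headed sentences to single-headed projections can be illustrated with a minimal Python sketch. The abstract does not say which two head choosing methods were used, so the routines `choose_first_head` and `choose_nearest_head` below are hypothetical stand-ins, and the sentence representation (a mapping from each dependent index to a list of `(head, label)` pairs), the token indices and the label names are likewise assumptions made only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Arc = Tuple[int, str]               # (head index, dependency label); 0 stands for the root
MultiHeaded = Dict[int, List[Arc]]  # dependent index -> every annotated head
SingleHeaded = Dict[int, Arc]       # dependent index -> the one head that is kept


def choose_first_head(dependent: int, arcs: List[Arc]) -> Arc:
    """Hypothetical routine: keep the head that is listed first."""
    return arcs[0]


def choose_nearest_head(dependent: int, arcs: List[Arc]) -> Arc:
    """Hypothetical routine: keep the head closest to the dependent.

    Ties go to the head listed first; the root is treated as position 0.
    """
    return min(arcs, key=lambda arc: abs(arc[0] - dependent))


def project(sentence: MultiHeaded,
            choose: Callable[[int, List[Arc]], Arc]) -> SingleHeaded:
    """Reduce a multi-headed sentence to a single-headed projection."""
    return {dep: choose(dep, arcs) for dep, arcs in sentence.items()}


# Toy sentence with invented indices and labels (not taken from the treebanks).
toy = {
    1: [(4, "SUBJECT"), (2, "SUBJECT")],  # token 1 has two plausible heads
    2: [(4, "MODIFIER")],
    3: [(4, "OBJECT")],
    4: [(0, "PREDICATE")],
}

first = project(toy, choose_first_head)
nearest = project(toy, choose_nearest_head)
print(first[1], nearest[1])
```

In this toy sentence the two routines disagree only on token 1, which is the kind of difference that would produce distinct training sets from the same multi-headed annotation.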
Although constituents are conventionally annotated with a single head in dependency parsing, the practice is not always beneficial, as the clausal structure of a sentence may allow more than one sensible head for a given dependent. In such cases, an automatic parser may predict a meaningful head for a dependent while the gold-standard validation set is annotated with another head that is also meaningful, and the prediction is judged incorrect simply because the two heads do not match. We intend the newly introduced multi-headed representation to also help alleviate the false negatives caused by such scenarios, by means of a new evaluation metric that we call "relaxed evaluation" (as opposed to the conventional "strict evaluation"), which validates predicted dependencies that match any one of the heads designated in the gold standard. After our discussions, we present our detailed empirical investigations on the new treebanks in order to demonstrate the impact of our proposed annotation schemes with respect to the original framework. We perform cross-validation on all of our models and, where appropriate, cross-check the parsing models trained from each combination of training sets and single-head choosing routines against each other. We provide the figures resulting from our parsing tests and discuss their significance in detail. Additionally, we conduct a series of targeted remapping tests in order to make sure that certain annotation scheme changes were indeed well-founded and effective. Furthermore, our experiments indicate that the parsing performance increases we attain are not caused by the reduction of the dependency label set, but are rather related to the more coherent annotation framework prescribed by the new grammar.
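The difference between strict and relaxed evaluation can be sketched as follows. This is only an illustration under stated assumptions: the function name `attachment_scores` is hypothetical, the gold standard is assumed to list every admissible `(head, label)` pair per dependent, and strict evaluation is assumed to compare against the first listed pair as the single designated head. The toy data are invented and do not come from the treebanks.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label); 0 stands for the root


def attachment_scores(gold: Dict[int, List[Arc]],
                      predicted: Dict[int, Arc]) -> Tuple[float, float]:
    """Return (strict, relaxed) labeled attachment scores as fractions.

    Strict: the predicted arc must equal the first gold arc (assumed here
    to be the single designated head).
    Relaxed: the predicted arc may equal any one of the gold arcs.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    strict_hits = sum(1 for dep, arcs in gold.items()
                      if predicted.get(dep) == arcs[0])
    relaxed_hits = sum(1 for dep, arcs in gold.items()
                       if predicted.get(dep) in arcs)
    return strict_hits / len(gold), relaxed_hits / len(gold)


# Invented gold standard: token 1 has two admissible heads.
gold = {
    1: [(4, "SUBJECT"), (2, "SUBJECT")],
    2: [(4, "MODIFIER")],
    3: [(4, "OBJECT")],
    4: [(0, "PREDICATE")],
}

# Invented parser output: token 1 is attached to its second admissible head.
predicted = {
    1: (2, "SUBJECT"),
    2: (4, "MODIFIER"),
    3: (4, "OBJECT"),
    4: (0, "PREDICATE"),
}

strict, relaxed = attachment_scores(gold, predicted)
print(f"strict={strict:.2f} relaxed={relaxed:.2f}")  # strict=0.75 relaxed=1.00
```

Because every strict match is also a relaxed match, the relaxed score can never fall below the strict score on the same predictions.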
Our final tests show that our best model for the IMST attains labeled attachment scores of 75.1% for strict evaluation and 75.7% for relaxed evaluation, surpassing the state-of-the-art parsing score of 65.9% by a large margin. For the best model, cross-validation of the IWT yields 79.7% for strict evaluation and 80.1% for relaxed evaluation. Considering these scores, our new resources yield an improvement of up to nearly 12 percentage points in the performance of parsing web data.
Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2015
Keywords
Natural Language Processing, Corpus Linguistics