Tagging and normalization of Turkish temporal expressions

Date
2021-07-29
Authors
Uzun, Ayşenur
Publisher
Graduate School
Abstract
Research on extracting information from unstructured text holds an important place in natural language processing. Alongside structural tasks such as stemming, part-of-speech tagging, and dependency parsing, work on information extraction has gained importance in recent years. Normalizing the semantic information detected in a text into a structured form is important so that it can be used effectively in various natural language processing tasks. Temporal expression tagging and normalization holds an important place among information extraction systems. Expressions that convey information such as time, duration, frequency, or interval about the events in a text (e.g. bugün "today", iki ay sonra "two months later", 19 Temmuz'da "on 19 July", her hafta "every week") are called temporal expressions. Detecting temporal expressions and normalizing them according to a specified standard is a widespread research area, particularly for languages such as English, Spanish, German, Chinese, and Arabic. In the literature, many temporal expression tagging and normalization systems have been presented for these languages, and datasets whose temporal expressions were annotated manually or automatically have been published. Semantic evaluation workshops have been organized to evaluate these systems on the datasets. To the best of our knowledge, no structured corpus with annotated temporal expressions has been published for Turkish to date. Moreover, during our literature review we did not encounter any system that performs Turkish temporal expression detection and normalization end to end. In this thesis, we developed a rule-based temporal expression tagging and normalization system that can be regarded as a foundational work in Turkish temporal expression extraction and normalization; it is the first end-to-end system and also incorporates Turkish morphological structure.
For use in developing and testing the system, the temporal expressions in 109 news texts were annotated manually. The dataset developed within the thesis has been made publicly available for use in future research. The developed system was run on the published test dataset. System performance was measured using the precision and recall formulas used in temporal expression tagging studies. Temporal expressions in the text were detected with an F1 score of 89%, while the normalization of the "type" and "value" attributes of correctly detected expressions achieved F1 scores of 89% and 88%, respectively. In future work, Turkish temporal expression tagging and normalization studies with higher performance can be carried out by taking into account the shortcomings and recommendations discussed in the error analysis and system limitations sections.
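For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the thesis' evaluation script; the function name and the idea of counting matches as true positives, false positives, and false negatives are assumptions made for illustration, and the abstract does not state whether matching is strict or relaxed.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw match counts.

    Hypothetical helper: how a predicted expression is counted as a
    match (exact span vs. overlapping span) is left to the caller.
    """
    predicted = true_positives + false_positives   # expressions the system tagged
    gold = true_positives + false_negatives        # expressions in the annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The zero guards only avoid division errors on empty inputs; they do not reflect any convention stated in the source.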
Extracting information from unstructured data is an active research topic in natural language processing. Identifying unstructured semantic information and normalizing the identified expressions allow that information to be used effectively in many NLP applications. In named entity recognition (NER) systems, temporal expression identification is treated as a subcategory of entities with the Date type, and it has recently attracted significant attention in information retrieval systems. The first underlying reason for treating temporal expression tagging as a main task separate from named entity recognition was to improve question answering performance on temporal questions, for instance "When was Einstein born?" or "When was Ataturk elected as the first president?", because temporal question answering was becoming a non-trivial task in information retrieval systems. A temporal expression is a word or phrase that conveys information about the occurrence, repetition, or elapsed time of an event or action. A temporal expression can be absolute, for example 12.01.2021, or relative to the document creation time, such as next month. Most studies in temporal expression tagging concentrate on detecting five temporal types: Date, Time, Duration, Interval, and Set. After a temporal expression is identified, its information should be converted to a structured form so that it is more useful in downstream NLP applications. This transformation step created the need for a normalization standard for detected temporal expressions. Almost all studies on temporal expression extraction and normalization rely on TimeML, a standard markup language for marking temporal expressions, events, and the relations between these events. TimeML provides 7 different tag schemes for annotating and normalizing events and temporal expressions.
However, in this study we focus only on detecting and normalizing temporal expressions, excluding events and relations. Therefore, only the TIMEML and TIMEX3 tags are used in this work. Temporal expression identification and normalization systems have been proposed for many languages, for instance English, Spanish, German, Chinese, and Arabic. Several distinct approaches have been taken to temporal expression tagging so far. The earliest works were rule-based; hybrid methods, which combine rule-based systems with machine learning or scheme-based systems, were proposed afterwards. Recent temporal expression identification and normalization systems take advantage of deep learning to build multilingual or more accurate systems. To develop, train, and test temporal expression tagging models, various manually or automatically annotated datasets and lexicons have been proposed in many languages. Evaluation workshops are organized to evaluate temporal expression tagging systems with specified evaluation metrics. However, no temporal expression tagger or temporally annotated corpus has been proposed for the Turkish language so far. In this study, we developed a morphology-aware rule-based temporal expression identification and normalization system, ITUTime, which is able to detect and normalize the Date, Time, Duration, Interval, and Set types. Our system relies on morphology-aware regular expression-based rules operating in the local context. It does not require a morphological analyzer or a part-of-speech (POS) tagger. Turkish is a morphologically rich language, so morphological analysis tools are practical yet costly for Turkish language processing. Instead of using a morphological analyzer and a morphological disambiguation tool to preprocess the text, we created a simple lookup table of possible inflections.
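As an illustration of how a suffix lookup table can stand in for morphological analysis, the sketch below matches a day-month date in both its nominative and inflected forms. It is not ITUTime's rule set: the suffix list is a small hypothetical subset, and the names SUFFIXES, DATE_RE, and find_dates are assumptions introduced here.

```python
import re

MONTHS = ["Ocak", "Şubat", "Mart", "Nisan", "Mayıs", "Haziran",
          "Temmuz", "Ağustos", "Eylül", "Ekim", "Kasım", "Aralık"]

# Hypothetical, heavily reduced lookup table of case suffixes
# (locative, ablative, dative, genitive variants).
SUFFIXES = ["da", "de", "ta", "te", "dan", "den", "tan", "ten",
            "a", "e", "ya", "ye", "ın", "in", "un", "ün"]

# Longest suffixes first so that "dan" is tried before "da".
_suffix_alt = "|".join(sorted(SUFFIXES, key=len, reverse=True))

DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2}) (?P<month>" + "|".join(MONTHS) + r")"
    r"(?:['’]?(?P<suffix>" + _suffix_alt + r"))?\b"
)

def find_dates(text):
    """Return (matched text, day, month, suffix) for each day-month date."""
    return [
        (m.group(0), m.group("day"), m.group("month"), m.group("suffix") or "")
        for m in DATE_RE.finditer(text)
    ]
```

For instance, `find_dates("Toplantı 19 Temmuz'da yapılacak.")` captures the whole inflected form `19 Temmuz'da` and reports `da` as the suffix, while a bare `19 Mart` is matched with an empty suffix.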
Most rule-based temporal identification and normalization systems require tokenization as a preprocessing step, and their rules run over token patterns. Our temporal expression identification rules instead run directly over the free text rather than tokens, which frees the proposed system from any preprocessing step. Due to its rule-based structure and the lack of any preprocessing phase, our system is not able to process non-canonical text. ITUTime has 4 main submodules: text number normalization, detector, normalizer, and text number restoration. Initially, the text number normalization module uses regular expressions to convert numbers written in words to their numerical representations, e.g. converting two days to 2 days. To be able to reconstruct the original input text at the output layer, these rules keep the original word-based representations as markup attributes. In the detector component, a set of nested rules for each temporal type is executed sequentially to detect temporal expressions in the text and determine their TIMEX3 types. In total, 67 complex regular expression rules are defined in the detector module. Only textual regular expression rules are used to extract temporal expressions. In these rules, we define a capturing group that contains all possible inflectional suffixes in order to detect both nominative and inflected nouns. Once a temporal expression is identified by a rule, it is not modified later by compositional or filtering rules, and its TIMEX3 type is set to the type of the rule set. Therefore, applying the rules in a specific order is crucial for extracting temporal expressions correctly in our approach. To produce correct outputs, we apply the rules in the following order: Interval, Time, Date, Duration, and Set.
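The pipeline described above (number normalization with the original words kept as markup attributes, ordered detection, number restoration) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the thesis' implementation: the `<NUM text="...">` markup, the five one-line rules, the tiny number lexicon, and every function name are hypothetical, and inflected unit nouns are deliberately not handled.

```python
import re

# Hypothetical mini-lexicon; the real module presumably covers far more.
NUMBER_WORDS = {"bir": "1", "iki": "2", "üç": "3", "dört": "4", "beş": "5"}
UNITS = r"(?:gün|hafta|ay|yıl)\b"   # day, week, month, year (nominative only)

# Convert a number word only when a time unit follows it.
NUM_WORD_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b(?=\s+" + UNITS + ")")
# A digit string, optionally wrapped in the (assumed) NUM markup.
NUM = r'(?:<NUM text="[^"]*">)?\d+(?:</NUM>)?'

# Toy rules in the application order given in the abstract.
RULES = [
    ("INTERVAL", re.compile(r"\b\d{4}\s*-\s*\d{4}\b")),
    ("TIME", re.compile(r"\bsaat \d{1,2}[.:]\d{2}\b")),
    ("DATE", re.compile(r"\b(?:bugün|yarın|dün|\d{4})\b")),
    ("DURATION", re.compile(NUM + r"\s+" + UNITS)),
    ("SET", re.compile(r"\bher\s+" + UNITS)),
]

def normalize_numbers(text):
    """Replace number words with digits, keeping the word as an attribute."""
    return NUM_WORD_RE.sub(
        lambda m: f'<NUM text="{m.group(1)}">{NUMBER_WORDS[m.group(1)]}</NUM>', text)

def detect(text):
    """Apply rules in order; a span claimed by an earlier rule is never re-tagged."""
    claimed = []
    for ttype, rule in RULES:
        for m in rule.finditer(text):
            if all(m.end() <= s or m.start() >= e for s, e, _ in claimed):
                claimed.append((m.start(), m.end(), ttype))
    out, pos = [], 0
    for s, e, ttype in sorted(claimed):
        out.append(text[pos:s])
        out.append(f'<TIMEX3 type="{ttype}">{text[s:e]}</TIMEX3>')
        pos = e
    out.append(text[pos:])
    return "".join(out)

def restore_numbers(text):
    """Put the original number words back in place of the NUM markup."""
    return re.sub(r'<NUM text="([^"]*)">\d+</NUM>', r"\1", text)

def tag(text):
    return restore_numbers(detect(normalize_numbers(text)))
```

In this sketch the ordering matters for an input such as `2018-2019 sezonu`: the Interval rule claims the whole range first, so the year pattern in the Date rule cannot split it into two separate dates. The type labels simply reuse the rule-set names from the abstract.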
The aim of the normalization module is to construct a structured representation of the unstructured time expressions captured by the detector module. This component is composed of two rule sets: (1) a set for exact temporals and (2) a set for relative temporals. Normalizing exact temporals seems relatively easy; however, expressions such as 2020'nin son çeyreği (last quarter of 2020) and bu yılın son ayında (in the last month of this year) make this task non-trivial. Normalizing relative temporals with respect to the document creation time is also challenging. To cover the possible relative temporal constructions, we created distinct rules that consider the surrounding keyword clues. At the end of the normalization procedure, each temporal is annotated with a TIMEX3 tag. Finally, we apply a post-processing step to restore the numbers in words that were initially converted in the text number normalization module. To the best of our knowledge, no temporally annotated dataset has been published for the Turkish language. To build a temporal expression dataset in Turkish, we crawled 109 news articles, published between the 3rd and 19th of March 2018, from a daily Turkish newspaper. The collected articles belong to distinct categories: economy, politics, breaking news, movies, travel, and music. The dataset was annotated manually, following the TIMEX3 format described in the TimeML 1.2.1 guideline. We split the dataset into two parts: a development split (87 articles) and a test split (22 articles). In total, we annotated 109 articles containing 32,600 tokens and ended up with 1,147 TIMEX3 tags. Our tagger achieved an 89% F1-score in detection, and 89% and 88% F1-scores in normalizing the "type" and "value" TIMEX3 attributes, respectively, according to the precision and recall calculation commonly used in temporal extraction systems.
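Normalizing a relative temporal with respect to the document creation time can be sketched as below. This is an illustrative simplification rather than ITUTime's normalizer: it handles only a few day- and week-based patterns, resolves everything to day granularity (an assumption), and the names RELATIVE_DAYS, normalize_relative, and to_timex3 are hypothetical.

```python
import re
from datetime import date, timedelta

# Keyword clues mapped to day offsets from the document creation time (DCT).
RELATIVE_DAYS = {"dün": -1, "bugün": 0, "yarın": 1}   # yesterday, today, tomorrow

# "<n> gün/hafta sonra/önce" = "<n> days/weeks later/ago"
OFFSET_RE = re.compile(r"^(\d+) (gün|hafta) (sonra|önce)$")

def normalize_relative(expr, dct):
    """Return an ISO 8601 date for a relative expression, or None if unsupported."""
    if expr in RELATIVE_DAYS:
        return (dct + timedelta(days=RELATIVE_DAYS[expr])).isoformat()
    m = OFFSET_RE.match(expr)
    if m:
        days = int(m.group(1)) * (7 if m.group(2) == "hafta" else 1)
        sign = 1 if m.group(3) == "sonra" else -1
        return (dct + timedelta(days=sign * days)).isoformat()
    return None

def to_timex3(expr, dct, tid="t1"):
    """Wrap a supported expression in a TIMEX3 tag with its normalized value."""
    value = normalize_relative(expr, dct)
    if value is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    return f'<TIMEX3 tid="{tid}" type="DATE" value="{value}">{expr}</TIMEX3>'
```

With a document creation time of `date(2018, 3, 3)`, the expression `yarın` resolves to the value `2018-03-04`. Month- and quarter-based expressions such as the examples in the abstract need calendar arithmetic that this sketch leaves out.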
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2021
Keywords
Unstructured data, Natural language processing, Normalization
Citation