Event extraction from Turkish Trade Registry Gazette

Date
2023-05-16
Authors
Demirtaş, İrem Nur
Publisher
Graduate School
Abstract
The Turkish Trade Registry Gazette is the official gazette published by the Union of Chambers and Commodity Exchanges of Türkiye. Companies use it to announce crucial events such as changes in management, changes in capital, or bankruptcy, and many industries rely on it as an important source of information and intelligence. The gazette has a history of almost 70 years, and its issues are publicly available on the internet as image-based PDF files. This format is hard for humans to read and hard for computers to process. In addition, because the gazette is published in a newspaper layout, the text is usually arranged in columns, and in later issues some information is given in tables. Optical character recognition looks like a viable option for text extraction, but it must be supported by image processing. To extract information from the Turkish Trade Registry Gazette, announcements of selected companies published between January 2014 and August 2022 were collected. The collected data consist of PDF documents of the gazette pages for the selected companies, together with related metadata: the issue number, the page number, and the type of announcement the company has on the given page. Text was extracted with an image processing and optical character recognition pipeline and then manually annotated. Because the text is extracted from the whole document, it contains multiple announcements, so announcement boundaries were annotated. Based on the most important and frequent announcement types encountered in the gazette, four event types were defined: Composition with Creditors, Notice to Creditors, Change in Management, and Change in Working Capital. Events consist of triggers that signal the occurrence of the event, event arguments that specify the general and event-specific entities involved in the event, and event roles that define the relations between triggers and arguments.
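The trigger, argument, and role structure described above can be pictured as a small data model. The following is only an illustrative sketch: the class names, the role labels ("person", "position"), the helper `find_span`, and the sample sentence are all invented for this example and are not taken from the thesis or its annotation scheme. Only the four event type names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four event types named in the abstract.
EVENT_TYPES = (
    "Composition with Creditors",
    "Notice to Creditors",
    "Change in Management",
    "Change in Working Capital",
)


@dataclass
class Span:
    """A stretch of announcement text with character offsets."""
    text: str
    start: int
    end: int


@dataclass
class Argument:
    """An entity taking part in an event, linked to the trigger by a role."""
    span: Span
    role: str  # hypothetical role label, e.g. "person"


@dataclass
class Event:
    event_type: str
    trigger: Span
    arguments: List[Argument] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

    def roles(self) -> Dict[str, str]:
        """Map each role label to the text of its argument."""
        return {a.role: a.span.text for a in self.arguments}


def find_span(text: str, phrase: str) -> Span:
    """Hypothetical helper: locate the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found")
    return Span(phrase, start, start + len(phrase))


# Invented sample sentence, not real gazette text.
text = "Jane Doe was appointed as board member."
event = Event(
    event_type="Change in Management",
    trigger=find_span(text, "appointed"),
    arguments=[
        Argument(find_span(text, "Jane Doe"), role="person"),
        Argument(find_span(text, "board member"), role="position"),
    ],
)
print(event.roles())
```

Keeping character offsets on every span makes it possible to map an extracted trigger or argument back to its position in the OCR output.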
Using these definitions, triggers, arguments, and roles were specified and annotated for each event type. An announcement splitting model was trained on the annotated announcement boundaries. After all collected announcements were split with this model, the announcements listed in the metadata table were located in the pages, and an announcement classification dataset with 16 announcement types was created. An announcement classification model was trained on this dataset. Because announcements are documents of varying lengths, the effect of context was examined. The announcement classification model achieves an F1 score of 0.83. For trigger and argument extraction, experiments were carried out in different settings to examine the effects of IOB tags, an added CRF layer, and handling argument and trigger extraction separately. The best-performing model was the two-stage one that uses neither IOB tags nor a CRF layer, with a micro F1 score of 82.5. For event extraction, a rule-based model and Doc2EDAG [1] were explored. Although the rule-based model performs better on simpler event types, Doc2EDAG was found to be better overall, with a micro F1 score of 73.9 on gold arguments and 54.2 on predicted arguments. Four approaches were proposed to improve performance. Of these, removing the CRF layer and applying transfer learning yielded improved micro F1 scores of 74.9 and 75.2 on gold arguments and 60.5 and 62.9 on predicted arguments, respectively. The other two proposed methods, namely turning off path expansion memory and field-aware path expansion, yielded poorer results than the baseline.
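Several of the results above are micro-averaged F1 scores. As a reminder of what that metric measures, the sketch below pools true positives, false positives, and false negatives across classes before computing precision and recall. It is a generic illustration of micro averaging, not the evaluation code used in the thesis; the function name `micro_f1` and the per-class counts are invented for the example.

```python
from typing import Dict, Tuple

# Per-class counts as (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def micro_f1(counts: Counts) -> float:
    """Micro-averaged F1: pool the counts over all classes, then score once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions, so precision and recall are zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts for two of the event types, for illustration only.
example_counts = {
    "Change in Management": (8, 2, 2),
    "Notice to Creditors": (2, 0, 2),
}
score = micro_f1(example_counts)
print(f"micro F1 = {score:.3f}")  # pooled counts: tp=10, fp=2, fn=4
```

Because the counts are pooled, frequent classes weigh more heavily than rare ones. The function returns a value between 0 and 1; multiply by 100 to compare with scores written in percentage form.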
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2023
Keywords
information processing, natural language processing