Developing morphology disambiguation and named entity recognition for Amharic

dc.contributor.advisor Tantuğ, Ahmet Cüneyd
dc.contributor.author Jibril, Ebrahim Chekol
dc.contributor.authorID 504122515
dc.contributor.department Computer Engineering
dc.date.accessioned 2025-02-14T11:24:35Z
dc.date.available 2025-02-14T11:24:35Z
dc.date.issued 2024-11-01
dc.description Thesis (Ph.D.) -- Istanbul Technical University, Graduate School, 2024
dc.description.abstract Morphological disambiguation is defined as the process of selecting the correct morphological analysis for a given word within a specific context. Developing natural language processing (NLP) applications is very challenging without effective morphological disambiguation. Semitic languages, including Arabic, Amharic, and Hebrew, present increased challenges for NLP tasks due to their complex morphology. Named Entity Recognition (NER) plays a crucial role as a preliminary phase in various downstream tasks such as machine translation, information retrieval, and question answering. It is an essential component of information extraction, used to identify proper names and temporal and numeric values in open-domain text. The NER task is particularly difficult for Semitic languages because of their highly inflected nature. In this research, new datasets for developing word embeddings and performing morphological disambiguation are collected, and a relatively large dataset is annotated and made publicly available. Multiple Amharic named entity recognition systems are constructed utilizing contemporary deep learning techniques, including transfer learning with RoBERTa, a transformer-based model. Additionally, Bidirectional Long Short-Term Memory (BiLSTM) models are employed and integrated with a conditional random field (CRF) layer to enhance performance. A BiLSTM model is also developed specifically for morphology disambiguation using a newly prepared dataset. The Synthetic Minority Over-sampling Technique (SMOTE) is utilized to address the imbalance in class distribution within the datasets. The study achieves state-of-the-art results for Amharic named entity recognition, attaining an F1-score of 93% with RoBERTa, and achieves an accuracy of 90% for morphology disambiguation. In Chapter 1, the dissertation establishes the context by introducing the Amharic language and its significance within Ethiopia and the broader Semitic language family.
It outlines the primary challenges associated with processing Amharic texts due to its rich morphological structure and orthographic variations. Key research questions and objectives are formulated, with a focus on advancing NLP capabilities for Amharic through improved morphological disambiguation and NER. Chapter 2 provides a comprehensive overview of Amharic's linguistic properties, emphasizing its script and morphological complexity. The lack of capitalization and extensive character set are identified as challenges for NLP. This chapter provides the foundational understanding necessary for developing tools capable of handling Amharic's unique linguistic features. In Chapter 3, the focus is on developing effective models for Amharic morphology disambiguation. Related work from other Semitic languages is reviewed, and the construction of relevant datasets is detailed. The application of BiLSTM models to tackle Amharic's morphological properties is described, along with the experimental setup and evaluation metrics, which demonstrate significant improvements in accuracy. Chapter 4 addresses the challenges of performing NER in Amharic, a task complicated by the language's morphological and orthographic features. A new, extensively annotated Amharic NER dataset is introduced. The chapter evaluates various model architectures, including BiLSTM-CRF and RoBERTa, a transformer-based model, and discusses the resulting enhancements in model performance. In Chapter 5, the research findings are synthesized, emphasizing the dissertation's contributions to advancing Amharic NLP through the development of high-performing models and comprehensive datasets. Recommendations for future work include enhancements in tools for spelling correction and further expansion of NER datasets to improve the system's capabilities. 
Through this comprehensive approach, the dissertation significantly contributes to the field of computational linguistics for low-resource languages, offering novel insights and methodologies that can be adapted for other morphologically rich languages.
dc.description.degree Ph.D.
dc.identifier.uri http://hdl.handle.net/11527/26451
dc.language.iso en_US
dc.publisher Graduate School
dc.sdg.type Goal 9: Industry, Innovation and Infrastructure
dc.subject Amharic
dc.subject Amharca
dc.subject morphology disambiguation
dc.subject morfolojik belirsizlik
dc.title Developing morphology disambiguation and named entity recognition for Amharic
dc.title.alternative Amharca morfolojik belirsizliği giderme ve adlandırılmış varlık tanıma geliştirilmesi
dc.type Doctoral Thesis
Files
Original bundle
Name: 504122515.pdf
Size: 1.28 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.58 KB
Format: Item-specific license agreed upon to submission