Türkçe Tümcelerin Öğelerinin Bulunması (Finding the Constituents of Turkish Sentences)
Date
2013-08-05
Authors
Coşkun, Nilay
Publisher
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Abstract
Natural language processing studies aim to enable machines to understand the natural languages people use by processing them. With advancing technology, intelligent machines have become an indispensable part of human life. People's expectations of machines rise with each passing day. Communication between machines and humans, like communication between humans, takes place through language. For this reason, machine understanding of natural languages has gained an important place. It makes human life easier in many areas, such as automatic translation of texts from one language to another, question answering systems, summarization, speech systems, and information retrieval. Studies of how language is processed by the human brain show that the brain analyzes what it hears by trying to relate the words to one another. These results can suggest methods to follow when processing language for machines. The human brain also behaves differently depending on the grammar of the language it is analyzing. From this it can be concluded that a study carried out for one language will not work for languages with a different type of grammar. Turkish is an agglutinative language belonging to the Altaic branch of the Ural-Altaic language family. An unlimited number of words can be derived by adding suffixes to the ends of words. In terms of word order it is a free-order language: words can change position within the sentence according to the constituent to be emphasized. The constituents to be emphasized are placed closer to the predicate of the sentence. Some constituents, or parts of constituents, can be dropped from the sentence. Sentence analysis is the determination of the roles of the words in a sentence by identifying the relationships between them. Different methods can be used for sentence analysis. The method proposed and applied in this thesis is to find the noun phrases in the sentence and then find their roles.
The roles that noun phrases take within a sentence are called the constituents of the sentence. A sentence consists of constituents such as subject, predicate, complement, and object. To find the constituents of a sentence, the noun phrases in the sentence are found in the first stage. A noun phrase is the name given to a word or group of words that occur together in a sentence and express a meaning or an object. To find noun phrases, the rules by which words co-occur in Turkish were used. A program written by us to apply these rules marks the noun phrases and the predicate of the sentence. In the second stage, the roles of the noun phrases within the sentence are found. To find the roles, rules describing how constituents occur in a sentence were derived. These rules were defined by taking the different sentence structures of Turkish into account. In terms of structure, Turkish has 4 different sentence types: simple (basit), compound (birleşik), complex (girişik), and sequential (sıralı) sentences. The rules defined for finding constituents differ across sentence types. A tool was written that runs the rules defined to find the roles of the noun phrases within the sentence. The tool marks on the sentence the constituents found by running the rules. The rules defined for finding constituents are not sufficient in some ambiguous cases. Different methods were applied to resolve these cases. A number of validation methods were developed to determine which constituent a candidate is in ambiguous cases. Validation draws on predicate-subject person agreement, the relationships between verbs and subjects, and the constituents with which verbs can be used in a sentence. For validation, a list containing the relationships between verbs and subjects was created.
To find the constituents with which verbs can be used in a sentence, a verb list prepared by the Türk Dil Kurumu (Turkish Language Association) was used, grouped into verbs that take no object, verbs that take the -e case suffix, verbs that take the -de case suffix, and verbs that take the -den case suffix. The method was tested on different corpora. The tests showed that the success of the method varies with how accurately the words in the sentence can be analyzed and with the sentence type. Finding word types correctly is important for the accuracy of the noun phrases. Another result is that rule-based systems are more successful on simple sentences.
Natural language processing (NLP) studies aim to enable machines to process natural languages by analyzing language elements. With improving technology, smart machines have become an indispensable part of human life, and human expectations of them are increasing day by day. Our approach aims to extract the constituents of a Turkish sentence, such as the subject, verb, object, and adverbial phrases. Since Turkish is an agglutinative language with free constituent order, finding such units is an ambiguous problem. To overcome this ambiguity, we propose a rule-based approach. The method we propose and use consists of the following steps: tokenization, morphological analysis, and disambiguation of the morphological analysis results. After these steps, rules for the various Turkish sentence types are applied to discover the role of each constituent. The rules were defined with respect to the structural types of Turkish sentences. We tested our method on every sentence type. The tests indicate that the success of the results depends on the complexity of the sentence structure. Linguistic research suggests that sentences are not mere strings of words but have a structure organized into subgroups of words called constituents. The structural rules that govern the composition of constituents are called grammar. The human brain processes sentences depending on the grammar and on the constituents that form the sentence. This claim makes finding constituents in sentences important for natural language processing research. In structural languages such as English, the relationship between sentence elements relies on word order. Languages like Turkish are agglutinative: inflectional and derivational morphemes are affixed to a root to convey subject-object relationships, and constituent order can change freely.
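The processing steps named above (tokenization, morphological analysis, disambiguation, then rule application) can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the software developed in the thesis: `TOY_LEXICON`, `tokenize`, `analyze`, `disambiguate`, `assign_role`, and `find_constituents` are hypothetical names, the lexicon is written by hand in place of a real morphological analyzer, disambiguation simply takes the first analysis, and every noun phrase is assumed to be a single word.

```python
import re
from typing import Dict, List, Optional, Tuple

Analysis = Dict[str, Optional[str]]

# Hand-written toy lexicon: surface form -> possible analyses.
# A real system would call a morphological analyzer here.
TOY_LEXICON: Dict[str, List[Analysis]] = {
    "çocuk": [{"root": "çocuk", "pos": "Noun", "case": "nom"}],
    "okula": [{"root": "okul", "pos": "Noun", "case": "dat"}],
    "gitti": [{"root": "git", "pos": "Verb", "case": None}],
}

def tokenize(sentence: str) -> List[str]:
    """Step 1: split into word tokens; punctuation is dropped.
    Note: str.lower() does not handle Turkish dotted/dotless i correctly."""
    return re.findall(r"\w+", sentence.lower())

def analyze(token: str) -> List[Analysis]:
    """Step 2: return every candidate analysis for a token."""
    return TOY_LEXICON.get(token, [{"root": token, "pos": "Unknown", "case": None}])

def disambiguate(analyses: List[Analysis]) -> Analysis:
    """Step 3: placeholder disambiguation that keeps the first analysis."""
    return analyses[0]

def assign_role(analysis: Analysis) -> str:
    """Step 4: two deliberately simple role rules for illustration."""
    if analysis["pos"] == "Verb":
        return "Predicate"
    if analysis["case"] == "nom":
        return "Subject"
    if analysis["case"] == "dat":
        return "Indirect Object.Dat"
    return "Unknown"

def find_constituents(sentence: str) -> List[Tuple[str, str]]:
    """Run the four steps and pair each token with its role."""
    return [(tok, assign_role(disambiguate(analyze(tok)))) for tok in tokenize(sentence)]

# "Çocuk okula gitti." (The child went to school.)
print(find_constituents("Çocuk okula gitti."))
```

Keeping each step as a separate function mirrors the staged description in the abstract and lets any stage be swapped for a real component.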
This feature raises several linguistic and computational issues for natural language processing studies on Turkish. Turkish basically has 5 main types and 8 subtypes of sentence structure. Finding verb phrases in different sentence structures was not a challenging issue, but finding the subject needed more work. To overcome this problem, we developed a table that formulates the relationship between verbs and subject types in order to benefit from the semantic relations between them. Communication between machines and humans, like communication between humans, takes place through language. It is therefore important for machines to understand natural languages. Natural language processing is important for language translation, automated question answering, speech systems, and information retrieval systems. Research on language processing in the human brain shows that the brain looks for relationships between the words it hears. These results give ideas for natural language processing studies. The human brain works differently when processing different types of grammar. From this it can be concluded that a method that processes one language will not work for a language with a different grammar. Constituent analysis is a challenging issue for natural language processing research, and different methods have been used in recent work. Nivre introduces dependency grammar and the dependency parser, which plays a fairly marginal role in constituent analysis. Eryiğit, Nivre and Oflazer use dependency grammar in their study of a dependency parser for Turkish. Çetinoğlu introduces a large-scale grammar implementing Lexical Functional Grammar (LFG) for Turkish. The method Çetinoğlu proposes is based on extracting noun phrases according to defined rules and analyzing the extracted units in the grammar. Kutlu's study proposes the first noun phrase chunker for Turkish using a dependency parser. Turkish is an agglutinative language from the Altaic language family.
Attaching suffixes to words can generate an unbounded number of words. Constituents can change position to emphasize a particular constituent; the constituent to be emphasized is placed near the verb. Constituents in a sentence can also be dropped. Sentence analysis defines the roles of words by extracting the relationships between them. Different methods can be applied to analyze sentences. The method proposed and applied in this thesis finds noun phrases and their roles. The roles of constituents in the sentence are subject, verb, object, and indirect object. To find the constituent roles, the noun phrases are extracted first. To find the noun phrases, rules on how words relate to each other in a sentence are defined. Software was developed to apply these rules and find the noun phrases and the verb of the sentence. In syntactic analysis, a constituent is a word or group of words that functions as a single unit in the sentence. The well-known constituents in all languages are the subject, predicate, and object. In Turkish sentences, constituents have different roles depending on the meaning they contribute to the sentence. The subject and the predicate are the main constituents of a sentence. The predicate corresponds to the main verb or to any verb phrase derived from noun phrases. The subject is the constituent that the predicate says something about and is mostly a nominative noun phrase. Subjects may not be fully stated in Turkish sentences. The object is the constituent acted upon by the subject. Object types are determined by the form they take in the sentence: nominative (Direct Object.Nom), accusative (Direct Object.Acc), dative (Indirect Object.Dat), locative (Indirect Object.Loc), or ablative (Indirect Object.Abl). The other constituents are adverbial clauses and prepositional phrases; prepositional phrases may be considered adverbial clauses. An adverbial clause is a dependent clause (DC) functioning as an adverb in the sentence.
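The link between case form and object type described above can be sketched as a lookup table. This is a minimal illustration rather than the thesis's rule set: the `NounPhrase` type, the case tag strings, and the `candidate_roles` helper are hypothetical, and each noun phrase is assumed to arrive with the case of its head word already disambiguated. A nominative noun phrase remains ambiguous between subject and direct object, which is the kind of case that needs further validation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical case tags mapped to the role labels used in the abstract.
CASE_TO_ROLES = {
    "nom": ["Subject", "Direct Object.Nom"],  # ambiguous on case alone
    "acc": ["Direct Object.Acc"],
    "dat": ["Indirect Object.Dat"],
    "loc": ["Indirect Object.Loc"],
    "abl": ["Indirect Object.Abl"],
}

@dataclass
class NounPhrase:
    text: str
    case: str  # case of the head (final) word of the phrase

def candidate_roles(phrase: NounPhrase) -> List[str]:
    """Return the roles a noun phrase could take, judged only by its case."""
    return CASE_TO_ROLES.get(phrase.case, [])

# "Ali kitabı Ayşe'ye verdi." (Ali gave the book to Ayşe.)
# Case analyses are supplied by hand for this example.
phrases = [
    NounPhrase("Ali", "nom"),
    NounPhrase("kitabı", "acc"),
    NounPhrase("Ayşe'ye", "dat"),
]
for phrase in phrases:
    print(phrase.text, candidate_roles(phrase))
```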
Prepositional phrases are the units attached to prepositions. In the second step, the roles of the noun phrases in the sentence are defined. To find the roles, rules are defined with respect to the different structural types of Turkish sentences. Turkish has 4 structural sentence types: simple, compound, complex, and complex-compound. The rules differ between sentence structures. We developed software that finds constituent roles by executing the defined rules and tags the roles in the sentence. The rules are not sufficient for some ambiguous cases, and different methods are applied to disambiguate them. To determine the role of a candidate constituent in ambiguous cases, several validation methods are implemented. Subject-verb agreement, semantic relations between verbs and subjects, and the constituents that can be used together with each verb are considered for validation. To validate the constituents, a list containing the semantic relations between subjects and verbs was created. The constituent forms that can be used with each verb, such as dative, locative, ablative, and accusative, were obtained from the TDK (Türk Dil Kurumu, the Turkish Language Association); these verb groups are also used for validation. The implemented method was tested on different corpora. According to the test results, the correct disambiguation of the words in a sentence and the sentence structure are important factors in the success of the proposed method. The rule-based method is more successful on simple sentences. The method was tested with real-life data from Turkish newspapers. The test data was separated into 3 groups according to sentence structure. The first group includes only simple sentences, and the second group includes complex, compound, and complex-compound sentences. The final group contains all sentence types in random order.
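The validation idea can be illustrated with a toy check that combines person agreement with verb case frames. This is a sketch under stated assumptions, not the implemented method: the entries in `VERB_FRAMES` are a few hand-picked verbs standing in for the TDK verb groups (the actual list is not reproduced here), and `allowed_by_verb`, `agrees`, `validate`, and the person tags are hypothetical names. Exact person matching is a simplification of Turkish agreement.

```python
from typing import Optional

# Illustrative verb frames: the case-marked complements a verb root accepts.
# Hand-picked examples only; not taken from the TDK list itself.
VERB_FRAMES = {
    "uyu": set(),     # "to sleep": takes no object
    "git": {"dat"},   # "to go": okula gitmek
    "otur": {"loc"},  # "to sit, to live": evde oturmak
    "kork": {"abl"},  # "to fear": köpekten korkmak
}

def allowed_by_verb(verb_root: str, case: str) -> bool:
    """True if the verb is listed as accepting a complement in this case.
    Verbs missing from the table are treated as accepting nothing."""
    return case in VERB_FRAMES.get(verb_root, set())

def agrees(subject_person: Optional[str], verb_person: Optional[str]) -> bool:
    """Predicate-subject person agreement, e.g. '1sg' with '1sg'."""
    return subject_person is not None and subject_person == verb_person

def validate(role: str, case: str, verb_root: str,
             subject_person: Optional[str] = None,
             verb_person: Optional[str] = None) -> bool:
    """Accept or reject a candidate role for one noun phrase."""
    if role == "Subject":
        return agrees(subject_person, verb_person)
    return allowed_by_verb(verb_root, case)

# "Ben köpekten korktum." (I was afraid of the dog.)
print(validate("Subject", "nom", "kork", subject_person="1sg", verb_person="1sg"))
print(validate("Indirect Object.Abl", "abl", "kork"))
print(validate("Indirect Object.Dat", "dat", "kork"))
```

A rejected candidate would be passed back to the rules for another role assignment; how the thesis orders these checks is not stated in the abstract.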
The results show that rule-based methodologies are more successful on simple sentences. Table-2 shows the precision, recall, and F-measure values for subjects in simple sentences. According to the values in the table, the accuracy is 0.84. The test results indicate that accuracy decreases with increasing sentence length and structural complexity. The results are promising, since they can be improved with new rules developed together with Turkish linguists.
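For reference, precision, recall, and F-measure for one constituent label, such as the subject, can be computed as in the sketch below. The gold and predicted label sequences are invented for illustration and are not data from the thesis, `prf` is a hypothetical helper, and the balanced F1 form is assumed because the abstract does not say which F-measure was used.

```python
from typing import Sequence, Tuple

def prf(gold: Sequence[str], predicted: Sequence[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one label over aligned label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correctly tagged
    fp = sum(1 for g, p in pairs if g != label and p == label)  # tagged but wrong
    fn = sum(1 for g, p in pairs if g == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented example: one role label per noun phrase, gold versus system output.
gold = ["Subject", "Object", "Predicate", "Subject", "Object"]
predicted = ["Subject", "Subject", "Predicate", "Object", "Object"]
print(prf(gold, predicted, "Subject"))
```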
Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2013
Keywords
natural language processing,
morphological analysis,
disambiguation,
noun phrases