Labeling of Turkish Texts (Türkçe Metinlerin Etiketlenmesi)
Publisher
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Abstract
It is thought that work in the field of the semantic web will shape the future of the Web, whose document count grows by the day, so that it can be used to its full potential. Given this growth in the number of documents, a sound approach to reaching a desired text is to find the phrases that best represent it. Reaching the phrases that best express a text without reading the whole text is of great importance both for the user and for the crawler. The aim of this study is to label a news text by finding within it the phrases that indicate its subject, predicate, location, and date. To this end, we aim to extract the most dominant subject, predicate, location, and date information from the sentences of the text. The extracted label information represents the topic of the text, so it can be used as tags on the semantic web and by search engines to reach the desired data. Toward this goal, the sentences of the text are first analyzed with a morphological analyzer, in order to reach the stems of words in Turkish, an agglutinative language. Since the morphological analyzer produces more than one analysis for each word, a disambiguator is needed to pick the most probable analysis, and a syntactic parsing step is also required to obtain the syntactic analyses. In our study, a text is first passed through these three analysis stages. In the first part of the thesis work, we attempted labeling by deriving rules from the morphologically and syntactically analyzed texts, but sufficient performance could not be achieved. For this reason, expecting that it could derive some rules we could not, we studied machine learning methods, in particular Conditional Random Fields (CRF), a sequence classifier. Using some of the rules obtained in the rule-based approach together with the analyzer outputs, features were determined for each word in the text. Our CRF model was built from the texts we had annotated by hand and the determined features, and previously unlabeled texts were then labeled with this model. To demonstrate the scientific and technical contribution of this work, the labels of the hand-annotated texts in the test set were compared with the labels produced by the CRF, and our performance was measured in terms of precision, recall, and the F-measure derived from them.
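For reference, the measures named at the end of the abstract have their standard definitions; with TP, FP, and FN counting correctly found, spuriously found, and missed labels respectively,

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}.
\]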
Most current work on extracting significant words from a document relies on keyphrase extraction features. This thesis introduces a new approach: labeling the main subject, main predicate, main location, and main date of an electronic document. The main subject label tells whom or what the document is about; the main predicate label tells what the subject is or does; the main location label tells where the reported events took place; and the main date label tells when they took place. With this methodology, we obtain not only a high-level description of the content but also the function of each extracted phrase within the document. Turkish news articles are selected as the experiment set. To build training and test sets, the labels are assigned manually by human annotators. Different models are then implemented to extract each label automatically, and their output is compared against the manually assigned labels.

As the internet grows dramatically, the number of electronic text documents increases considerably, and with it the importance of information extraction; accordingly, there is a substantial body of research on reaching the information a user needs. This thesis introduces a new approach to information extraction that extracts the main subject, main predicate, main location, and main date of a text document and attaches them as labels for use in semantic web applications. The approach aims at a short summary of a text by means of labeled entities. The most pronounced difference between keyphrase extraction studies and the labeling study presented in this thesis is that our study extracts the most significant phrases together with their functions in the document. Labeling the main subject, main predicate, main location, and main date of news gathered from the web is, to our knowledge, a new field of study introduced in this thesis; our literature survey found that the closest related studies focus on keyphrase extraction.

As training and test data, 200 raw news articles are gathered from the RSS feeds of Turkish news distributors and converted to XML files; all labels are then annotated manually. If an annotator cannot find a label in the document, they enter the label tag with a dash, meaning the document contains no proper value for that label. 150 articles are used as the training set to obtain the best extraction model for each label, and 50 articles are used as the test set to compare the manually annotated labels with the automatically extracted ones.

To decide whether to mark a phrase as a label, the words in the document must be distinguished by means of specified features, and the properties of keyphrases have to be identified. The first feature that comes to mind is frequency, the number of times a phrase appears in the text: more important phrases tend to be used more often. The second is the position of the phrase's first occurrence, since phrases appearing earlier in a document have higher priority for labeling. To extract the candidate phrases themselves, several models can be used, such as named entity recognition or collocation extraction. The subject label indicates whom or what the document is about. Since the experiment set of this study consists of news, the main subject and main location of a text should be proper noun phrases; this assumption was reached after inspecting all manually annotated subject labels.
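As an illustration of the two surface features just mentioned, a minimal Python sketch (the function and its inputs are our own framing, not code from the thesis):

    def surface_features(tokens, phrase):
        # Frequency and relative first-occurrence position of a candidate
        # phrase; tokens and phrase are both lists of word tokens.
        n, m = len(tokens), len(phrase)
        positions = [i for i in range(n - m + 1) if tokens[i:i + m] == phrase]
        frequency = len(positions)                          # how often it occurs
        first_pos = positions[0] / n if positions else 1.0  # earlier -> nearer 0
        return frequency, first_pos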
To obtain proper noun phrases in Turkish, all words starting with a capital letter are gathered first. This heuristic is not correct on its own, because other words may also start with a capital letter: the first word of a sentence, titles, month or day names in dates, and so on. Therefore, all capitalized words and the conjunctions between them are collected, since some proper noun phrases appear in the document as consecutive sequences, and a number of Turkish-specific rules are defined. For example: if a word is the first word of a sentence and is a proper name, it is a possible candidate for a proper noun phrase; if a word starts with a capital letter and is not the first word of a sentence, it is selected as a possible candidate; and if a conjunction stands between two candidates, it is selected as well. These rules are still not enough to segment the selected words into separate proper noun phrases. Consider the Turkish sentence "Mustafa Kemal Atatürk Ankara'ya gitti." Here "Mustafa Kemal Atatürk" and "Ankara" are two different proper noun phrases, yet the rules above select the single phrase "Mustafa Kemal Atatürk Ankara'ya". New boundary rules are therefore defined: for instance, if a candidate word carries certain inflectional suffixes, it is taken as the last word of its phrase; in particular, a candidate bearing the possessive tag "P3sg" ends a proper noun phrase. Even so, these rules are not adequate for selecting the main subject of a text, so Conditional Random Fields (CRF) are used as a machine learning classifier.

Because Turkish is an agglutinative language, with few prefixes and many suffixes, the raw input file is first converted to a file containing the stems, inflectional suffixes, and parser output for each word; labeling a Turkish text is not as easy as labeling an English one. The processing steps of our algorithm are morphological analysis, morphological disambiguation, and dependency parsing. After this preprocessing, the features for the CRF system are selected in the following categories: rule-based features, morphological features, syntactic features, and structural features. Once feature selection is completed, the corresponding features are assigned to the training set, and the CRF model is trained on it. The model is then ready to label any unlabeled news article.

During evaluation, the test set is used to compare the annotators' tags with the CRF's tags, separately for each label. If the phrase in the annotator's tag is exactly the same as the phrase in the program's tag, or one of the two contains the other, the labeling is counted as correct for that phrase. The main concern in this study is precision and recall: how many of the suggested labels are correct (precision), and how many of the manually assigned labels are found (recall). We measure the performance of the algorithm against the labels assigned by the annotators. Errors introduced by the automatic morphological analysis, disambiguation, and dependency parsing should be taken into account when evaluating the results.
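To make the CRF setup described above concrete, here is a minimal sketch; the thesis does not name a CRF toolkit or its exact feature set, so the sklearn-crfsuite library, the feature names, and the BIO-style labels below are our illustrative assumptions:

    # Hypothetical sketch of the sequence-labeling setup; sklearn-crfsuite,
    # the feature names, and the labels are stand-ins, not the thesis's own.
    import sklearn_crfsuite

    def token_features(sent, i):
        word, stem, pos = sent[i]              # (surface, stem, POS) triples
        feats = {
            "stem": stem,                      # morphological feature
            "pos": pos,                        # syntactic feature
            "capitalized": word[0].isupper(),  # rule-based feature
            "rel_position": i / len(sent),     # structural feature
        }
        if i > 0:
            feats["prev_pos"] = sent[i - 1][2]  # left-context feature
        return feats

    def featurize(sent):
        return [token_features(sent, i) for i in range(len(sent))]

    # train_sents / test_sents: sentences as lists of (surface, stem, POS)
    # triples; train_labels: per-token tags such as "B-SUBJ", "I-SUBJ", "O".
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    # crf.fit([featurize(s) for s in train_sents], train_labels)
    # labels = crf.predict([featurize(s) for s in test_sents])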
Another important observation is that using a spell checker can substantially increase parsing accuracy. By combining the linguistic-rules approach with statistical approaches, we were able to achieve our highest document-labeling accuracy.
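The match criterion used in the evaluation above (exact equality, or one phrase containing the other) amounts to a small predicate; this is our paraphrase, not code from the thesis:

    def labels_match(gold, predicted):
        # Exact match, or containment in either direction, counts as correct.
        g, p = gold.strip(), predicted.strip()
        return g == p or g in p or p in g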
Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2012
Subject
natural language processing, information extraction
