Protein Fold Recognition Using Subsequence Profile Maps
Date
2016-05-11
Authors
Halepmollası, Ruşen
Publisher
Fen Bilimleri Enstitüsü
Institute of Science and Technology
Abstract
Knowledge of the 3D structure of proteins, the most fundamental macromolecules of life, plays a key role in bioinformatics studies. Unfortunately, it is very difficult to predict the folding patterns of proteins, which fold within the complex environment of the cell over nanometre distances and on microsecond-to-millisecond timescales. Two proteins can be said to share a fold pattern if they have the same secondary structure elements in the same order and topology. The three-dimensional structure of a protein, which becomes ready to perform its task after folding, must suit its function. Recognizing the protein fold from sequence information and amino acid properties can be regarded as an important step in determining the 3D structures and functions of proteins. When proteins are closely related evolutionarily, sequence-sequence alignment gives good results for detecting similarity. However, when two proteins are structurally very similar but share no sequence similarity, this kind of alignment is not effective. In such cases, it is more effective to predict the protein fold by applying machine learning methods to features extracted from the sequences. To do so, one assumes that proteins in nature are limited in number and works with a fixed number of fold classes. In this study, four datasets that contain a limited number of fold classes and appear frequently in the literature were used. The physicochemical properties of amino acids were used for protein fold recognition. In addition, the subsequence profile map (SPMap) was used for the first time to extract features to which machine learning methods can be applied. The features obtained separately for each fold class were used to predict protein folds with machine learning methods in a two-layer approach. The feature sets derived from amino acid properties and subsequence information were labeled as positive and negative, and binary classification models were trained.
The test dataset was classified with the resulting models. The classification predictions were treated as feature vectors and combined to form a new feature set for the second layer, in which a multiclass classification model was trained. These models were tested on the test set held out at the start of the procedure, and performance was evaluated using accuracy, precision, recall, and F-measure. The proposed system achieved an average accuracy of 71.7% on the DD dataset, 75.7% on the EDD dataset, and 75.15% on the F95 dataset.
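The two-layer scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the thesis trains Random Forest models in R and Weka, whereas this sketch substitutes simple centroid-distance models to stay dependency-free. The class and function names, the toy 2D features, the fold labels "a", "b", "c", and the choice to build second-layer features from first-layer scores on the training data are all hypothetical.

```python
from collections import defaultdict
from math import dist


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


class BinaryFoldScorer:
    """Layer 1 stand-in: one-vs-rest scorer for a single fold class.

    Assumption: a centroid-distance score replaces the Random Forest
    binary classifier named in the abstract.
    """

    def __init__(self, fold):
        self.fold = fold

    def fit(self, X, y):
        # Samples of this fold are "positive", all others "negative".
        pos = [x for x, label in zip(X, y) if label == self.fold]
        neg = [x for x, label in zip(X, y) if label != self.fold]
        self.pos_c, self.neg_c = centroid(pos), centroid(neg)
        return self

    def score(self, x):
        # Positive when x lies closer to the positive centroid.
        return dist(x, self.neg_c) - dist(x, self.pos_c)


class NearestCentroid:
    """Layer 2 stand-in: multiclass nearest-centroid classifier."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append(x)
        self.centroids = {lab: centroid(rows) for lab, rows in groups.items()}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))


def train_two_layer(X, y):
    """Train one binary scorer per fold, then a multiclass model on their scores."""
    folds = sorted(set(y))
    layer1 = [BinaryFoldScorer(f).fit(X, y) for f in folds]
    # Concatenate the binary scores into one feature vector per sample.
    meta_X = [[m.score(x) for m in layer1] for x in X]
    layer2 = NearestCentroid().fit(meta_X, y)
    return layer1, layer2


def predict_two_layer(layer1, layer2, x):
    """Predict the fold of one feature vector through both layers."""
    return layer2.predict([m.score(x) for m in layer1])


# Toy data: three hypothetical folds with two features each.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["a", "a", "b", "b", "c", "c"]
layer1, layer2 = train_two_layer(X, y)
print(predict_two_layer(layer1, layer2, [0.5, 0.2]))
```

The point of the design is that the second layer never sees the raw features: it only sees one score per fold class, so its input dimension equals the number of folds regardless of how many features the first layer consumed.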
Proteins, among the most important macromolecules of life, are responsible for some of the most essential functions in an organism, such as metabolism, transport, and immune response. Analyzing the tertiary structure of a protein is difficult because of the complex structure of the cell. Protein fold recognition helps in understanding that tertiary structure. Before folding, information flows in a single direction from DNA, a linear polymer of four bases (adenine, guanine, cytosine, thymine), to a protein built from 20 different amino acids. This process comprises three stages: DNA replication, transcription, and translation. In DNA replication, two identical copies are produced from the original DNA molecule. In transcription, a particular segment of DNA is copied into single-stranded messenger RNA (mRNA) by the enzyme RNA polymerase. In translation, the mRNA is translated into a specific amino acid chain. Protein structure is generally described at three levels, with a fourth level in some cases. The primary structure is the protein sequence, that is, the amino acid chain. The secondary structure is the first stage of protein folding, in which the chain is arranged into regular structures called α-helices and β-sheets. The tertiary structure is formed by further folding into complex, fixed geometric shapes. Proteins folded into their tertiary structure come together to form the quaternary structure. Information about the 3D structures of proteins, the most fundamental macromolecules of life, plays a key role in bioinformatics studies. The 3D structure of a protein, which is ready to perform its task after folding, must suit its function; misfolding can lead to diseases such as Alzheimer's, Parkinson's, and some types of cancer.
In other words, information about protein structure plays an important role in identifying different types of disease and in improving the effectiveness of new medicines. Protein fold recognition from amino acid sequences plays a critical role in predicting protein structure and function. Knowledge of protein 3D structure is therefore significant for understanding cellular function and for drug design and biomedicine. Unfortunately, it is very difficult to predict the folding patterns of proteins, which fold within the complex environment of the cell over nanometre distances and on microsecond-to-millisecond timescales. Two proteins can be said to share a fold pattern when they have the same secondary structure elements in the same order and topology. Protein fold recognition based on sequence information and amino acid properties is a significant step toward determining the 3D structures and functions of proteins. When proteins are closely related evolutionarily, sequence-sequence alignment gives good results for detecting their similarity. However, when two proteins are structurally very similar but share no sequence similarity, this kind of alignment is not effective. In such cases, fold prediction using machine learning methods on features extracted from the sequences is more effective. Assuming that the number of proteins in nature is limited allows us to work with a fixed number of fold classes. The purpose of this work is to extract specific features from the subsequences and from the physicochemical properties of the amino acids of proteins, and to correctly predict the fold classes of these proteins using machine learning methods. Protein fold recognition is a very difficult problem, both theoretically and practically, because of the complex structure of proteins.
Crystallizing proteins experimentally and analyzing their fold structure is a very hard and expensive process, which is why theoretical studies using computational techniques are needed. Once features have been extracted from the protein sequences, any machine learning method can be employed. In this study, recognition of a query protein sequence is divided into two steps. In the first step, features are extracted from the query sequence. The physicochemical properties of the amino acids are used for protein fold recognition. In addition, the subsequence profile map (SPMap) is used, for the first time in protein fold recognition, to extract features to which machine learning methods can be applied. In the second step, the features extracted for each fold class are used in a two-layer approach to train classifiers that predict the correct fold of the query sequence. In the first layer, the features extracted from SPMap and from the physicochemical properties of the amino acids are labeled as positive and negative. Binary classifiers are then trained on these feature sets, and the test set is classified with the resulting models. The binary classification estimates are treated as feature vectors and combined, creating a new feature set for the second layer, in which multiclass classification models are trained. The developed model was tested on a test set separated at the beginning of the procedure. We used binary classifiers in R in the first layer, and multiclass classifiers in Weka and R in the second layer. Random Forest was used for binary classification, and Support Vector Machine, Random Forest, multiclass classifiers, and an ensemble classifier were tried for multiclass classification. In this work, we used four datasets with a limited number of fold classes.
The first dataset, called the DD set, has been widely used in studies of protein fold recognition, and we use it as a benchmark. It covers the 27 most popular fold classes in the SCOP database and is composed of a training set and a testing set. The training set includes 313 protein domain sequences, and the testing set includes 385. The other three datasets, created from the latest version of SCOP, are called EDD, F95, and F194. EDD (Extended DD) comprises 3397 protein domain sequences from the same 27 fold classes as the DD set. F95 and F194, created to cover more folds, have less than 40% pairwise sequence identity. F95 (Fold 95) comprises 6364 protein domain sequences from 95 folds, and F194 (Fold 194) comprises 8026 protein domain sequences from 194 folds. We used three common metrics, precision, recall, and F-measure, to evaluate the results, and overall accuracy to evaluate overall performance. With the proposed system, average accuracy rates of 71.7% on the DD dataset, 75.7% on the EDD dataset, and 75.15% on the F95 dataset were achieved. In future work, we will use the SCOP database hierarchy to further improve prediction accuracy and will try feature selection methods after feature extraction.
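The evaluation metrics named above follow their standard definitions, which can be sketched as below. The abstract does not say how per-class values were averaged, so this sketch only computes per-class precision, recall, and F-measure plus overall accuracy. The function names and toy labels are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F-measure for one fold class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


def overall_accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted fold equals the true fold."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for three hypothetical folds.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
print(per_class_metrics(y_true, y_pred, "b"))
print(overall_accuracy(y_true, y_pred))
```

In this toy case, fold "b" has two true positives, one false positive, and no false negatives, so its recall is perfect while its precision is 2/3.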
Description
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2016
Keywords
Protein,
Folding,
Protein Folding,
Protein Fold Recognition,
Subsequence,
SPMap,
Bioinformatics