ISTANBUL TECHNICAL UNIVERSITY ⋆ GRADUATE SCHOOL

VISUALIZATION BASED ANALYSIS OF GENE NETWORKS USING HIGH DIMENSIONAL MODEL REPRESENTATION

M.Sc. THESIS

Pınar GÜLER (702211009)

Department of Computational Science and Engineering
Computational Science and Engineering Programme

Thesis Advisor: Asst. Prof. Dr. Süha TUNA

JULY 2024

Pınar GÜLER, an M.Sc. student of ITU Graduate School, student ID 702211009, successfully defended the thesis entitled “VISUALIZATION BASED ANALYSIS OF GENE NETWORKS USING HIGH DIMENSIONAL MODEL REPRESENTATION”, which she prepared after fulfilling the requirements specified in the associated legislations, before the jury whose signatures are below.

Thesis Advisor : Asst. Prof. Dr. Süha TUNA, Istanbul Technical University
Jury Members : Assoc. Prof. Dr. Burcu TUNGA, Istanbul Technical University
Assoc. Prof. Dr. Yelda TARKAN ARGÜDEN, Istanbul University - Cerrahpaşa

Date of Submission : 24 May 2024
Date of Defense : 1 July 2024

To my spouse,

FOREWORD

Completing this master’s thesis has been a rewarding journey for me, fostering both personal and professional growth. I extend my sincere gratitude to my advisor, Dr. Süha TUNA, for his unwavering support, guidance, and motivating encouragement throughout this academic endeavour.
Additionally, I express deep gratitude to my spouse, my family and friends, whose presence and encouragement have been a constant source of strength and inspiration, sustaining me through every phase of this process.

July 2024
Pınar GÜLER

TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET
1. INTRODUCTION
  1.1 Purpose of the Thesis
  1.2 Literature Review
2. METHODS
  2.1 Chaos Game Representation
  2.2 VARCH
  2.3 High Dimensional Model Representation
  2.4 Support Vector Machines
3. IMPLEMENTATION
  3.1 mTOR Dataset
  3.2 Numerical Experiments
    3.2.1 Balanced dataset
    3.2.2 Imbalanced dataset
4. RESULTS
5. CONCLUSION
REFERENCES
CURRICULUM VITAE

ABBREVIATIONS

DNA : Deoxyribonucleic Acid
RNA : Ribonucleic Acid
SNP : Single Nucleotide Polymorphism
GWAS : Genome-Wide Association Studies
CGR : Chaos Game Representation
HDMR : High-Dimensional Model Representation
SVM : Support Vector Machines
RBF : Radial Basis Function
BCE : Binary Cross-Entropy
PPI : Protein-Protein Interaction
ChIP : Chromatin Immunoprecipitation
PRS : Polygenic Risk Score
LD : Linkage Disequilibrium
1D-CNN : 1D Convolutional Neural Network
2D-CNN : 2D Convolutional Neural Network
VLC : Variant Logic Constructor
BioGRID : Biological General Repository for Interaction Datasets
RICTOR : Rapamycin-Insensitive Companion of mTOR
mTOR : Mammalian Target of Rapamycin
KEGG : Kyoto Encyclopedia of Genes and Genomes
NCBI : National Center for Biotechnology Information
mTORC1 : Mammalian Target of Rapamycin Complex 1
mTORC2 : Mammalian Target of Rapamycin Complex 2
STRING : Search Tool for Retrieval of Interacting Genes

SYMBOLS

A : Adenine
T : Thymine
C : Cytosine
G : Guanine
T^i_α : The x-axis element in frequency matrix of VARCH
T^h_β : The y-axis element in frequency matrix of VARCH
⊥ : Variant Logic Operator
+ : Variant Logic Operator
− : Variant Logic Operator
⊤ : Variant Logic Operator
k-mer : Substrings of length k in a sequence
N : The length of DNA Sequence in VARCH Method
Len : The length of a subsequence in VARCH Method
m : The number of subsequences in VARCH Method
H : Principal HDMR Tensor
K : Linear Kernel of SVM
s_i : HDMR support vectors
h_0 : 0-dimensional (scalar) HDMR component
h_1, h_2, h_3 : 1-dimensional (vector) HDMR components

LIST OF TABLES

Table 2.1 : The relationships in variant logic construction.
Table 2.2 : The relationship between subsets and values.
Table 4.1 : VARCH results for varying Len and m parameters (Tα = 9, Tβ = 6).
Table 4.2 : Classification accuracy performances of Tα and Tβ pairs (Len = 20, m = 20).
Table 4.3 : VARCH (Len = 10, m = 160) and CGR (n = 5) results of 400 patient/control data.
Table 4.4 : VARCH (Len = 10, m = 160) and CGR (n = 5) results of 100 patient/400 control data.

LIST OF FIGURES

Figure 2.1 : 32×32 CGR images of mTOR gene: GSK3B (top-left), RICTOR (top-right), SLC38A9 (bottom-left), MTOR (bottom-right).
Figure 2.2 : CGR process of genomic sequence data.
Figure 2.3 : Subsegmentation of the genomic sequence data.
Figure 2.4 : VARCH images of varying Tα and Tβ values (Len = 10, m = 200).
Figure 2.5 : Binary classification using SVM linear kernel.
Figure 4.1 : Flowchart of the proposed Gene Network Analysis Process.
Figure 4.2 : Classification accuracies depending on k-mer parameter for CGR and HDMR (400 patient/control data).
Figure 4.3 : Varying effects of VARCH’s sequence length on accuracy, F1 score, recall and BCE loss.

VISUALIZATION BASED ANALYSIS OF GENE NETWORKS USING HIGH DIMENSIONAL MODEL REPRESENTATION

SUMMARY

Genetic studies have revolutionized our understanding of the biological mechanisms underlying health and disease.
By exploring the intricate details of the human genome, researchers can identify genetic variations that contribute to various phenotypic outcomes. One of the key advancements in this field is gene network analysis, which examines the complex interactions between genes and how they regulate cellular processes. This approach provides a comprehensive view of biological systems and uncovers the pathways involved in disease mechanisms. Genome-Wide Association Studies (GWAS) play a pivotal role among the methodologies used in gene network analysis. GWAS involve scanning the genome for small variations, known as single nucleotide polymorphisms (SNPs), that occur more frequently in individuals with a particular disease or trait than in those without. By identifying these associations, GWAS help pinpoint genetic factors contributing to disease susceptibility and progression, paving the way for personalized medicine and targeted therapeutic strategies. By integrating various variant analysis techniques, researchers can develop a deeper understanding of the genetic architecture of diseases, leading to significant advancements in diagnostics, treatment, and prevention.

Gene network and pathway analyses are essential components of genetic studies, offering insights into the complex interactions and functions of genes within biological systems. However, both face significant computational challenges, particularly when dealing with high-dimensional genomic data. Analyzing vast datasets containing gene expression profiles and genetic variations demands sophisticated computational methods capable of handling their scale and complexity. Conventional statistical methods are often insufficient on their own, so more advanced computational approaches such as data visualization, network modeling, and machine learning algorithms are required.
In addition, the complexity of biological networks and pathways complicates the analysis further, necessitating powerful computational tools to interpret regulatory mechanisms and simulate complex biological processes correctly. Overcoming these challenges is crucial for gaining deeper insights into gene networks and pathways, thereby advancing our understanding of their roles in health and disease.

In pathway analysis, scientists use data collected from many sources, such as GWAS, to identify target genes and connect them to known pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. However, pathway analysis presents major computational challenges, especially when large, high-dimensional genomic datasets are involved. To overcome these challenges, researchers have developed innovative methods such as High Dimensional Model Representation (HDMR), Chaos Game Representation (CGR), and VARCH, a visual analysis method for DNA sequences based on variant logic construction. By mapping genetic sequences into visual representations, these approaches can help identify potential genetic markers and better understand biological processes. These computational methods must be included in gene network and pathway investigations to fully understand the complex architecture of genetic interactions and how they affect health and disease.

In this thesis, we harnessed three computational methodologies, each offering a distinct contribution to variant analysis: Chaos Game Representation, VARCH, and High Dimensional Model Representation. CGR, a prevalent technique in bioinformatics, translates genetic sequences into visually interpretable diagrams, clarifying complex structures and patterns in the sequences.
VARCH, on the other hand, converts sequences into a feature space, capturing their complexity and uncertainty. These techniques are effective instruments in our search for potential genetic markers that might help distinguish between the patient and control groups in our investigation. Furthermore, we used HDMR for dimension reduction, an essential technique for simplifying the complex structure of high-dimensional genomic data. By condensing the data dimensions, HDMR facilitated more efficient and accurate classification, enabling us to uncover subtle genetic relationships and patterns that might otherwise have remained hidden. Integrating these computational techniques provided robust solutions for analyzing genetic data from the mTOR pathway, enriching our comprehension of the genetic mechanisms underlying various phenotypic outcomes.

In our study, we set out to deepen our comprehension of the intricate genetic patterns associated with diverse phenotypic outcomes. Focusing on genetic data from the mTOR pathway, we leveraged state-of-the-art computational methodologies to uncover hidden insights. Our primary objective was to assess the efficacy of CGR, VARCH, and HDMR in gene network analyses. Both CGR and VARCH demonstrated notable accuracy in genetic classification, with VARCH exhibiting a significant edge over CGR in terms of accuracy and sensitivity metrics. This superiority was underscored by VARCH’s considerably lower binary cross-entropy (BCE) loss values, which indicate fewer errors in its predictions. We also examined the computational overhead associated with each methodology in detail, providing insight into the trade-off between computational complexity and accuracy.
Although VARCH has more parameters to tune and outperformed CGR, its computational requirements were evident. Our study demonstrates the potential of computational tools for unraveling genetic complexity, and it also serves as a reminder that computational constraints must be handled carefully, which can guide researchers in method selection and optimization.

YÜKSEK BOYUTLU MODEL GÖSTERİLİM KULLANILARAK GEN AĞLARININ GÖRSELLEŞTİRME TABANLI ANALİZİ

ÖZET

Genetic studies have brought revolutionary advances in understanding the biological mechanisms underlying diseases. These advances play an important role in personalizing healthcare based on genetic information. By exploring the complex mechanisms of the human genome, researchers can detect genetic variations that contribute to various phenotypic outcomes. These discoveries are critical for understanding the genetic basis of diseases and, in turn, for the early detection of individuals in risk groups and for developing new treatment strategies. One of the key advances in this field is gene network analysis. This approach examines the complex interactions between genes and how they regulate cellular processes. By revealing the relationships among genes and the effects of these relationships on biological functions, gene network analyses provide a more comprehensive biological understanding and thereby uncover the pathways involved in disease mechanisms. Genome-wide association studies (GWAS) play a key role among the methods used in gene network analysis. GWAS is a powerful tool for identifying genetic associations from large-scale datasets. It involves scanning the genome for small variations, single nucleotide polymorphisms (SNPs), that occur more frequently in individuals who have a particular disease or a genetic trait associated with it.
By distinguishing these associations, GWAS identifies the genetic factors that contribute to disease susceptibility and progression, and it lays the groundwork for personalized medicine and targeted therapies. Personalized medicine makes it possible to develop more effective treatment approaches based on individuals’ genetic profiles. Combining different genetic analysis techniques helps researchers better understand the genetic architecture of diseases, which leads to significant advances in diagnosis, treatment, and prevention. This integration determines how genetic information can be used in clinical practice and forms a basis for innovative solutions in healthcare. Gene network analysis and pathway analysis are indispensable components of genetic studies and explain the complex interactions and functions of genes within biological systems. These methods provide in-depth knowledge in genetic research and play a major role in illuminating the molecular basis of diseases. However, both methods face significant computational challenges, especially when dealing with high-dimensional genomic data. These challenges arise from the complexity of the data and the size of the datasets, and they make the data analysis process considerably more complicated. Analyzing large datasets containing gene profiles and genetic variations requires advanced computational methods that can handle their scale and complexity. Rather than traditional statistical methods, it calls for complex computational approaches such as data visualization, network modeling, and machine learning algorithms. These methods allow the data to be analyzed and interpreted more effectively.
Understanding the complexity of gene networks and gene regulation mechanisms and accurately simulating complex biological processes require powerful computational tools; overcoming these challenges allows a deeper understanding of the roles of gene networks and pathways in health and disease. Accurate modeling and analysis of gene networks is critically important, especially in the study of complex diseases. In pathway analysis, researchers identify target genes using data obtained from various sources such as GWAS and associate these genes with known pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. The KEGG database is an important resource for understanding biological pathways and their relationships with genes. However, pathway analysis studies face significant computational challenges, especially when working with large, high-dimensional genomic datasets. To overcome these challenges, innovative methods involving the visual analysis of DNA sequences have been developed; these include Chaos Game Representation (CGR) and VARCH, a visual analysis method for DNA sequences based on variant logic construction. Visual representation of genetic sequences helps identify potential genetic signaling pathways and better understand biological processes. Visual analysis methods make complex genetic data easier to understand and offer new perspectives in genetic research. Integrating these computational methods into gene network and pathway analyses is vital for fully understanding the complex structure of genetic interactions and their effects on diseases. In this way, more precise and accurate results are obtained in genetic research. Within the scope of this thesis, two visualization methods and one data decomposition method were used, each contributing to our gene network analysis. These methods are powerful tools that enable genetic data to be better understood and analyzed.
We used Chaos Game Representation, VARCH (visual analysis of DNA sequences based on variant logic construction), and High Dimensional Model Representation (HDMR). CGR is a widely used method in bioinformatics that translates genetic sequences into visually interpretable diagrams, clarifying the complex structures and patterns in the sequences. By visualizing genetic sequences, CGR reveals structural features in large datasets. VARCH, on the other hand, converts gene sequences into a feature vector and successfully captures their complex structure. VARCH enables detailed analysis of genetic data, allowing researchers to detect important patterns in gene sequences. Both methods were effective in our search for potential genetic features that can distinguish the patient and control groups. In this way, we were able to identify genetic markers specific to certain disease phenotypes more easily. In addition, we used HDMR for dimension reduction, an important technique for simplifying the complex structure of genes. HDMR is an effective method for reducing data size in high-dimensional datasets while preserving important information. Applied to the N-dimensional tensors obtained from the visualization methods for the mTOR dataset, HDMR provided a marked gain in computational efficiency while also showing high classification performance. This technique accelerated the research process by enabling fast and accurate analysis of large datasets. It also made it possible to uncover hidden genetic relationships and patterns. As a result, in-depth analysis of the genetic data yielded findings showing that patient and control data can be classified with high accuracy.
The integration of these computational techniques offered consistent solutions for analyzing mTOR pathway genetic data and improved the understanding of the genetic mechanisms underlying various phenotypic outcomes. In conclusion, these methods allowed us to carry out deeper and more comprehensive analyses in genetic research and to better grasp the effects of genetic variations on diseases. This thesis aims to examine computational methods that contribute to a deeper understanding of the complex genetic patterns associated with various phenotypic outcomes and that deliver high classification performance. To this end, current computational techniques were used to analyze genetic data related to the mTOR pathway in order to obtain more detailed results. Focusing on genetic data from the mTOR pathway, current computational methods were used to reveal the characteristics of the genes. The main objective is to evaluate the effectiveness of the CGR, VARCH, and HDMR methods in gene network analyses by examining how well each one represents the genetic data and how accurate the resulting analyses are. When the genetic data were analyzed with a classification method, important results were observed. Both CGR and VARCH showed notable accuracy in genetic classification. These accuracy rates are an important achievement for correctly detecting and classifying genetic variations. On both the balanced and the imbalanced datasets, VARCH provided a clear advantage over CGR in terms of accuracy and sensitivity metrics. This advantage allows genetic data to be analyzed in a more precise and detailed manner. It was further underscored by VARCH’s ability to substantially reduce error rates.
In particular, VARCH may play a critical role in the early diagnosis of genetic diseases and in the development of treatment strategies. We also examined in detail the computational costs associated with each method and provided insight into the trade-off between computational complexity and accuracy. This trade-off helped us determine how efficient the methods can be in practical applications and which method is more suitable when working with large datasets. Although VARCH has more parameters to optimize and performs better than CGR, its computational requirements were clearly demonstrated. This study demonstrates the potential of computational tools for unraveling genetic complexity while emphasizing the importance of carefully handling computational constraints, and it guides researchers in selecting and optimizing the most suitable method.

1. INTRODUCTION

Human genetic data, including all of an individual’s genetic information in DNA, are the key to understanding the underlying principles of heredity, genetic variation, and disease susceptibility. Human genome data is the entire collection of genetic information encoded in a human organism’s DNA. It contains all the genes (coding sections) and the non-coding DNA regions, such as regulatory sequences and repetitive patterns, that make up the whole genome [1]. In short, human genetic data focuses on particular variants or sequences within the genome, whereas human genome data includes an individual’s or species’ complete genetic blueprint. Multimodal data collection is a critical component in unraveling the complicated fabric of human biology, providing essential insights into the genetic drivers of many phenotypic characteristics, complex illnesses, and fundamental biological processes [2].
Studies of human genetic and gene data investigate the genetic composition of individuals or groups to understand the complexities of genetics and genomics and their implications for health and disease. These extensive studies explore genetic data to identify distinct genetic variants, such as mutations or single nucleotide polymorphisms (SNPs), and their relationships to specific phenotypes, illnesses, or behaviors [3]. Complex genotyping techniques, high-throughput genomic sequencing, and intricate bioinformatics analyses are just a few of the sophisticated tools that researchers have used to carefully examine the unique structure, functionality, and diversity of the human genome.

Genome-Wide Association Studies (GWAS) are a prominent method for identifying genetic variations linked to complex disorders. GWAS rose in popularity in the early 2000s and have since become a widely used method for locating genetic variations linked to both common and complex disorders [4] [5]. GWAS examine genetic variants across the genome to determine potential relationships between these variants and the likelihood of acquiring particular characteristics or diseases. To discover information about the genetic components of complex disorders, GWAS use SNPs to locate and map genetic variations linked to various traits and diseases. For that purpose, researchers have used large databases of SNPs to investigate gene-gene and gene-environment interactions and to evaluate associations between genetic variations and diseases. GWAS have proven helpful in locating shared genomic variations associated with different traits and illnesses [5] [6]. The knowledge gained from these initiatives also opens the door to personalized medicine strategies that aim to tailor medical interventions and treatments to each patient’s unique genetic profile.
This improves healthcare delivery and enhances patient outcomes.

In genomics research, pathway analysis is performed to examine the connections and interactions between genes and the outputs they produce within biological pathways [7]. These pathways are sets of chemical reactions or signaling cascades that control several aspects of the cell, including metabolism, the cell cycle, and the immune system. In pathway analysis, genes are categorized according to their biological roles or their participation in particular pathways. This arrangement enables scientists to evaluate the collective activity of genes within a pathway and to comprehend the potential effects of genetic variants or changes in gene expression on biological processes. Pathway analysis frequently integrates omics data such as gene expression, SNP data, or protein-protein interactions to identify dysregulated pathways linked to diseases.

The growing number of GWAS has necessitated advances in the computing techniques and processes used to analyze large amounts of data. This parallel development has been driven by the increasing intricacy of genetic research and the urgent need for greater analytical precision. For these reasons, advancements in computational techniques and computers have been a significant factor in the success of human genetic research. Computational developments have sped up the processing and interpretation of enormous volumes of genomic data, starting with early computational models and statistical methods and continuing with high-performance computing and machine learning techniques [8] [9] [10]. The development of genome browsers, data visualization platforms, and bioinformatic tools has facilitated the finding of new information in human genetics [11] [12]. Additionally, these technologies have simplified data mining and integration.
Furthermore, open-access data repositories and computational tools have expedited scientific discoveries and encouraged collaborative research, contributing to the globalization of genetic data access.

Initially, GWAS used traditional statistical techniques to determine the relationships between genetic variations and phenotypic characteristics. These approaches usually rely on hypothesis testing with methods such as logistic regression or chi-squared tests [13] [14]. However, as datasets became larger and more complicated, traditional statistical methods found it difficult to handle the sheer volume of genetic data and to account for confounding factors such as population stratification and multiple-testing corrections. In GWAS, logistic regression with LASSO regularization is commonly used for SNP preselection and has proven effective in classifying patients, particularly those with Crohn’s Disease [15]. This involves using penalized regression algorithms like LASSO (L1 penalty) and ridge regression (L2 penalty) to model phenotypes as a linear sum of genetic variants, with regularization in place to limit coefficient magnitude. Ensemble learning algorithms such as Random Forest play a valuable role in genetic data analysis by capturing complex genetic relationships and identifying significant variants [16] [17]. In addition, recent studies have highlighted the effectiveness of neural networks in GWAS. For instance, CNN architectures are used to predict phenotypes [18] [19], and dense neural networks are used to distinguish between patients and controls [15]. These methods show potential in identifying significant genetic markers and predicting disease phenotypes.

1.1 Purpose of the Thesis

Variant analyses are essential in biomedical research because they help us comprehend various biological processes and disorders.
Researchers can gain critical insights into the genetic foundation of illnesses and responsiveness to therapies by analyzing an individual’s genetic composition. Furthermore, genome analysis makes it possible to identify genetic variants, mutations, and biomarkers linked to the development or risk of disease. Personalized medicine benefits greatly from this knowledge, as it enables medical professionals to customize interventions and therapies based on a patient’s genetic profile. Additionally, variant analysis aids in the discovery of new therapeutic targets and the creation of targeted medicines, both of which eventually enhance patient outcomes.

GWAS often face several challenges. Firstly, the significant computational costs, especially with multidimensional data, can be an obstacle. Genetic data analysis usually involves numerous variables, leading to computational complexities that require substantial resources and time. Secondly, the high dimensionality of genetic datasets exacerbates these issues. The immense volume of data increases the computational burden and can obstruct efficient analysis. Additionally, these factors may affect classification accuracy in variant analyses. The complexity of the data and the computational constraints can impair the effectiveness of classification algorithms, so that samples are assigned to the wrong groups and classification accuracy is reduced. Addressing these issues is crucial for improving the effectiveness, accuracy, and scalability of variant analysis techniques. Ultimately, this will enhance our understanding of genetic phenomena and their impacts on health and disease.

In this thesis, we utilized two genetic visualization tools and a multidimensional data decomposition method called High Dimensional Model Representation to classify gene sequences into patient and control groups.
The aim was to better understand the genetic patterns associated with different phenotypic outcomes. The visualization tools enabled us to identify potential biological indicators or genetic signatures that distinguish between the two groups by providing insights into the underlying structure of the genome data. The multidimensional data decomposition technique helped manage the complexity of the genome datasets, making our classification study more accurate and efficient. The combined use of these techniques facilitated the examination and categorization of gene sequences, advancing the field of genetics and paving the way for personalized therapy and disease diagnosis.

As the first visualization technique of the proposed method, we used the Chaos Game Representation (CGR). This method allowed us to create CGR images by adjusting the resolution according to the chosen k-mer value [20]. We aligned the CGR images of each individual within their respective patient and control groups, which resulted in two tensors, one for each group. These 3-D tensors served as the basis for further genetic data analysis, aiding in identifying trends and differences between the groups.

The second visualization technique we utilized is the VARCH method [21]. This method transformed the input gene sequences into VARCH images based on four parameters that control the accuracy of the representations. We tested whether these parameters could enhance accuracy compared to CGR. Similar to CGR, VARCH produced two tensors, each representing a group. These tensors were also used for further comparative analysis and for exploring genetic patterns related to different phenotypic outcomes.

In both methods, we used the High Dimensional Model Representation (HDMR) for data decorrelation, dimension reduction, and feature extraction tasks [22]. After applying HDMR, we transformed the individual images into three vectors and combined them into a single vector for each participant.
This resulted in one feature vector for each sample in the patient and control groups. We then employed the Support Vector Machines (SVM) algorithm on these feature vectors to classify individuals into their groups based on the extracted genetic data.

Finally, we analyzed the feature matrices using the SVM algorithm and compared the classification accuracy of the visualization methods. We also fine-tuned the method parameters to observe their impact on the results. Besides classification accuracy, we considered other machine learning metrics such as F1 score, sensitivity, and binary cross-entropy (BCE) loss. We evaluated computational costs throughout these processes, aiming to assess the performance of the visualization methods in terms of classification accuracy and their efficacy in capturing relevant patterns and optimizing model performance.

The results of the analyses are discussed in detail in Section 4. This section also explores the implications of these findings and suggests potential areas for future research. Section 5 focuses on addressing the limitations of the proposed method and suggesting directions for further investigation, which involve refining the methodologies used, exploring additional parameters or techniques, and broadening the scope of analysis to fill any remaining gaps or uncertainties. In highlighting these areas for improvement and future exploration, the study aims to contribute to ongoing advancements in the field and inspire further research.

1.2 Literature Review

Advancements in genomic research, mainly through GWAS, have significantly enhanced our understanding of genetic variations and their impact on human health and diseases. GWAS focus on identifying genetic variations associated with diseases. These studies analyze genetic markers, especially SNPs, across the entire genome to uncover how these variations may influence disease risk [3] [4].
The methodology of GWAS involves comparing the genomes of individuals with a specific disease to those without it, aiming to detect correlations between specific genetic variants and disease presence. This approach helps pinpoint specific genetic variants linked to diseases. To ensure the reliability of the associations, GWAS requires analyzing many individuals. One of the strengths of GWAS is its use of statistical methods to determine the significance of genetic associations. Furthermore, GWAS often includes diverse populations to capture the full spectrum of genetic variations across different ethnic groups, enhancing the generalizability of the findings [4].

The implications of GWAS are profound for precision medicine. Findings from these studies contribute to risk prediction, improve diagnostic accuracy, and pave the way for personalized treatment strategies. By understanding the genetic basis of diseases, medical professionals can develop more targeted therapies tailored to an individual’s genetic makeup, thereby increasing efficacy and reducing side effects.

Despite the transformative potential of GWAS, it has its own set of difficulties. A significant challenge is the need for large sample sizes to detect meaningful genetic associations, especially for complex diseases influenced by multiple genetic and environmental factors. Additionally, understanding the complexities of genetic interactions and their contributions to diseases remains challenging, as these interactions often involve multiple genes and pathways.

To better understand the SNPs that are the subject of GWAS, note that SNPs are common genetic variations that occur when a single nucleotide (A, T, C, or G) at a specific position in the genome differs among individuals in a population. They are the most abundant form of genetic variation in the human genome and play a significant role in determining an individual’s traits, susceptibility to diseases, and response to medications.
The interrelationship between SNPs and pathway analysis is crucial for understanding the genetic basis of complex characteristics and disorders. Pathway analysis is a technique that elucidates how genetic variations impact biological processes. This is achieved by mapping SNPs to genes and subsequently performing gene-set enrichment analysis [3]. By considering the cumulative effects of many SNPs within pathways, this technique significantly improves statistical power and sheds light on the biological processes driving genetic relationships. Key points about SNPs include:

• Genetic Variation: SNPs represent variations in a single nucleotide within the DNA sequence.
• Biological Significance: SNPs can influence gene expression, protein function, and overall phenotype. They are associated with various traits, diseases, and drug responses.
• GWAS: SNPs are commonly used as genetic markers in GWAS to identify associations between specific genetic variants and diseases or traits.
• Population Genetics: SNPs can vary in frequency across different populations, providing insights into genetic diversity and evolutionary relationships.
• Clinical Applications: Understanding SNPs is crucial in personalized medicine, as specific SNPs can impact an individual’s risk of developing related diseases or their response to specific treatments.

Gene networks, also called gene regulatory networks, represent a comprehensive assembly of genes and the interactions that dictate the expression levels of genes within a cell or organism [19]. These networks show the connections among genes, transcription factors, and regulatory molecules and how their interactions affect gene expression. They provide insights into the complex regulatory mechanisms that control cellular functions such as development and response to environmental triggers.
By mapping these relationships, researchers can better understand how genes work together to sustain biological functions and how changes to the network result in disease. With their comprehensive perspective on gene regulation, gene networks are crucial for understanding molecular pathways in biology and diseases.

Several types of datasets underpin gene network-based approaches. Protein-protein interaction (PPI) databases such as BioGRID, IntAct, and STRING document physical interactions among proteins and are valuable for understanding protein functions and cellular processes [23]. Gene expression datasets are derived from microarray or RNA sequencing studies and determine co-expression relationships among genes, which suggest potential regulatory connections [24]. Transcription factor binding datasets obtained from chromatin immunoprecipitation (ChIP) experiments disclose regulatory interactions between transcription factors and their target genes [25]. MicroRNA target databases contain predicted and experimentally validated interactions between microRNAs and target genes, providing insight into post-transcriptional gene regulation [26]. Disease-specific datasets incorporate information on genetic mutations, gene expression alterations, and phenotypic changes. These facilitate investigating how gene networks are disturbed in disease conditions.

Scientists use datasets from diverse experimental strategies to construct and analyze gene networks. Integrating and analyzing these diverse datasets is pivotal in constructing comprehensive gene regulatory networks that help unravel the complexities of gene regulation and cellular processes. Various computational and statistical methods are employed in this endeavor.
Network inference algorithms, such as Bayesian networks and mutual information-based methods, leverage gene expression data to infer regulatory interactions between the genes under consideration [27]. Clustering algorithms aid in identifying functional modules within the network by grouping genes with similar expression patterns. Network topology analysis entails studying the structural properties of the network, such as centrality measures and degree distribution, to comprehend its organization and dynamics. Pathway analysis identifies biological pathways enriched with genes of interest, providing crucial insights into the functional roles of gene networks [28]. By combining these methodological approaches, researchers can explain the regulatory relationships within gene networks, uncover vital regulatory mechanisms, and understand how genes coordinate their activities to govern cellular processes. Gene networks thus serve as essential tools in interpreting the molecular mechanisms supporting biological processes and diseases, offering a holistic perspective on gene regulation within biological systems.

Against this background, it is now well established that many common disorders are polygenic, yet the specific combinations of genetic variants involved remain unidentified [29]. An individual’s genetic risk for a particular trait or disease is measured by polygenic risk scores (PRS), commonly calculated by summing the weighted effect sizes of risk alleles at multiple genetic variants [30]. The search for those variants and their combined effects presents significant challenges due to the many possible combinations. In addition, incomplete genetic variant information and imprecise effect sizes can affect the accuracy of PRS. Moreover, using tagging SNPs instead of causal variants limits the precision of PRS.
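The weighted-sum definition of a PRS mentioned above can be written compactly as follows. The symbols are assumptions introduced here for illustration and do not appear in the cited works in this exact form: g_ij denotes the number of risk alleles (0, 1, or 2) carried by individual i at variant j, β̂_j denotes the estimated effect size of variant j, and M is the number of variants included in the score.

```latex
\mathrm{PRS}_i \;=\; \sum_{j=1}^{M} \hat{\beta}_j \, g_{ij},
\qquad g_{ij} \in \{0, 1, 2\}
```

Under this form, both an incomplete variant set (a smaller M) and imprecise estimates of the effect sizes propagate directly into the score, which is the sensitivity discussed in this paragraph.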
Given that the current catalog of genetic variants associated with various disorders is incomplete, substantial improvements are necessary to enhance the accuracy of PRS calculations. New methodologies are being developed to address these challenges, such as the Bayesian method LDpred [31], which incorporates Linkage Disequilibrium (LD), and SBayesR [32], which uses Bayesian multiple regression on summary statistics. Ongoing research aims to improve PRS for disease risk prediction and individualized healthcare strategies by strengthening their clinical utility and dependability.

As mentioned, datasets in gene studies are generally subjected to statistical analysis. The techniques involved include population stratification control, association testing, meta-analysis, imputation, mixed-effects models, and data quality control [8] [33] [34] [35] [36] [37]. While these methods have yielded remarkable results, they have limitations: population stratification risks false associations, multiple testing increases false positives, small samples hinder detection, missing heritability suggests incomplete understanding, focusing on common variants may overlook rare ones, and analyzing gene-environment interactions is complex.

The complexity of the analysis and its computational cost necessitate exploring new solutions, especially considering the size of the genome data. Converting genetic sequences into 2D images is a promising way to address the challenges posed by multidimensional computations and data complexities in genome studies. These techniques represent genetic sequences as visual images, simplifying the computational analysis and interpretation of complex genomic data. Several studies have highlighted the efficacy of this approach, demonstrating that transforming genetic sequences into 2D images can lead to acceptable classification accuracies [38].
Researchers have analyzed patient/control genotype data as transformed images using 1D Convolutional Neural Network (1DCNN) and 2D Convolutional Neural Network (2DCNN) architectures. The ability of these neural network algorithms to recognize characteristics and patterns in images allows for accurate genome data classification [19].

High Dimensional Model Representation (HDMR) is a powerful tool for managing and analyzing complex systems. This thesis uses this method to handle the 2D images obtained from gene sequences. HDMR has been extensively applied in engineering, environmental sciences, finance, and biological systems [39]. In gene networks, HDMR enables comprehensive and network-specific gene variant analyses. Prior work has demonstrated the effectiveness of HDMR as a method for analyzing genetic data and established a foundation for future applications of this approach to diploid sequencing data [40].

2. METHODS

2.1 Chaos Game Representation

The Chaos Game Representation approach uses ideas from chaotic systems and nonlinear dynamics to represent genetic sequences visually [20]. The Chaos Game algorithm assigns the DNA bases to the vertices of a square and produces an elaborate pattern known as an attractor. These attractors reveal the basic structural features inherent in DNA sequences, making it possible to observe both local and global sequence patterns. CGRs exhibit self-similar patterns that recur at different scales. This self-similarity highlights gene sequences with repeating motifs. Furthermore, every point on a CGR represents a distinct DNA subsequence, making it easier to identify sequence trends. The CGR is constructed through the following steps:

1. Initialization: In a two-dimensional Cartesian coordinate system, the origin (0, 0) is where the Chaos Game Representation (CGR) begins.
2. Nucleotide Assignment: Adenine (A), thymine (T), guanine (G), and cytosine (C) are assigned specific coordinates.
Adenine (A) is linked to (-1, 1) and is positioned in the upper-left quadrant. Thymine (T) is represented by (1, -1) in the lower-right quadrant. Guanine (G) occupies (1, 1) and is in the upper-right quadrant. Cytosine (C), found in the lower-left quadrant, is designated as (-1, -1).
3. Iterative Process: The nucleotide sequences of genes are processed iteratively using the CGR algorithm. As with k-mer tables or forward Markov chain analysis, this is frequently performed in reverse order. For each nucleotide, the halfway point between the current position and the coordinate assigned to that nucleotide is located, and the new location is marked by moving to this halfway point.
4. Repetitive Process: This procedure is repeated for each nucleotide in the gene sequence until all nucleotides have been processed.

Figure 2.1 : 32×32 CGR images of mTOR genes: GSK3B (top-left), RICTOR (top-right), SLC38A9 (bottom-left), MTOR (bottom-right).

Figure 2.1 shows example CGR images of genes belonging to the mTOR gene network.

A subsequence of length k in an RNA or DNA sequence is known as a k-mer, an essential building block for analyzing and comprehending genomic sequences in computational biology and bioinformatics. The choice of k sets the resolution of the representation, so DNA subsequences of different lengths can be examined, which provides information on the composition and structure of the sequences. Furthermore, the correlation between CGR points and sequences shows that mathematical representations of CGR reflect features of the underlying DNA sequence, making it easier to analyze sequential and statistical qualities [41].

Figure 2.2 : CGR process of genomic sequence data. [Schematic: sequence data for the 31 genes of the mTOR dataset (Control 1 to Control 400, Patient 1 to Patient 400), conversion of each mTOR sequence to a CGR image, and stacking of the CGR images into CGR tensors.]

In our study, we systematically explored the k-mer lengths k ∈ {3, 4, 5, 6, 7, 8, 9} and evaluated their impact on the analytical outcomes. To achieve this, we assessed the classification accuracy for the control and patient groups. By analyzing the performance of different k-mer lengths, we aimed to identify the optimal parameter that enhances the accuracy of distinguishing between these groups. This comprehensive evaluation involved comparing the classification results to determine how variations in k-mer length influence the overall effectiveness of the analysis.

After determining the optimal k-mer length, a CGR tensor was constructed by aligning the CGR images extracted from all sequences within the patient group. The same process was repeated for the control group. Thus, our analysis yielded two distinct CGR tensor classes, representing the patient and control groups. As represented in Figure 2.2, after the conversion of the genomic sequences into FASTA format, CGR images were produced based on the designated k-mer value. The CGR images corresponding to each group, control and patient, were then aligned in identical order to form the CGR tensors.

2.2 VARCH

VARCH is a novel method that uses a variant map, which is essential for constructing variant logic and visualizing DNA sequences. This approach counts the frequency of base combinations inside each DNA subsequence, making the visualization process easier to understand.
These combinations are then used to create a matrix that is projected onto a two-dimensional coordinate plane to represent DNA sequences visually. Compared to traditional techniques, the VARCH methodology provides a more intuitive and easily interpretable representation of DNA sequence properties [21].

A new logical construct, Variant Logic, was developed to broaden the scope of Boolean functions. It incorporates four fundamental relationships, which signify the primary types of change patterns: 0 to 0, 0 to 1, 1 to 0, and 1 to 1 [42]. In the Variant Logic Constructor (VLC), these relationships are represented by the meta-operators {⊥, +, −, ⊤}, which correspond to the meta change patterns listed in Table 2.1 [21].

Table 2.1 : The relationships in variant logic construction.

Input   Output   Truth   False   Variant   Invariant   Operator
0       0        0       1       0         1           ⊥
0       1        1       0       1         0           +
1       0        0       1       1         0           −
1       1        1       0       0         1           ⊤

As shown in Table 2.2, the four meta-operators can be combined to give 16 distinct combinations, which provides a more thorough understanding. Each combination reflects a different application of the meta-operators and provides a strong foundation for examining and extending the logical domain of Boolean functions. Thanks to this extensive collection of combinations, the VLC can be used to precisely characterize and manipulate logical connections and change patterns.

Table 2.2 : The relationship between subsets and values.

Name   Value                    Base
T0     ∅                        ∅
T1     N⊥                       NC
T2     N+                       NA
T3     N−                       NT
T4     N⊤                       NG
T5     N⊥ + N+                  NC + NA
T6     N⊥ + N−                  NC + NT
T7     N⊥ + N⊤                  NC + NG
T8     N+ + N−                  NA + NT
T9     N+ + N⊤                  NA + NG
T10    N− + N⊤                  NT + NG
T11    N⊥ + N+ + N−             NC + NA + NT
T12    N⊥ + N+ + N⊤             NC + NA + NG
T13    N⊥ + N− + N⊤             NC + NT + NG
T14    N+ + N− + N⊤             NA + NT + NG
T15    N⊥ + N+ + N− + N⊤        NC + NA + NT + NG

The primary constraint in implementing this method is the inconsistency in the lengths of the data sequences. It is a common observation, not just within our investigation but also across various species, that the lengths of DNA sequences vary substantially when multiple genes are analyzed collectively to yield a comprehensive variant analysis. This inconsistency in data length presents a significant challenge in achieving a reliable and accurate analysis.

In the subsequent steps of our study, as we examine the 16 unique combinations and look for accurate matches, we partition the DNA sequences into subsequences of equal length. This division aims to achieve a more balanced and uniform data distribution for the further analysis. Following the methodology adopted in the original study introducing the VARCH method, a single division process is employed to derive first-degree subsequences of equal length. By doing so, we aim to overcome the inherent challenges of variable data length, thereby improving the reliability and accuracy of our analytical outcomes.

If the length of a DNA sequence is represented by N, the length of a subsequence by Len, and the number of subsequences by m, the relationship between them can be written as in Equation 2.1.

N = Len × m    (2.1)

Figure 2.3 : Subsegmentation of the genomic sequence data. [Schematic: first-order subsegmentation of a sequence into the 1st, 2nd, 3rd, ..., m-th subsequences.]

The gene sequences of every participant in our study were subjected to this procedure, which guarantees that the total sequence length is divided evenly into subsequences of length Len, as seen in Figure 2.3. This methodological technique makes accurate comparisons and thorough analyses possible by enabling a uniform and consistent study across all DNA sequences.
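The first-order subsegmentation of Equation 2.1 can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the function name split_into_subsequences is hypothetical, and dropping any trailing bases that do not fill a complete subsequence is an assumption made here so that N = Len × m holds exactly.

```python
def split_into_subsequences(sequence, sub_len):
    """Split a DNA sequence into m consecutive, non-overlapping
    subsequences of length sub_len (first-order subsegmentation)."""
    if sub_len <= 0:
        raise ValueError("sub_len must be a positive integer")
    # Number of complete subsequences, so that N = Len * m after trimming.
    m = len(sequence) // sub_len
    # Assumption: trailing bases that do not fill a whole subsequence are dropped.
    return [sequence[i * sub_len:(i + 1) * sub_len] for i in range(m)]


# The first 21 bases of the sequence drawn in Figure 2.3, split with Len = 7.
example = "ACTGAAACCCGTCAATATGGC"
subsequences = split_into_subsequences(example, 7)
print(subsequences)
```

With Len = 7, the three pieces match the 1st, 2nd, and 3rd subsequences drawn in Figure 2.3.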
After obtaining the subsequences from the main dataset, we can proceed to the calculation of the matrix Mat(A). This is achieved by using the subset values associated with each subsequence. There are 16 unique configurations, denoted T0 to T15, and m subsequences derived from the original data. Consequently, the resulting matrix Mat(A) ∈ R^{m×16} has dimensions m × 16, where m follows from Equation 2.1. This configuration was chosen because it allows for a comprehensive and complete data representation. It captures the full range of possible combinations within the subsequences, thus providing a solid foundation for further analysis and interpretation of the data.

$$
\mathrm{Mat}(A) =
\begin{bmatrix}
T_0^1 & T_1^1 & \cdots & T_j^1 & \cdots & T_{15}^1 \\
T_0^2 & T_1^2 & \cdots & T_j^2 & \cdots & T_{15}^2 \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
T_0^i & T_1^i & \cdots & T_j^i & \cdots & T_{15}^i \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
T_0^m & T_1^m & \cdots & T_j^m & \cdots & T_{15}^m
\end{bmatrix}
\tag{2.2}
$$

Equation 2.2 shows how the matrix Mat(A) compiles all potential combinations. The task at hand is to choose a pair Tα and Tβ that will be used to construct the visualization matrix Mat(B), based on the accuracy findings. The challenge lies in determining which combination of α and β most effectively captures the patterns and information embedded within the DNA sequence. Given the complexity of this task, a comprehensive approach was taken: the matrix Mat(B) was created for every possible pair. The performance of these pairs was then evaluated in terms of classification accuracy, providing a quantifiable measure of their effectiveness. After careful analysis of the accuracy, F1 score, recall, and loss results, the Tα and Tβ pairing that provided the highest performance was selected. Once this pair was identified, the process moved on to the next step.
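The construction of Mat(A) described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the code used in this study: the name build_mat_a and the SUBSETS list are hypothetical, N_X is taken to be the number of occurrences of base X within a subsequence, T0 is taken to be 0 for the empty set, and T13 follows the Value column of Table 2.2 (N⊥ + N− + N⊤, that is C, T, and G).

```python
import numpy as np

# Base subsets for T0..T15, using the mapping of Table 2.2:
# ⊥ -> C, + -> A, − -> T, ⊤ -> G. T0 is the empty set.
SUBSETS = [
    "", "C", "A", "T", "G",
    "CA", "CT", "CG", "AT", "AG", "TG",
    "CAT", "CAG", "CTG", "ATG", "CATG",
]


def build_mat_a(subsequences):
    """Return an (m x 16) integer matrix whose entry (i, j) is the summed
    count, in subsequence i, of the bases belonging to subset T_j."""
    mat_a = np.zeros((len(subsequences), len(SUBSETS)), dtype=int)
    for i, sub in enumerate(subsequences):
        for j, bases in enumerate(SUBSETS):
            # Assumption: N_X is the occurrence count of base X in the subsequence.
            mat_a[i, j] = sum(sub.count(base) for base in bases)
    return mat_a


# Three subsequences of length Len = 7 (m = 3), as drawn in Figure 2.3.
subsequences = ["ACTGAAA", "CCCGTCA", "ATATGGC"]
mat_a = build_mat_a(subsequences)
print(mat_a.shape)
print(mat_a[0])
```

Because T15 sums the counts of all four bases, its column equals Len for every subsequence that contains only A, C, G, and T, which gives a quick sanity check on the matrix.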
This next step involved recording the frequency information into Mat(B), as represented in Equation 2.3. Here, frequency information refers to how many times a (Tα, Tβ) pair is present in the corresponding elements; the order and the number of matches associated with the selected (Tα, Tβ) pair in the resulting Mat(A) are also documented. This comprehensive approach allows us to capture an accurate and detailed representation of the gene sequence.

$$
\mathrm{Mat}(B) =
\begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1h} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2h} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
b_{i1} & b_{i2} & \cdots & b_{ih} & \cdots & b_{im} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
b_{m1} & b_{m2} & \cdots & b_{mh} & \cdots & b_{mm}
\end{bmatrix}
\tag{2.3}
$$

where b_{ih} is the frequency value of the coordinate (T_α^i, T_β^h). The coordinate (T_α^i, T_β^h) denotes that the two configurations align with specific elements within the matrix Mat(A), such as (T_α^1, T_β^1), (T_α^m, T_β^m), ..., (T_α^i, T_β^h), with 1 ≤ i ≤ m and 1 ≤ h ≤ m [21].

As an example, Figure 2.4 shows VARCH images of the mTOR gene sequences, rendered as heatmaps, for varying Tα and Tβ values. The images show that there is a symmetry between the four bases and the 16 configurations.

Figure 2.4 : VARCH images of varying Tα and Tβ values (Len = 10, m = 200).

Our primary focus is to maximize the amount of vital information that can be extracted in this step. Doing so can significantly support generalization performance, leading to more accurate and reliable results. In light of this, we decided to limit the gene sequence length so that it matched the length of the shortest gene sequence in our mTOR dataset.

Equation 2.1 involves two variable parameters, Len and m, which must be adjusted during the calculation process.
The pivotal element within this equation is Len, the subsequence length parameter. Its integral role in determining the outcomes of our calculations is demonstrated through the accuracy of a Support Vector Machines classification, discussed in Section 4. That section comprehensively analyzes the role of the Len parameter and its impact on the outcome, further underlining its importance in our research.

After performing these computations, we generated two tensors, one for the patient group and one for the control group, by collating the VARCH images of size [m×m]. It is important to note that the size of these tensors varies according to the content of the datasets used in each study. This variation is solely determined by the number of subsequences examined in the study. Each VARCH tensor is obtained by aligning the frequency matrices Mat(B) of the individuals in the corresponding patient or control group. These tensors are then given as input to the HDMR method in the same way as the CGR tensors.

2.3 High Dimensional Model Representation

High Dimensional Model Representation (HDMR) is a technique used to predict complex relationships between input and output variables in high dimensional systems [22] [39]. HDMR represents a multivariate function f(x1, x2, ..., xN) as a finite sum of functions that have fewer independent variables and are mutually orthogonal. This method, which assumes that high-order interaction effects are minimal, breaks down the output function into hierarchical components. This allows it to represent the independent and combined effects of the input variables. HDMR keeps its computational cost manageable as the size of the function grows by utilizing sampling methods, such as Monte Carlo integration, to handle scattered data points effectively.
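To make the idea of a hierarchical, mutually orthogonal decomposition concrete, the following sketch decomposes a small two-variable function sampled on a grid into a constant term, two univariate terms, and one bivariate term. It is an illustration under simplifying assumptions and not the implementation used in this study: uniform weights are assumed, plain grid averages stand in for the integrals, and the function name hdmr_2d and the random sample data are hypothetical.

```python
import numpy as np


def hdmr_2d(values):
    """Decompose a 2-D array f[x1, x2] as f0 + f1(x1) + f2(x2) + f12(x1, x2),
    using uniform weights (plain averages over the grid)."""
    values = np.asarray(values, dtype=float)
    f0 = values.mean()                     # constant term: overall average
    f1 = values.mean(axis=1) - f0          # univariate term in x1
    f2 = values.mean(axis=0) - f0          # univariate term in x2
    f12 = values - f0 - f1[:, None] - f2[None, :]  # remaining bivariate term
    return f0, f1, f2, f12


# Hypothetical sample data on a 4 x 5 grid.
rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 5))

f0, f1, f2, f12 = hdmr_2d(grid)
reconstructed = f0 + f1[:, None] + f2[None, :] + f12
print(np.allclose(reconstructed, grid))
```

In this uniform-weight setting the univariate terms average to zero, and the bivariate term averages to zero along each axis, which is what makes the components mutually orthogonal. Truncating the sum after the univariate terms yields a low-order approximation whose quality depends on how small the bivariate term is.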
Error analysis and comparison with test datasets are used for validation and assessment. The procedure comprises choosing basis functions, determining sample sizes, and using predetermined computational techniques. Overall, HDMR offers a methodical and efficient option for representing high dimensional multivariate functions.

Finding patterns and demonstrating relationships in genetic analysis research is computationally expensive owing to the high dimensionality and complexity of the problems involved. The HDMR method can be considered a solution to these problems in gene research [43] [44]. HDMR operates by splitting a multivariate function into a sum of hierarchical functions [22].

f (x1, . . . ,xN) = f0 + N ∑ i1=1 fi1(xi1)+ N ∑ i1,i2=1 i1