Comparative next generation sequencing data analysis
Abstract
Next-generation sequencing (NGS) technologies have revolutionized genomics and biomedical research by enabling rapid and cost-effective sequencing of DNA and RNA. The widespread adoption of these technologies in both clinical and academic applications has resulted in the accumulation of an immense volume of genomic data. This unprecedented data surge presents substantial computational challenges, as analyzing and interpreting NGS data requires sophisticated algorithms and significant computational resources. A typical DNA sequencing analysis pipeline consists of several steps: raw data quality control and adapter removal, alignment of raw reads to a reference genome, duplicate read removal, variant calling, and variant annotation. A plethora of bioinformatics tools exists for each of these steps, each with distinct strengths and limitations that influence tool selection by researchers and practitioners. The inherent characteristics of sequencing data, such as read length, coverage depth, and error profiles, significantly impact tool performance, necessitating systematic comparative evaluation for optimal pipeline configuration. Cancer, a genetic disease primarily driven by somatic mutations, is one of the major research areas that benefits significantly from NGS technologies. The presence of specific genomic alterations can dramatically influence treatment decisions, as modern targeted therapies are designed to address particular genetic variants. Therefore, accurate and reliable detection of somatic mutations is crucial for clinical decision-making. The heterogeneous nature of tumor cells presents challenges for variant detection, especially when identifying mutations with low variant allele frequencies, which often fall below conventional detection thresholds. In this thesis, a systematic evaluation of aligners and variant callers on a novel glioblastoma dataset is conducted.
The dataset comprises 55 whole-exome sequencing (WES) samples representing various stages of tumor progression, each exhibiting distinct heterogeneity profiles. We evaluated 12 distinct analytical pipelines created by combining 3 mapping algorithms with 4 variant callers, executing 600 independent analyses across the tumor samples. Our findings reveal that the heterogeneity of the sample significantly affects variant calling performance. Furthermore, we demonstrate that ensemble approaches combining multiple pipelines significantly improve overall performance. To validate our findings from the glioblastoma samples, we generated an in-silico dataset that simulates various heterogeneity profiles. To address the challenges in comparative NGS analysis, we present a computational framework that facilitates the seamless integration of multiple bioinformatics tools into cohesive DNA sequencing analysis pipelines. The framework implements parallelized versions of widely used tools, achieving up to 8-fold acceleration when utilizing high-performance storage systems. Its integrated comparison module enables systematic evaluation of diverse analytical tools, allowing users to efficiently assess tool performance on their specific datasets. Leveraging this functionality, we conducted a comprehensive benchmarking study using the SEQC2 somatic mutation reference dataset, evaluating combinations of 2 alignment algorithms and 6 variant callers. Our analysis reveals significant tool compatibility constraints and demonstrates that several commonly employed variant callers exhibit reduced sensitivity in detecting low-frequency variants, a limitation particularly relevant for cancer genomics applications.
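As an illustration of the pipeline combinations and ensemble approach described in the abstract, the following minimal Python sketch enumerates aligner/caller pairings and merges their calls by simple vote counting. It is not the thesis's implementation: the tool names are placeholders (the abstract does not name the tools), the function names `build_pipelines` and `ensemble_calls` are hypothetical, and vote counting with a `min_votes` threshold is only one possible ensemble rule, assumed here for illustration.

```python
from collections import Counter
from itertools import product

# Placeholder names: the abstract does not say which tools were used.
ALIGNERS = ["aligner_A", "aligner_B", "aligner_C"]
CALLERS = ["caller_1", "caller_2", "caller_3", "caller_4"]


def build_pipelines(aligners, callers):
    """Return every aligner/caller pairing as an (aligner, caller) tuple."""
    return list(product(aligners, callers))


def ensemble_calls(calls_by_pipeline, min_votes):
    """Keep variants reported by at least `min_votes` pipelines.

    `calls_by_pipeline` maps a pipeline to an iterable of variants, where a
    variant is a hashable key such as (chromosome, position, ref, alt).
    The voting rule is an assumption made for this sketch.
    """
    votes = Counter()
    for variants in calls_by_pipeline.values():
        votes.update(set(variants))  # each pipeline votes once per variant
    return {variant for variant, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    pipelines = build_pipelines(ALIGNERS, CALLERS)
    print(len(pipelines))  # 3 aligners x 4 callers = 12 pipelines

    # Toy calls from three of the pipelines (made-up variants, not real data).
    v1 = ("chr7", 100, "A", "T")
    v2 = ("chr7", 200, "G", "C")
    toy_calls = {
        pipelines[0]: [v1, v2],
        pipelines[1]: [v1],
        pipelines[2]: [v1],
    }
    print(ensemble_calls(toy_calls, min_votes=2))
```

Raising `min_votes` trades sensitivity for precision, which matters for low-frequency variants that only some callers detect; the appropriate threshold would have to be chosen per dataset.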
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2025
Subject
Next-generation sequencing (NGS), analytical tools, cancer