A deep learning architecture for missing metabolite concentration prediction

dc.contributor.advisor Çakmak, Ali
dc.contributor.author Çelik, Sadi
dc.contributor.authorID 504211530
dc.contributor.department Computer Engineering
dc.date.accessioned 2025-03-20T11:42:07Z
dc.date.available 2025-03-20T11:42:07Z
dc.date.issued 2024-07-12
dc.description Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
dc.description.abstract Over the last decade, deep learning methods have become widely used in bioinformatics for the diagnosis and treatment of diseases. Metabolomics is the omics science concerned with identifying and measuring all metabolites in an organism, and it can provide a comprehensive analysis of the metabolic profile under both physiological and pathological conditions. Metabolomics data serve as a measure of metabolic function; in particular, relative ratios and perturbations outside the normal range signify disease conditions. The analysis workflow for metabolomics data involves several bioinformatics tools. Metabolites are quantified using a wide range of combinations of mass spectrometry (MS) with liquid chromatography (LC), gas chromatography (GC), and nuclear magnetic resonance (NMR) techniques. Preprocessing is essential to ensure the high data quality needed for sound biological interpretation of metabolomics datasets and for powerful data analysis. In studies containing metabolite measurements, missing values may significantly affect the analysis results. In recent years, deep learning-based generative models have gained popularity for accurately imputing missing values. Unsupervised generative models such as variational autoencoders (VAEs) can impute missing values and thereby enable more powerful data analysis. This work aims to develop effective models that can accurately predict missing metabolite values in metabolomics datasets. To this end, a number of human metabolomics studies are collected from the Metabolomics Workbench and MetaboLights databases. These datasets are heterogeneous in terms of their metabolite sets and the underlying experimental technologies that generated them.
Hence, it is challenging to use these diverse datasets together to train imputation models. To tackle this challenge, we propose three models and dataset merging strategies, namely Union-based Merging, Iterative Similarity-based Merging, and Model-guided Agglomerative Merging. We perform several experiments to determine the optimal configuration of the training pipeline, including the best initial missing value imputation approach and the most effective data pretreatment scheme. After the original missing values are handled and a preprocessing pipeline is applied to the input data, k-fold cross-validation is carried out to ensure consistent and reliable model evaluation. Before training, random missingness simulations are performed to mimic different missing value patterns in clinical datasets, and the models are trained with those patterns. In our empirical evaluation, we observe that the computational complexity drawback of IterativeImputer with RandomForestRegressor becomes more evident on larger datasets. For this reason, KNNImputer is chosen as the standard method for filling the initial missing values in our proposed merging approaches. Moreover, applying log transformations with different bases and the Yeo-Johnson transformation to the datasets improves VAE model performance. Furthermore, our experimental results show that the proposed framework scales to large datasets and yields accurate metabolite- and dataset-independent imputation models for predicting missing values in metabolomics studies.
dc.description.degree M.Sc.
dc.identifier.uri http://hdl.handle.net/11527/26658
dc.language.iso en_US
dc.publisher Graduate School
dc.sdg.type Goal 3: Good Health and Well-being
dc.subject Deep learning
dc.subject Derin öğrenme
dc.subject Artificial intelligence
dc.subject Yapay zeka
dc.title A deep learning architecture for missing metabolite concentration prediction
dc.title.alternative Eksik metabolit miktarı tahmini için bir derin öğrenme mimarisi
dc.type Master Thesis
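
The evaluation loop described in the abstract (initial imputation, pretreatment, k-fold cross-validation, and random missingness simulation) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the data are synthetic, the missingness is simulated completely at random, the rates, fold count, and neighbor count are arbitrary, the helper simulate_mcar is hypothetical, and KNNImputer stands in for the trained VAE, whose architecture is not described in this record.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold
from sklearn.preprocessing import PowerTransformer


def simulate_mcar(X, rate, rng):
    """Return a copy of X with a random fraction of entries set to NaN, plus the mask."""
    mask = rng.random(X.shape) < rate
    X_masked = X.copy()
    X_masked[mask] = np.nan
    return X_masked, mask


rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x metabolites matrix (positive, right-skewed).
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 12))
X_raw = np.exp(0.5 * latent @ loadings + rng.normal(scale=0.1, size=(200, 12)))

# Assumption: pretend 5% of the measurements were missing in the original study.
X_raw, _ = simulate_mcar(X_raw, rate=0.05, rng=rng)

# Step 1: fill the original gaps with KNNImputer.
X_filled = KNNImputer(n_neighbors=5).fit_transform(X_raw)

fold_rmse = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X_filled):
    # Step 2: Yeo-Johnson pretreatment, fitted on the training fold only.
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    X_train = pt.fit_transform(X_filled[train_idx])
    X_test = pt.transform(X_filled[test_idx])

    # Step 3: hide 20% of the held-out entries to simulate missingness.
    X_test_masked, mask = simulate_mcar(X_test, rate=0.2, rng=rng)

    # Step 4: impute the hidden entries (KNNImputer is a placeholder for the VAE).
    model = KNNImputer(n_neighbors=5).fit(X_train)
    X_pred = model.transform(X_test_masked)

    # Score only the entries that were hidden.
    fold_rmse.append(float(np.sqrt(np.mean((X_pred[mask] - X_test[mask]) ** 2))))

mean_rmse = float(np.mean(fold_rmse))
```

Fitting the transformer inside each fold keeps held-out samples from influencing the pretreatment parameters. Scoring only the masked entries measures imputation error rather than the reconstruction of values the model was given.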
Files
Original bundle
Name: 504211530.pdf
Size: 2.97 MB
Format: Adobe Portable Document Format