A deep learning architecture for missing metabolite concentration prediction

Çelik, Sadi

dc.contributor.advisor	Çakmak, Ali
dc.contributor.author	Çelik, Sadi
dc.contributor.authorID	504211530
dc.contributor.department	Computer Engineering
dc.date.accessioned	2025-03-20T11:42:07Z
dc.date.available	2025-03-20T11:42:07Z
dc.date.issued	2024-07-12
dc.description	Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
dc.description.abstract	In the last decade, deep learning methods for the diagnosis and treatment of diseases have become widespread in bioinformatics. Metabolomics is an omics science concerned with identifying and measuring all metabolites in an organism, and it can provide a comprehensive analysis of the metabolic profile under both physiological and pathological conditions. Metabolomics data serve as a measure of metabolic function; in particular, relative ratios and perturbations outside the normal range signify disease conditions. The analysis workflow for metabolomics data involves a variety of bioinformatics tools. Metabolites are quantified using various combinations of mass spectrometry (MS) with liquid chromatography (LC), gas chromatography (GC), and nuclear magnetic resonance (NMR) techniques. For proper biological interpretation of metabolomic datasets and powerful data analysis, preprocessing is essential to ensure high data quality. In studies containing metabolite measurements, missing values may significantly degrade the analysis results. In recent years, deep learning-based generative models have gained popularity for the accurate imputation of missing values. Unsupervised generative models such as variational autoencoders (VAEs) can impute missing values and thereby enable more powerful data analysis. This work aims to develop effective models that accurately predict missing metabolite values in metabolomics datasets. To this end, a number of human metabolomics studies are collected from the Metabolomics Workbench and MetaboLights databases. These datasets are heterogeneous in terms of their metabolite sets and the underlying experimental technologies that generated them. Hence, it is challenging to use these diverse datasets together to train imputation models. To tackle this challenge, we propose three models and dataset merging strategies, namely Union-based Merging, Iterative Similarity-based Merging, and Model-guided Agglomerative Merging. We perform several experiments to determine the optimal configuration of the training pipeline, including the best initial missing value imputation approach and the most effective data pretreatment scheme. After handling the original missing values and applying a preprocessing pipeline to the input data, k-fold cross-validation is carried out to ensure consistent and reliable model evaluation. Before training, random missingness simulations are performed to mimic different missing value patterns in clinical datasets, and the models are trained with those patterns. In our empirical evaluation, we observe that the complexity of IterativeImputer with RandomForestRegressor becomes a more evident drawback on larger datasets. For this reason, KNNImputer is chosen as the standard method for filling the initial missing values in our proposed merging approaches. Moreover, applying log transformations with different bases and the Yeo-Johnson transformation to the datasets improves VAE model performance. Furthermore, our experimental results show that the proposed framework scales to large datasets and yields accurate metabolite- and dataset-independent imputation models for predicting missing values in metabolomics studies.
dc.description.degree	M.Sc.
dc.identifier.uri	http://hdl.handle.net/11527/26658
dc.language.iso	en_US
dc.publisher	Graduate School
dc.sdg.type	Goal 3: Good Health and Well-being
dc.subject	Deep learning
dc.subject	Derin öğrenme
dc.subject	Artificial intelligence
dc.subject	Yapay zeka
dc.title	A deep learning architecture for missing metabolite concentration prediction
dc.title.alternative	Eksik metabolit miktarı tahmini için bir derin öğrenme mimarisi
dc.type	Master Thesis
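
The evaluation idea named in the abstract (log-transform the data, hide values at random, impute them with nearest neighbours, then score the imputation) can be illustrated with a minimal sketch. This is not the thesis code: the data are synthetic, the function names (mask_mcar, knn_impute, rmse) are hypothetical, and the small pure-Python nearest-neighbour routine is only a simplified stand-in for scikit-learn's KNNImputer, which is the method the abstract names. Only a natural-log transform is shown; the Yeo-Johnson transform, k-fold cross-validation, and the VAE itself are omitted.

```python
import math
import random


def mask_mcar(data, rate, rng):
    """Hide a random fraction of entries (missing completely at random).

    Returns a masked copy (None marks a hidden value) and the hidden positions.
    """
    masked = [row[:] for row in data]
    hidden = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < rate:
                row[j] = None
                hidden.append((i, j))
    return masked, hidden


def _distance(a, b):
    """Mean squared difference over the columns observed in both rows."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return sum(shared) / len(shared) if shared else math.inf


def knn_impute(masked, k=3):
    """Fill each hidden entry with the mean of that column over the k nearest rows."""
    filled = [row[:] for row in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j was observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for r, other in enumerate(masked)
                if r != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if not nearest:
                # Fallback (assumption): use the mean of all observed values in the column.
                nearest = [other[j] for other in masked if other[j] is not None]
            filled[i][j] = sum(nearest) / len(nearest) if nearest else 0.0
    return filled


def rmse(truth, estimate, positions):
    """Root-mean-square error restricted to the artificially hidden positions."""
    if not positions:
        return 0.0
    total = sum((truth[i][j] - estimate[i][j]) ** 2 for i, j in positions)
    return math.sqrt(total / len(positions))


# Synthetic, strictly positive "concentrations": 40 samples x 5 metabolites.
rng = random.Random(0)
scales = [1.0, 2.5, 0.4, 10.0, 5.0]
raw = []
for _ in range(40):
    base = rng.lognormvariate(0.0, 1.0)
    raw.append([base * s * rng.lognormvariate(0.0, 0.1) for s in scales])

# Log transformation as a simple pretreatment step.
log_data = [[math.log(x) for x in row] for row in raw]

# Simulate 20% random missingness, impute, and score on the hidden entries only.
masked, hidden = mask_mcar(log_data, rate=0.2, rng=rng)
filled = knn_impute(masked, k=3)
error = rmse(log_data, filled, hidden)
print(f"hidden entries: {len(hidden)}, RMSE in log space: {error:.3f}")
```

Scoring only the positions that were hidden on purpose is what makes the simulation useful: the true values are known there, so different imputers or pretreatment schemes can be compared on the same mask.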

Files

Original bundle

Name: 504211530.pdf
Size: 2.97 MB
Format: Adobe Portable Document Format

Collections

LEE- Bilgisayar Mühendisliği-Yüksek Lisans (Graduate School, Computer Engineering, Master's)