A deep learning architecture for missing metabolite concentration prediction
Date
2024-07-12
Authors
Çelik, Sadi
Publisher
Graduate School
Abstract
Over the last decade, deep learning methods have become widely used in bioinformatics for the diagnosis and treatment of diseases. Metabolomics is the omics science concerned with identifying and measuring all metabolites in an organism, and it can provide a comprehensive view of the metabolic profile under both physiological and pathological conditions. Metabolomics data serve as a measure of metabolic function; in particular, relative ratios and perturbations outside the normal range signify disease conditions. The metabolomics analysis workflow involves a range of bioinformatics tools. Metabolites are quantified using mass spectrometry (MS) coupled with liquid chromatography (LC) or gas chromatography (GC), as well as nuclear magnetic resonance (NMR) techniques. Preprocessing is essential to ensure high data quality, which in turn enables sound biological interpretation and powerful data analysis. In many studies that include metabolite measurements, missing values can significantly degrade the analysis results. In recent years, deep learning-based generative models have gained popularity for the accurate imputation of missing values. Unsupervised generative models such as variational autoencoders (VAEs) can impute missing values and thereby enable more powerful data analysis. This work aims to develop effective models that accurately predict missing metabolite values in metabolomics datasets. To this end, a number of human metabolomics studies are collected from the Metabolomics Workbench and MetaboLights databases. These datasets are heterogeneous in both their metabolite sets and the experimental technologies that generated them, which makes it challenging to use them together to train imputation models.
To tackle this challenge, we propose three models and dataset merging strategies: Union-based Merging, Iterative Similarity-based Merging, and Model-guided Agglomerative Merging. We perform several experiments to determine the optimal configuration of the training pipeline, including the best approach for imputing the initial missing values and the most effective data pretreatment scheme. After the original missing values are handled and a preprocessing pipeline is applied to the input data, k-fold cross-validation is carried out to ensure consistent and reliable model evaluation. Before training, random missingness is simulated to mimic the different missing value patterns found in clinical datasets, and the models are trained with those patterns. In our empirical evaluation, we observe that the computational complexity of IterativeImputer with RandomForestRegressor becomes a more evident drawback on larger datasets. For this reason, KNNImputer is chosen as the standard method for filling the initial missing values in our proposed merging approaches. Moreover, applying log transformations with different bases and the Yeo-Johnson transformation to the datasets improves VAE model performance. Furthermore, our experimental results show that the proposed framework scales to large datasets and yields accurate metabolite- and dataset-independent imputation models for predicting missing values in metabolomics studies.
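The mask-and-score idea behind the missingness simulation can be illustrated with a minimal sketch. It is not the thesis pipeline: the data are synthetic, the missingness is assumed to be completely at random at a rate of 20%, and scikit-learn's KNNImputer stands in for the VAE so that the example stays short. The helper name `simulate_mcar`, the matrix size, and all parameter values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import PowerTransformer


def simulate_mcar(X, missing_rate, rng):
    """Hide a random fraction of entries (missing completely at random)."""
    mask = rng.random(X.shape) < missing_rate
    X_masked = X.copy()
    X_masked[mask] = np.nan
    return X_masked, mask


# Synthetic, strictly positive "concentrations":
# 60 samples x 8 metabolites (hypothetical data, not from the thesis).
rng = np.random.default_rng(0)
X_true = rng.lognormal(mean=1.0, sigma=0.5, size=(60, 8))

# Pretreatment: Yeo-Johnson transform fitted on the complete matrix.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
X_scaled = transformer.fit_transform(X_true)

# Hide roughly 20% of the entries, then fill them with KNN imputation
# (a stand-in for the VAE described in the abstract).
X_masked, mask = simulate_mcar(X_scaled, missing_rate=0.2, rng=rng)
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_masked)

# Score only the entries that were hidden.
rmse = float(np.sqrt(np.mean((X_imputed[mask] - X_scaled[mask]) ** 2)))
print(f"RMSE on masked entries: {rmse:.3f}")
```

Because the true values of the hidden entries are known, the error is computed on those entries alone. Repeating the procedure across cross-validation folds and different missing rates would give the kind of consistent comparison the abstract describes.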
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
Keywords
Deep learning,
Artificial intelligence