Publication: Estimating metabolic flux variability with machine learning
Abstract
In recent years, the use of artificial intelligence in metabolomics for early disease diagnosis and treatment has increased significantly. Metabolomics is the scientific field that studies and analyzes metabolite abundances within an organism. Metabolites are small chemicals (e.g., glucose, cholesterol, and amino acids) that participate in biochemical reactions in cells. Various methods are employed to study metabolic activities in cells. In this study, we focus on computing the flux intervals of reactions. "Flux" is the rate at which a reaction converts input metabolites to output metabolites. Since flux values change frequently to adapt to varying conditions, a common approach is to study the feasible flux interval (i.e., the minimum and maximum values of a reaction flux). The most widely used technique for calculating reaction flux intervals is Flux Variability Analysis (FVA). FVA computes the minimum and maximum flux rates in a metabolic network by solving an optimization problem for each reaction using linear programming. However, this calculation is computationally expensive and time-consuming, often taking hours or even days, depending on the size of the data and the number of reactions in the biological network model. To address this problem, we propose several machine learning techniques that predict flux values for all reactions simultaneously. The use of prediction models reduces the computational time of FVA from minutes to under a second. This advancement is expected to accelerate metabolomics research and mitigate the problems arising from resource limitations. We employ both traditional machine learning methods and deep learning approaches. Among traditional methods, multi-output regression models such as Random Forest and XGBoost were implemented, each requiring separate training per output variable.
On the deep learning side, a variety of architectures, including Fully Connected Neural Networks (FCNN), Variational AutoEncoders (VAE), Convolutional Neural Networks (CNN), Graph Neural Networks (GNN), Transformers, and Neural Oblivious Decision Ensembles (NODE), were developed to model the relationships between metabolites and their roles. Techniques such as LeakyReLU activation, Batch Normalization, Dropout layers, and the AdamW optimizer were applied to mitigate overfitting and gradient issues. In particular, the GNN architecture leveraged the graph structure of the metabolic network to explicitly capture the detailed relationships among metabolites, reactions, and pathways. The overall objective of these models is to enable rapid and accurate prediction of biochemical pathway changes based on input metabolomics data. For training, a large number of metabolomics datasets encompassing measurements for more than 22,000 individuals were obtained from the Metabolomics Workbench and MetaboLights. The models were also evaluated on independent datasets covering five cancer types: breast cancer, clear cell renal carcinoma, colon adenocarcinoma, pancreatic adenocarcinoma, and prostate adenocarcinoma. Metabolite values within the datasets were first scaled using a Fold Change Scaler based on the healthy individuals' measurements. After merging the datasets, standard normalization was applied to the metabolite and pathway change values to reduce the adverse impact of differing data distributions on model performance. Missing values were imputed with zeros. Regression performance was assessed using 10-fold cross-validation and independently validated on the separate cancer datasets. RMSE and MAE were used as performance metrics. The regression results indicated that Random Forest (RF) and XGBoost outperformed the deep learning models.
Among deep learning approaches, CNN and FT-Transformer models showed relatively stronger performance. In total, 14 different methods were studied, with Random Forest, XGBoost, CNN, and FT-Transformer standing out. RF and XGBoost achieved RMSE values of 84.97 and 81.55, respectively. The average disease classification F1 score obtained using ground truth labels was 0.88, while the classification models using the predicted values achieved an F1 score of 0.80. The important feature sets of these models showed over 50% similarity in terms of statistically significant metabolic pathways associated with disease labels. Regarding computational speed, XGBoost was the fastest by far, accelerating the existing solution by a factor of 25,000. Compared to the state of the art in the literature, the proposed methods demonstrated superior performance across all evaluated scenarios. Random Forest and XGBoost performed comparably in many respects, but XGBoost is recommended because of its superior speed.
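The per-reaction linear programs that FVA solves, as described in the abstract, can be illustrated with a minimal sketch. This is not the thesis implementation: the four-reaction toy network, its bounds, and the helper name flux_variability are assumptions made for illustration, and the sketch omits the objective-fraction constraint that full FVA tools usually add. It uses SciPy's linprog to minimize and then maximize each flux under the steady-state constraint S v = 0.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (not from the thesis).
# Metabolites (rows): A, B. Reactions (columns):
#   R1: -> A (uptake), R2: A -> B, R3: A -> B (alternative route), R4: B -> (export)
S = np.array([
    [1.0, -1.0, -1.0,  0.0],   # mass balance for A
    [0.0,  1.0,  1.0, -1.0],   # mass balance for B
])

# Assumed lower and upper flux bounds for each reaction.
bounds = [(4.0, 10.0), (0.0, 10.0), (0.0, 3.0), (0.0, 10.0)]


def flux_variability(S, bounds):
    """Return (min, max) feasible flux per reaction subject to S v = 0 and the bounds."""
    n_metabolites, n_reactions = S.shape
    b_eq = np.zeros(n_metabolites)      # steady state: no net accumulation
    intervals = []
    for i in range(n_reactions):
        c = np.zeros(n_reactions)
        c[i] = 1.0                      # objective selects the flux of reaction i
        low = linprog(c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
        high = linprog(-c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
        if not (low.success and high.success):
            raise RuntimeError(f"LP for reaction {i} did not solve")
        # linprog minimizes, so the maximum is the negated optimum of -c.
        intervals.append((low.fun, -high.fun))
    return intervals


intervals = flux_variability(S, bounds)
for name, (vmin, vmax) in zip(["R1", "R2", "R3", "R4"], intervals):
    print(f"{name}: [{vmin:.2f}, {vmax:.2f}]")
```

Each reaction requires two LP solves, so the cost grows with the number of reactions in the network. That cost is what the learned regressors described above are meant to avoid by predicting all intervals in a single pass.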
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2025
Subject
metabolic flux, metabolik akış, artificial intelligence, yapay zeka, machine learning, makine öğrenmesi