Semi-supervised learning strategy for improved flash point prediction
dc.contributor.advisor | Öğüdücü, Şule | |
dc.contributor.author | Sülük, Mert | |
dc.contributor.authorID | 504201527 | |
dc.contributor.department | Computer Engineering | |
dc.date.accessioned | 2025-06-17T06:24:43Z | |
dc.date.available | 2025-06-17T06:24:43Z | |
dc.date.issued | 2024-08-20 | |
dc.description | Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2024 | |
dc.description.abstract | This thesis explores the application of semi-supervised learning techniques to enhance the prediction of flash points in the oil industry, which are critical for ensuring the safety of transporting and storing petroleum products. The flash point is the lowest temperature at which a substance's vapors ignite in air, a crucial parameter that is traditionally determined through costly and time-consuming laboratory tests. This study proposes a data-driven approach to make this process more efficient. Semi-supervised learning, which leverages both labeled and unlabeled data, provides a robust framework that is especially valuable in scenarios where data labeling is prohibitively expensive or logistically challenging. This research integrates sensor data such as pressure, temperature, and flow rates with sparse flash point measurements to develop a predictive model. The aim is to reduce dependency on extensive laboratory testing while enhancing operational efficiency and safety protocols. The central research questions addressed are: How can flash points be accurately predicted in the oil industry when only a limited number of labeled data points are available? Given this constraint, could semi-supervised learning methods be an effective solution? What are the specific advantages and limitations of these techniques within the oil industry context? The study validates the effectiveness of semi-supervised learning methods and develops a model that improves upon traditional approaches. To address the research questions, particularly in the context of improving flash point predictions with limited labeled data, the study employs data preprocessing techniques and modeling processes that are essential for optimizing model performance. The methodology employs two principal data preprocessing techniques: Winsorization and Min-Max Scaling.
Winsorization mitigates the effects of outliers by limiting extreme data points to a designated percentile range, ensuring the model is not skewed by anomalies. Min-Max Scaling normalizes the data, allowing for equitable evaluation of all features and preventing any single feature from dominating the model's output. The modeling process involves the Gaussian Process Regressor and the Random Forest model. The Gaussian Process Regressor, suitable for continuous data, provides uncertainty estimates to gauge the reliability of predictions. The Random Forest model enhances stability and accuracy by aggregating predictions from multiple decision trees. Initially trained on labeled data, the Gaussian Process Regressor subsequently predicts labels for unlabeled data, and the predictions that fall within a specified confidence interval are added to the training set. This expanding dataset is then used to train the Random Forest model, applying an expanding window approach to incrementally improve prediction capabilities. Performance metrics such as Mean Absolute Error and Root Mean Squared Error assess model efficacy. The baseline model initially yielded a Mean Absolute Error of 1.1 degrees in flash point predictions. With the application of the semi-supervised learning model, Mean Absolute Error improved to 1.01 and Root Mean Squared Error decreased to 1.63, demonstrating that including unlabeled data improves accuracy. In conclusion, this thesis illustrates the potential of semi-supervised learning to bridge the gap caused by a scarcity of labeled data, particularly in critical industrial applications like oil processing. The findings suggest that semi-supervised learning not only reduces the financial and temporal expenditures associated with traditional testing methods but also offers a scalable, efficient alternative poised to transform industry practices.
The methodologies developed here have broader implications, suggesting that semi-supervised learning could be similarly beneficial in other sectors where data labeling is a significant constraint and even small performance improvements are critical due to the importance of the parameters being predicted. | |
dc.description.degree | M.Sc. | |
dc.identifier.uri | http://hdl.handle.net/11527/27319 | |
dc.language.iso | en_US | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 9: Industry, Innovation and Infrastructure | |
dc.subject | Prediction | |
dc.subject | Machine learning | |
dc.subject | Petroleum | |
dc.subject | Random forests | |
dc.subject | Sensors | |
dc.title | Semi-supervised learning strategy for improved flash point prediction | |
dc.title.alternative | Parlama noktası tahminini iyileştirmek için yarı denetimli öğrenme stratejisi | |
dc.type | Master Thesis |
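
Illustrative sketch (not taken from the thesis): the abstract names Winsorization, Min-Max Scaling, and a confidence-based filter on pseudo-labels, but this record does not give the percentile range, the confidence interval, or any code. The short Python sketch below shows one plausible reading of those three steps using only the standard library. The function names, the 5th/95th percentile defaults, and the rule "keep a pseudo-label when its predictive standard deviation is at most max_std" are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of pre-sorted values."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def winsorize(values: Sequence[float],
              lower_q: float = 5.0,
              upper_q: float = 95.0) -> List[float]:
    """Clip values to the [lower_q, upper_q] percentile range.

    The 5/95 defaults are assumed; the record does not state the range used.
    """
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to [0, 1]; a constant feature maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def select_pseudo_labels(means: Sequence[float],
                         stds: Sequence[float],
                         max_std: float) -> List[Tuple[int, float]]:
    """Return (index, predicted label) pairs for confident predictions.

    Assumed rule: a prediction is confident when its predictive standard
    deviation (e.g. from a Gaussian process) does not exceed max_std.
    """
    return [(i, m) for i, (m, s) in enumerate(zip(means, stds)) if s <= max_std]


if __name__ == "__main__":
    # Hypothetical sensor readings with one obvious outlier.
    readings = [1.0, 2.0, 3.0, 4.0, 100.0]
    clipped = winsorize(readings, lower_q=25.0, upper_q=75.0)
    print(min_max_scale(clipped))

    # Hypothetical predictive means and standard deviations.
    print(select_pseudo_labels([60.0, 62.5, 58.0], [0.2, 1.5, 0.4], max_std=0.5))
```

In a pipeline like the one the abstract describes, the selected pairs would be appended to the labeled set before the next model is trained. A stricter max_std admits fewer but more reliable pseudo-labels.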