Penalized stable regression
Files
Date
2024-06-24
Authors
Sarıbaş, İrem
Publisher
Graduate School
Abstract
In machine learning, data splitting is critical for developing accurate and consistent models. It involves dividing the data into separate sets for training, validation, and testing: the training set is used to fit the models, the validation set assists in selecting the best parameters, and the test set allows the model's performance to be assessed in real-world scenarios. Various data splitting techniques exist, each suited to specific characteristics of the data set and modeling objectives, such as the one-time split and k-fold cross-validation. In the one-time split method, the data set is randomly divided into two subsets at a predetermined ratio. In k-fold cross-validation, the data set is randomly divided into k equal parts; the model is trained on k-1 parts, and the remaining part is used for validation or testing. This process is repeated k times so that each part is used for validation exactly once. Over-fitting is a problem in machine learning where a model learns the details and noise in the training data to such an extent that its performance on previously unseen data suffers. Regularized regression methods play a crucial role in addressing over-fitting, especially for models that perform excellently on training data but fail on new, previously unseen data. Techniques such as Ridge regression, the Least Absolute Shrinkage and Selection Operator (LASSO), Smoothly Clipped Absolute Deviation (SCAD), and the Minimax Concave Penalty (MCP) hold a significant place in model training. By penalizing the coefficients of features, these methods reduce over-fitting and encourage simpler models, which are more likely to generalize well to new data sets.
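The two traditional splitting schemes described above can be sketched at the index level in pure Python; the `seed` argument is an illustrative addition for reproducibility, not a detail from the thesis:

```python
import random

def one_time_split(n, train_ratio=0.8, seed=0):
    """One-time split: shuffle indices 0..n-1 once and cut them
    into train and validation sets at a predetermined ratio."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * train_ratio)
    return idx[:cut], idx[cut:]

def k_fold_splits(n, k=5, seed=0):
    """k-fold cross-validation: shuffle indices once, carve them
    into k roughly equal folds, and yield (train, validation)
    index pairs so that each fold validates exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

Each yielded pair partitions the full index set, so every observation is used for validation exactly once across the k iterations.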
The penalty implemented by Ridge regression is proportional to the sum of the squares of the coefficients; it shrinks their effect while retaining all features in the model, never eliminating any feature completely. LASSO aims both to shrink regression coefficients and to remove insignificant features from the model. It employs the sum of the absolute values of the coefficients as the penalty term, zeroing out the coefficients of insignificant features and thereby performing automatic feature selection. SCAD applies a penalty similar to that of LASSO to small coefficients but avoids penalizing large coefficients, allowing the model to retain coefficients that are significantly different from zero. MCP was developed for variable selection in high-dimensional data; it offers a non-convex penalty mechanism that promotes sparse solutions while penalizing large coefficients with less bias, shrinking them less than Ridge does. In this thesis, we propose an optimization-based algorithmic data splitting method to select training and validation sets effectively. The proposed method systematically assigns data points to the training or validation set based on their contribution to the model's performance: a data point whose contribution is high is placed in the training set, and one whose contribution is low in the validation set. The proposed approach is tested on regression models using the Ridge, LASSO, SCAD, and MCP penalties and compared with traditional data splitting techniques, namely the one-time split method and k-fold cross-validation, on two different data sets using various evaluation metrics. Each data splitting scenario is repeated one thousand times to ensure the consistency of the results and to obtain statistically reliable outcomes.
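For concreteness, the four penalties can be written out for a single coefficient. The shape parameters `a = 3.7` (SCAD) and `gamma = 3.0` (MCP) are common defaults from the literature, not values taken from the thesis:

```python
def ridge_penalty(beta, lam):
    """Ridge: quadratic shrinkage, never exactly zero."""
    return lam * beta ** 2

def lasso_penalty(beta, lam):
    """LASSO: absolute-value penalty, can zero out coefficients."""
    return lam * abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """SCAD: LASSO-like near zero, constant beyond a*lam, so
    large coefficients incur no additional penalty."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    return lam ** 2 * (a + 1) / 2

def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalty flattens out at gamma*lam, giving sparse
    solutions with less bias on large coefficients."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2
```

Note that SCAD and MCP become constant for large `|beta|`, which is exactly the "avoid penalizing large coefficients" behavior described above, whereas the Ridge penalty keeps growing quadratically.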
The evaluation metrics include the runtime, the average and standard deviation of the regularization parameter lambda, prediction errors on the validation, training, and test sets, and the average and standard deviation of the coefficients. In the one-time split scenario, the data set is randomly divided so that 80% of the observations form the training and validation set and 20% form the test set; the training and validation sets are then randomly split at a predetermined ratio, models are constructed using these sets, and performance is measured. In the k-fold cross-validation scenario, the data set is randomly divided in the same 80/20 manner, and the training and validation set is then divided into k equal parts: k-1 parts are used for training while the remaining part is used for validation. This process is repeated k times, each time with a different part used as the validation set, and the performance of the models is measured. In the scenario evaluating the optimization-based data splitting approach, after the same 80/20 division, the training and validation sets are split at a predetermined ratio according to the contribution of each data point to the model's performance; models are built using these sets, and their performance is measured. The findings from these scenarios are as follows. In terms of runtime, the proposed method requires more time than the one-time split but delivers effective results in similar or less time than k-fold cross-validation. This advantage becomes more pronounced with large data sets or complex models.
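The abstract does not spell out the assignment algorithm itself, but the idea of routing high-contribution points to training can be illustrated with a hypothetical ranking step, assuming a per-point contribution score has already been computed by some means:

```python
def contribution_based_split(scores, train_ratio=0.8):
    """Hypothetical sketch of the optimization-based idea: rank
    the points in the 80% pool by a given contribution score
    (higher = more useful for training) and assign the top
    fraction to training, the rest to validation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=True)
    cut = int(len(scores) * train_ratio)
    return sorted(order[:cut]), sorted(order[cut:])
```

This is only an illustration of the assignment rule stated in the abstract; the thesis's actual optimization formulation for computing the contributions is not reproduced here.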
Our method optimizes the data splitting process, balancing time cost while maximizing the accuracy and performance of the model. Regarding the average value of the regularization parameter lambda, the variability of lambda across scenarios indicates that regularization methods such as Ridge and SCAD significantly affect how the models fit the data. In the case of LASSO, low lambda values yield outcomes similar to those of unregularized regression models, suggesting a minimal impact of regularization. Regarding the standard deviation of lambda, the proposed method reduces it, ensuring a more consistent fit; this reduction indicates an improvement in the model's generalization ability. In terms of prediction errors (MSE) evaluated per scenario, the proposed method maintains consistent MSE values across the validation, training, and test sets. Notably, both k-fold cross-validation and the proposed optimization approach enhance the generalization capacity of the model, offering the lowest MSE values. The results demonstrate that the proposed optimization-based data splitting method can produce models with prediction errors comparable to, and in some cases lower than, those developed using k-fold cross-validation, at a lower computational cost than k-fold cross-validation. Furthermore, the resulting models exhibit significantly lower standard deviations in predictions, model coefficients, and hyperparameters. This indicates a marked increase in model stability and suggests that the proposed method can contribute to the development of more reliable and consistent machine learning models.
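The stability metrics reported here amount to means and standard deviations of each quantity (lambda, coefficients, prediction error) over the one thousand repetitions; a minimal sketch:

```python
import statistics

def mse(y_true, y_pred):
    """Mean squared prediction error over a set of observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def summarize(values):
    """Mean and sample standard deviation of a metric collected
    across repeated data-splitting runs; a lower standard
    deviation signals a more stable model."""
    return statistics.mean(values), statistics.stdev(values)
```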
These findings offer promising perspectives on the applicability and effectiveness of the method.
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
Keywords
machine learning,
statistical analysis