New proposed methods for synthetic minority over-sampling technique

Date
2024-08-21
Authors
Korul, Hakan
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
Machine Learning (ML), a branch of artificial intelligence (AI), aims to mimic human learning processes using data and algorithms. Machine learning algorithms are trained on data, which can be labeled or unlabeled, with the aim of finding patterns by considering each data variable and, where available, its corresponding label. Once trained, an algorithm predicts the labels of future data based on the patterns learned from previously observed data. Training is guided by an error function that evaluates the algorithm's performance on the dataset; to improve performance, the algorithm iteratively searches for a solution with a lower error value than the one obtained previously.

The proliferation of ML algorithms and their use across many fields has allowed numerous challenging problems to be solved with machine learning. These problems can generally be categorized as supervised and unsupervised learning. Supervised learning, where data labels are known, can be further divided into regression and classification. Regression algorithms are used for problems such as predicting stock prices, predicting house prices, and forecasting weather conditions, while classification algorithms are employed for tasks like customer churn prediction, email spam detection, and image recognition. Problems where labels are not available in the dataset fall under unsupervised learning, which typically includes tasks such as customer segmentation.

Training machines properly and achieving high performance relies heavily on the quality of the data. The data used for machine learning should ideally be clean, balanced, and representative of real-world conditions. The size of the dataset is also important, as larger datasets usually lead to better results. For classification problems in particular, a balanced class distribution is crucial; when the class distribution is skewed, the result is what is known as an imbalanced dataset. Training a model on a problem with balanced classes rather than imbalanced classes can yield noticeably different performance. For instance, in a roughly balanced problem where 4 out of 10 customers churn, even a model that guesses randomly can make some successful predictions. However, if only 1 out of 100 customers churns, random guessing becomes insufficient, and the problem becomes extremely challenging for machine learning models.

Various methods have been developed to address this issue. In the example above, where 99 customers do not churn and 1 customer does, the classes are typically referred to as the majority and minority classes. The available methods fall into three categories: increasing the number of minority class samples toward the number of majority class samples (oversampling), decreasing the number of majority class samples toward the number of minority class samples (undersampling), or combining both approaches (hybrid sampling). Directly comparing oversampling and undersampling methods can be quite challenging, as they may perform differently on different datasets; however, certain characteristics of the dataset can help guide the choice of method.
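As a purely illustrative aside (not part of the thesis itself), the three resampling strategies can be sketched with the imbalanced-learn library; the dataset, imbalance ratio, and parameter choices below are hypothetical placeholders:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN  # one possible hybrid: SMOTE + Edited Nearest Neighbours

# Hypothetical imbalanced dataset: roughly 1 minority sample per 99 majority samples.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)

# Oversampling: replicate minority samples until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

# Undersampling: discard majority samples until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Hybrid sampling: combine over- and undersampling in a single step.
X_hyb, y_hyb = SMOTEENN(random_state=42).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_over), np.bincount(y_under), np.bincount(y_hyb))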
For example, if the dataset is not very large, undersampling may not be suitable because it eliminates majority class samples and leaves a significantly smaller dataset; in such cases, oversampling methods that increase the overall dataset size should be preferred. Oversampling methods can be further divided into random oversampling and SMOTE. The most significant difference between the two lies in how synthetic data is generated: random oversampling duplicates existing minority class samples, whereas SMOTE generates new synthetic points close to existing minority class samples using various techniques. Since random oversampling introduces no new examples to the learning algorithm and can lead to overfitting, SMOTE methods generally yield better performance improvements.

In the data generation stage of SMOTE, the k nearest neighbors of each minority sample are first determined. To generate a synthetic point, two examples are selected: the minority sample itself and a minority class example chosen at random among its k nearest neighbors. The difference between the neighbor and the minority sample is then computed, multiplied by a random number between 0 and 1, and added back to the minority sample. This places the newly generated synthetic point somewhere between the minority sample and the randomly selected neighbor.

Following the introduction of SMOTE, many extensions have been proposed; Borderline SMOTE and K-Means SMOTE are two of them. Borderline SMOTE defines a border between the minority and majority classes and restricts synthetic data generation to the minority samples on this borderline. K-Means SMOTE first partitions the training set into k clusters using the K-means technique, decides which clusters to use for oversampling, and then applies the SMOTE steps within those clusters.

In this thesis, I developed three new SMOTE methods. The first, Genetic SMOTE, focuses on variable alteration between a sample and one of its neighbors using the crossover logic common in genetic algorithms: randomly selected features are taken from the neighbor, the remaining features are taken directly from the sample, and the result is a new synthetic point of the same dimensionality. The second, Dual Borderline SMOTE, is inspired by Borderline SMOTE. Instead of the ratio of minority samples among the total number of neighbors used in Borderline SMOTE, which ranges between 0.5 and 1, I adjusted this ratio to range between 0 and 1. The essential difference from Borderline SMOTE is that Borderline SMOTE does not check whether the neighbors of a borderline minority sample are themselves on the border, whereas in this method both the sample and its neighbor must lie on the newly defined border. The third method, Genetic Dual Borderline SMOTE, combines the two: it applies the Dual Borderline SMOTE rules for selecting a minority sample and a nearby minority neighbor, meaning that both the selected sample and its neighbor must lie on the defined borderline. When generating synthetic data from the two selected examples, however, instead of applying the SMOTE interpolation used in Dual Borderline SMOTE, it alters variables in the same way as the Genetic SMOTE steps.
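A minimal NumPy sketch of the interpolation step described above, together with one possible reading of the crossover-style feature swap attributed to Genetic SMOTE, is shown below; this is an illustration under stated assumptions (function names, the 0.5 swap probability, and the toy vectors are hypothetical), not the implementation from the thesis:

import numpy as np

rng = np.random.default_rng(0)

def smote_point(sample, neighbor, rng):
    # Classic SMOTE step: place a synthetic point somewhere on the line
    # segment between a minority sample and one of its k nearest
    # minority-class neighbors.
    gap = rng.random()                      # random number in [0, 1)
    return sample + gap * (neighbor - sample)

def genetic_smote_point(sample, neighbor, rng):
    # Crossover-style variant as described for Genetic SMOTE: randomly
    # selected features are copied from the neighbor, the remaining
    # features are copied from the sample itself.
    mask = rng.random(sample.shape[0]) < 0.5  # assumed per-feature swap probability
    return np.where(mask, neighbor, sample)

sample   = np.array([1.0, 2.0, 3.0, 4.0])   # toy minority sample
neighbor = np.array([2.0, 1.0, 4.0, 3.0])   # toy minority neighbor
print(smote_point(sample, neighbor, rng))
print(genetic_smote_point(sample, neighbor, rng))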
In the final part of the study, 8 datasets and 4 machine learning algorithms were used to compare the performance of the three developed methods with the three existing methods. For each dataset and algorithm, a total of 6 SMOTE methods were evaluated. During the comparison, various parameter combinations of the machine learning algorithms and the SMOTE methods were tested for each dataset, and the parameters yielding the best results were selected for each dataset, model, and SMOTE method; the SMOTE methods were then compared using the parameters that gave the best F1 score. For performance measurement, the F1 score of the minority class was preferred over metrics such as accuracy, which are not suitable for imbalanced datasets. To reduce randomness and obtain a more reliable estimate, the F1 score was computed with 5-fold cross-validation. A total of 32 F1 scores (8 datasets × 4 machine learning algorithms) were obtained for each SMOTE method, and the scores were illustrated in figures both overall and per model. Additionally, for each dataset and machine learning algorithm the SMOTE methods were ranked from 1 to 6, and these rankings were compared in the study. The newly developed methods were observed to outperform the existing ones. In the model-based analyses, the SMOTE methods perform similarly with linear models, whereas with ensemble methods the Genetic Dual Borderline SMOTE method stands out noticeably from the others.
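As a rough sketch of this kind of evaluation protocol (the dataset, model, and parameters below are placeholders, not those used in the thesis), the 5-fold cross-validated minority-class F1 score can be computed with an imbalanced-learn pipeline, so that oversampling is applied only to the training folds:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Placeholder imbalanced dataset; the thesis used 8 real datasets instead.
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=42)

# Putting SMOTE inside the pipeline ensures synthetic samples are generated
# only from each training fold, never from the validation fold.
pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=42)),
    ("model", RandomForestClassifier(random_state=42)),
])

# scoring="f1" reports the F1 score of the positive (here: minority) class.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(scores.mean())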
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2024
Keywords
machine learning, artificial intelligence
Citation