Train set complexity tuning for imbalance learning

Date
2024-05-17
Authors
Ulaş, Mehmet
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
Machine learning algorithms that address most classification problems are known to yield good results under the assumption of a balanced training set. This is because, during training, machine learning algorithms attempt to minimize a specific cost objective function. If one class is predominant, it tends to dominate this minimization process, producing predictions biased toward the dominant class: generating predictions mostly for the prevalent class is enough to satisfy the minimization objective. This is problematic, since many real-world classification problems, such as fraud detection, churn prediction, and disease diagnosis, involve imbalanced datasets. In these problems, the accurate identification of examples belonging to the minority class is typically what matters most, yet imbalance hinders precisely this recognition. Addressing the imbalance problem in classification algorithms is therefore crucial for the successful application of machine learning in such real-life scenarios.

Various methods have been developed to address the imbalance learning problem, operating either at the model level or at the level of the model's training set. Manipulating the training set aims to create a relatively balanced dataset during the learning process. One approach removes some examples belonging to the majority class from the training set; however, this may lead to information loss related to the majority class. Another strategy augments the minority class by sampling from it. A fundamental challenge in sampling from the minority class lies in ensuring diversity in the information represented by minority class examples, since naive resampling incorporates repetitive information into the dataset. Solutions that generate synthetic data for the minority class have been proposed to address this issue.

Unlike these existing solutions, we propose a novel approach: training set complexity ratio tuning. Rather than introducing synthetic data, our method iteratively tunes the complexity of the training set, carefully balancing the removal of examples from the majority class against the augmentation of examples from the minority class to achieve a favorable training set complexity ratio.
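As a concrete illustration of the hyperparameter at the heart of this proposal, the short sketch below (our own example, not code from the thesis) computes the training set complexity ratio for a binary label vector:

```python
import numpy as np

def complexity_ratio(y: np.ndarray) -> float:
    """Ratio of majority-class to minority-class counts.

    Assumes binary integer labels with both classes present.
    """
    counts = np.bincount(y)            # counts[0], counts[1] for classes 0 and 1
    return counts.max() / counts.min()

# Example: 900 majority vs. 100 minority examples -> ratio 9.0
y = np.array([0] * 900 + [1] * 100)
print(complexity_ratio(y))
```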
By doing so, we aim to mitigate information loss in the majority class while enhancing the diversity and quantity of minority class examples, providing a nuanced perspective on the challenges posed by the imbalance learning problem. In the proposed method, we develop a solution that strengthens classification algorithms on imbalanced datasets by tuning the complexity of the training set. The complexity of the training set indicates how many times the number of examples in the majority class exceeds the number in the minority class. We treat this complexity as a hyperparameter, crucial for effectively addressing the imbalance learning problem, so tuning it becomes imperative. During this process, an equal number of examples from the minority and the majority class are initially selected. An iterative approach is then employed to incrementally increase the number of examples from the majority class. At each iteration, the base model is trained on the resulting training set, and its performance on the validation set is measured and recorded. Comparing the performance metrics obtained across iterations allows us to determine the optimal complexity of the training set, which corresponds to the iteration with the highest performance.

While adjusting the balance of the majority and minority classes in the training set, sample selection is made with regard to the number of majority class examples. To enhance the representativeness of the selected majority class samples, we employ the K-Means algorithm instead of random selection: we partition the majority class into clusters using K-Means and, during sample selection, draw from each cluster in proportion to its share of the original data. This approach is designed to improve the overall representativeness of the majority class, ultimately contributing to a more robust and reliable sampling strategy.

In addition to tuning the complexity ratio of the training set, we compared methods for determining the ideal number of clusters for the K-Means clustering applied to the majority class. First, we employed the commonly used elbow curve method to identify the number of clusters. Second, for each candidate cluster count, the train set complexity method was executed to find the optimal training set complexity ratio, and the performance of the resulting model on the validation set was recorded; the results were then compared to determine both the number of clusters for the majority class and the corresponding training set complexity ratio. Because the accuracy metric is misleading when measuring the performance of a model built on an imbalanced dataset (Chowdhury et al., 2023), the ROC performance metric was chosen as the evaluation criterion in the comparative analysis between our proposed model and the conventional SMOTE method. This metric, which quantifies the discriminative power of a classification algorithm, proved instrumental in assessing the algorithm's capacity to effectively differentiate between distinct classes.
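The following is a minimal sketch of the tuning procedure described above, assuming scikit-learn, a logistic regression base model, ROC AUC as the validation metric, and the minority class encoded as label 1. Function names such as tune_complexity_ratio and sample_majority_proportionally are our own illustration, not the thesis implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def sample_majority_proportionally(X_maj, n_samples, n_clusters, seed=0):
    """Select about n_samples majority examples, preserving each K-Means cluster's share."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    rng = np.random.default_rng(seed)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # Per-cluster quota proportional to the cluster's share of the majority class.
        quota = min(len(members), max(1, round(n_samples * len(members) / len(X_maj))))
        chosen.extend(rng.choice(members, size=quota, replace=False))
    return X_maj[np.asarray(chosen)]

def tune_complexity_ratio(X_min, X_maj, X_val, y_val, n_clusters, max_ratio=10):
    """Grow the majority share step by step; keep the ratio with the best validation ROC AUC."""
    best_ratio, best_auc = None, -np.inf
    for ratio in range(1, max_ratio + 1):          # start balanced (ratio 1), then add majority data
        n_maj = min(ratio * len(X_min), len(X_maj))
        X_maj_sub = sample_majority_proportionally(X_maj, n_maj, n_clusters)
        X_train = np.vstack([X_min, X_maj_sub])
        y_train = np.concatenate([np.ones(len(X_min)), np.zeros(len(X_maj_sub))])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_ratio, best_auc = ratio, auc
    return best_ratio, best_auc
```

The nested search over cluster counts described above would simply wrap tune_complexity_ratio in an outer loop over candidate values of n_clusters, keeping the pair of cluster count and complexity ratio with the best recorded validation score.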
Upon examining the comparison results, it is evident that the proposed methodology yields more favorable outcomes than the conventional SMOTE approach. The ROC performance metric, a comprehensive indicator of an algorithm's ability to discriminate between positive and negative instances, underscores the enhanced discriminative power exhibited by our proposed model. This comparative evaluation provides empirical evidence supporting the superior efficacy of the proposed methodology over the traditional SMOTE method and its potential to enhance the overall performance and robustness of classification algorithms, particularly in the context of imbalanced datasets.

In addition, to comprehensively assess the validation performance of the proposed method, we conducted a detailed examination of the confusion matrix. This analysis revealed a significant increase in the True Positive (TP) rate, indicating that the model is better able to correctly predict positive instances, thereby reducing false negatives and improving overall classification quality. These findings are critical: the ability to accurately classify positive instances is often crucial in real-world applications such as medical diagnosis, fraud detection, and various safety-critical systems. The results also highlight the robustness and reliability of the model on imbalanced datasets, a common challenge in practical scenarios where traditional methods often struggle and deliver suboptimal performance.

In summary, the proposed model presents a substantial advancement over other widely used techniques, providing a more effective and dependable approach to handling imbalanced data. This improvement demonstrates both the model's potential for broader applicability and its robustness in addressing complex and demanding classification problems, suggesting that the proposed method could be adopted across various domains requiring precise and reliable classification performance.
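To illustrate the kind of comparison reported here, the sketch below sets up the SMOTE baseline side of the evaluation on synthetic data. It assumes the imbalanced-learn package and the same logistic regression base model as above; it does not reproduce the thesis results, only the protocol of reading ROC AUC alongside the confusion matrix's TP and FN counts:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 95% negatives) stands in for the thesis datasets.
X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0.02, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE baseline: oversample the minority class, then fit the same base model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
tn, fp, fn, tp = confusion_matrix(y_val, model.predict(X_val)).ravel()
print(f"SMOTE baseline: ROC AUC={auc:.3f}, TP={tp}, FN={fn}")
```

A model produced by the complexity tuning sketch in the previous block would be evaluated with the same two metric lines, allowing a side-by-side reading of ROC AUC and the TP/FN counts.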
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2024
Keywords
Machine learning, Imbalance learning
Citation