A dataset quality enhancement method for fine-grained just-in-time software defect prediction models

Date
2024-07-01
Authors
Fidandan, İrem
Publisher
Graduate School
Abstract
Recently developed fine-grained just-in-time software defect prediction (JIT-SDP) models predict defect-inducingness individually for each changed file in a commit, unlike conventional JIT-SDP models, which predict it only at the commit level. These models also cost-effectively reduce the risk of missing defect-inducing changes in effort-aware JIT-SDP models by allowing developers to review only the defect-inducing files in a commit. Building machine learning models is a data-dependent process, so data quality is crucial: low data quality negatively affects the predictive performance, interpretability, and scalability of machine learning models. The novelty of this thesis is a two-phase method, based on experience and observations in software development, that improves the quality characteristics of the dataset (uniqueness, validity, accuracy, and relevance) for fine-grained JIT-SDP models. In the first phase, miscalculated features are deleted in some cases and corrected in others, when the conditions allow, to ensure uniqueness, validity, and accuracy. In the second phase, file changes in commits that have little or no impact on future defects are excluded from the dataset to ensure relevance. The proposed data quality improvement method is applied to the dataset of Trautsch et al. This dataset contains two different labels assigned automatically by the SZZ algorithm, which has two basic steps: identifying bug-fix intent in whole changes or in some blocks of changes, and, for each deleted line in the bug-fix locations, backtracking to find the changes that previously added it. The two labels differ because of the methods used to identify bug-fixing changes. Improvements in the predictive performance of fine-grained JIT-SDP models are then demonstrated when the proposed method is used in within-project and cross-project settings. In the within-project setting, a time-sensitive validation approach is used.
The time-sensitive validation approach first creates three-month training instance groups and one-month test instance groups in ascending time order, then trains a separate model for each instance group and measures its predictive performance, and finally takes the arithmetic mean of the results to obtain an overall score. For both within-project and cross-project settings, two types of datasets are used: with and without the proposed data quality improvements. Model training and evaluation are performed for each combination of feature set (JIT metrics, static code metrics, PMD static analyzer metrics, and all of them together) and label (Adhoc and ITS). In addition, CFS is applied to the dataset with data quality improvements to investigate whether the same or better predictive performance can be achieved with cleaner and more explainable models. Random Forest is trained with SMOTE to balance the dataset. Predictive performance is assessed by F1 score. In both within-project and cross-project settings, the proposed data quality improvements yielded higher F1 scores than the baseline. Additionally, in the cross-project setting, CFS always increased F1 scores. The proposed data quality improvement method may therefore help build better fine-grained JIT-SDP models.
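The time-sensitive validation procedure described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation. The function names (`time_sensitive_splits`, `f1_score`, `evaluate`), the `(date, features, label)` tuple layout, and the choice to slide the window forward one month at a time are all assumptions. The classifier is left as pluggable `fit` and `predict` callables, which is where Random Forest with SMOTE would go in the setting the abstract describes.

```python
from datetime import date
from statistics import mean


def month_index(d):
    """Map a date to a running month number so windows are easy to slide."""
    return d.year * 12 + (d.month - 1)


def time_sensitive_splits(instances, train_months=3, test_months=1):
    """Yield (train, test) lists of instances in ascending time order.

    `instances` is a list of (date, features, label) tuples (assumed layout).
    Each test window immediately follows its training window.
    """
    ordered = sorted(instances, key=lambda inst: inst[0])
    if not ordered:
        return
    first = month_index(ordered[0][0])
    last = month_index(ordered[-1][0])
    start = first
    while start + train_months + test_months - 1 <= last:
        train_end = start + train_months      # exclusive upper bound
        test_end = train_end + test_months    # exclusive upper bound
        train = [i for i in ordered if start <= month_index(i[0]) < train_end]
        test = [i for i in ordered if train_end <= month_index(i[0]) < test_end]
        if train and test:
            yield train, test
        start += 1  # assumption: the window slides by one month


def f1_score(y_true, y_pred):
    """F1 for the positive (defect-inducing) class, labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(instances, fit, predict):
    """Train one model per window and return the arithmetic mean of the F1 scores."""
    scores = []
    for train, test in time_sensitive_splits(instances):
        model = fit([(x, y) for _, x, y in train])
        y_pred = [predict(model, x) for _, x, _ in test]
        y_true = [y for _, _, y in test]
        scores.append(f1_score(y_true, y_pred))
    return mean(scores) if scores else None
```

Because a separate model is trained per window and only later instances are used for testing, no model is evaluated on changes that precede its training data.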
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
Keywords
data, data quality, datasets