A dataset quality enhancement method for fine-grained just-in-time software defect prediction models
dc.contributor.advisor | Buzluca, Feza | |
dc.contributor.author | Fidandan, İrem | |
dc.contributor.authorID | 504211520 | |
dc.contributor.department | Computer Engineering | |
dc.date.accessioned | 2025-01-14T07:30:30Z | |
dc.date.available | 2025-01-14T07:30:30Z | |
dc.date.issued | 2024-07-01 | |
dc.description | Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024 | |
dc.description.abstract | Recently developed fine-grained JIT-SDP models predict defect-inducingness individually for each changed file in a commit, unlike conventional JIT-SDP models, which predict it only for whole commits. These models also cost-effectively reduce the risk of missing defect-inducing changes in effort-aware JIT-SDP models by allowing developers to review only the defect-inducing files in a commit. Building machine learning models is a data-dependent process, so data quality is crucial. Low data quality negatively affects the predictive performance, interpretability, and scalability of machine learning models. The novelty of this thesis is a two-phase method that improves the quality characteristics of a dataset for fine-grained JIT-SDP models, namely uniqueness, validity, accuracy, and relevance, based on experience and observations from software development. In the first phase, miscalculated features are either deleted or, where conditions allow, corrected to ensure uniqueness, validity, and accuracy. In the second phase, file changes in commits that have little or no impact on future defects are excluded from the dataset to ensure relevance. The proposed data quality improvement method is applied to Trautsch et al.'s dataset. This dataset contains two different labels assigned automatically by the SZZ algorithm, which consists of two basic steps: identifying bug-fix purpose in whole changes or in some blocks of changes, and backtracking from each line deleted at the bug-fix locations to find the changes that previously added it. The label differences stem from the methods used to identify bug-fix purpose changes. Improvements in the predictive performance of fine-grained JIT-SDP models are then demonstrated when the proposed data quality improvement method is used in within-project and cross-project settings. In the within-project setting, a time-sensitive validation approach is used. 
The time-sensitive validation approach first creates three-month training instance groups and one-month test instance groups in ascending time order, then trains a separate model for each instance group and measures its prediction performance, and finally takes the arithmetic average of the predictive performance results to obtain an overall result. For both within-project and cross-project settings, two types of datasets are used: datasets with and without the proposed data quality improvements. Model training and evaluation are performed for each combination of feature set (JIT metrics, static code metrics, PMD static analyzer metrics, and all of them together) and label (Adhoc and ITS). In addition, CFS is applied to the dataset with data quality improvements to investigate whether the same or better prediction performance can be achieved with cleaner and more explainable models. Random Forest is used for training, with SMOTE applied to balance the dataset. Predictive performance is assessed by F1 score. In both within-project and cross-project settings, the proposed data quality improvements yielded higher F1 scores than the baseline. Additionally, in the cross-project setting, CFS always increased F1 scores. Therefore, the proposed data quality improvement method may help build better fine-grained JIT-SDP models. | |
dc.description.degree | M.Sc. | |
dc.identifier.uri | http://hdl.handle.net/11527/26196 | |
dc.language.iso | en_US | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 7: Affordable and Clean Energy | |
dc.sdg.type | Goal 12: Responsible Consumption and Production | |
dc.subject | data | |
dc.subject | veri | |
dc.subject | data quality | |
dc.subject | veri kalitesi | |
dc.subject | datasets | |
dc.subject | veri setleri | |
dc.title | A dataset quality enhancement method for fine-grained just-in-time software defect prediction models | |
dc.title.alternative | İnce taneli tam zamanında yazılım hata tahmin modelleri için veri kalitesi iyileştirme yöntemi | |
dc.type | Master Thesis |
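
The time-sensitive validation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`time_sensitive_splits`, `evaluate`), the one-month sliding step, and the `fit_and_score` callback are all assumptions made for illustration. The abstract states only the three-month training and one-month test group sizes and the final arithmetic averaging. In the thesis setting, the callback would presumably train a Random Forest on SMOTE-balanced training data and return an F1 score; here it is left as a plain callable so the sketch stays self-contained.

```python
from datetime import date
from statistics import mean


def month_index(d):
    """Map a date to a running month number so windows are easy to compare."""
    return d.year * 12 + (d.month - 1)


def time_sensitive_splits(dates, train_months=3, test_months=1):
    """Yield (train_positions, test_positions) for consecutive time windows.

    Assumption: windows slide forward one month at a time; the abstract
    does not state the step size.
    """
    months = [month_index(d) for d in dates]
    if not months:
        return
    start, last = min(months), max(months)
    while start + train_months + test_months - 1 <= last:
        train_end = start + train_months      # exclusive upper bound
        test_end = train_end + test_months    # exclusive upper bound
        train = [i for i, m in enumerate(months) if start <= m < train_end]
        test = [i for i, m in enumerate(months) if train_end <= m < test_end]
        if train and test:                    # skip windows with no instances
            yield train, test
        start += 1


def evaluate(dates, fit_and_score, **kwargs):
    """Score one model per window and return the arithmetic mean (or None)."""
    scores = [fit_and_score(tr, te)
              for tr, te in time_sensitive_splits(dates, **kwargs)]
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Hypothetical data: one file change per month, January to June 2024.
    commit_dates = [date(2024, m, 15) for m in range(1, 7)]
    for train, test in time_sensitive_splits(commit_dates):
        print("train:", train, "test:", test)
```

Each yielded pair holds positions into the original list, so the same indices can select both feature rows and labels. Training strictly on earlier months and testing on the following month keeps future information out of each model.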