A dataset quality enhancement method for fine-grained just-in-time software defect prediction models

dc.contributor.advisor: Buzluca, Feza
dc.contributor.author: Fidandan, İrem
dc.contributor.authorID: 504211520
dc.contributor.department: Computer Engineering
dc.date.accessioned: 2025-01-14T07:30:30Z
dc.date.available: 2025-01-14T07:30:30Z
dc.date.issued: 2024-07-01
dc.description: Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
dc.description.abstract: Recently developed fine-grained just-in-time software defect prediction (JIT-SDP) models predict defect-inducingness individually for each changed file in a commit, unlike conventional JIT-SDP models, which predict it only at the commit level. These models also cost-effectively reduce the risk of missing defect-inducing changes in effort-aware JIT-SDP models by allowing developers to review only the defect-inducing files in a commit. Building machine learning models is a data-dependent process, so the quality of the data is crucial. Low data quality negatively affects the predictive performance, interpretability, and scalability of machine learning models. The novelty of the thesis is a two-phase method that improves the quality characteristics of the dataset, including uniqueness, validity, accuracy, and relevance, based on experience and observations from software development for fine-grained JIT-SDP models. In the first phase, miscalculated features are either deleted or corrected, depending on the conditions, to ensure uniqueness, validity, and accuracy. In the second phase, file changes in commits that have little or no impact on future defects are excluded from the dataset to ensure relevance. The proposed data quality improvement method is applied to Trautsch et al.'s dataset. This dataset contains two different labels assigned automatically by the SZZ algorithm, which has two basic steps: identifying changes, or blocks within changes, that were made for a bug-fix purpose, and backtracking from each line deleted at those bug-fix locations to find the changes that previously added it. The two labels differ because of the methods used to identify bug-fix changes. Improvements in the predictive performance of fine-grained JIT-SDP models are then demonstrated when the proposed data quality improvement method is used in within-project and cross-project settings. In the within-project setting, a time-sensitive validation approach is used.
The time-sensitive validation approach first creates three-month training instance groups and one-month test instance groups in ascending time order, then trains a separate model for each instance group and measures its prediction performance, and finally takes the arithmetic mean of the predictive performance results to obtain an overall result. For both the within-project and cross-project settings, two types of datasets are used: datasets with and without the proposed data quality improvements. The model training and evaluation steps are performed for each combination of feature sets (JIT metrics, static code metrics, PMD static analyzer metrics, and all of them together) and labels (Adhoc and ITS). In addition, CFS is applied to the dataset with data quality improvements to investigate whether the same or better prediction performance can be achieved with cleaner and more explainable models. Random Forest is used for training, with SMOTE to balance the dataset. Predictive performance is assessed by the F1 score. In both the within-project and cross-project settings, the proposed data quality improvements yielded higher F1 scores than the baseline. Additionally, in the cross-project setting, CFS always increased the F1 scores. Thus, the proposed data quality improvement method may help build better fine-grained JIT-SDP models.
dc.description.degree: M.Sc.
dc.identifier.uri: http://hdl.handle.net/11527/26196
dc.language.iso: en_US
dc.publisher: Graduate School
dc.sdg.type: Goal 7: Affordable and Clean Energy
dc.sdg.type: Goal 12: Responsible Consumption and Production
dc.subject: data
dc.subject: veri
dc.subject: data quality
dc.subject: veri kalitesi
dc.subject: datasets
dc.subject: veri setleri
dc.title: A dataset quality enhancement method for fine-grained just-in-time software defect prediction models
dc.title.alternative: İnce taneli tam zamanında yazılım hata tahmin modelleri için veri kalitesi iyileştirme yöntemi
dc.type: Master Thesis
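
The time-sensitive validation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not say how the windows advance, so the sketch assumes the window slides forward one calendar month per step and that the test month immediately follows the three training months. The names `time_sensitive_splits`, `f1_score`, `evaluate`, and the `fit` callback are hypothetical; in the thesis the model is a Random Forest trained with SMOTE, which is left here as a pluggable function.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def month_index(d):
    """Map a date to a running month number so months can be compared."""
    return d.year * 12 + (d.month - 1)


def time_sensitive_splits(instances, train_months=3, test_months=1):
    """Yield (train, test) lists from (date, features, label) tuples.

    Assumption (not stated in the abstract): the window advances one
    calendar month per step, and the test month follows the training months.
    """
    by_month = defaultdict(list)
    for inst in sorted(instances, key=lambda i: i[0]):
        by_month[month_index(inst[0])].append(inst)
    if not by_month:
        return
    start, last = min(by_month), max(by_month)
    while start + train_months + test_months - 1 <= last:
        split = start + train_months
        train = [i for m in range(start, split) for i in by_month.get(m, [])]
        test = [i for m in range(split, split + test_months)
                for i in by_month.get(m, [])]
        if train and test:  # skip windows with no data on either side
            yield train, test
        start += 1


def f1_score(y_true, y_pred):
    """F1 for the positive (defect-inducing) class, labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(instances, fit):
    """Train one model per window and return the arithmetic mean of F1.

    `fit(train)` is a hypothetical callback that returns a
    `predict(features) -> 0 or 1` function.
    """
    scores = []
    for train, test in time_sensitive_splits(instances):
        predict = fit(train)
        y_true = [label for _, _, label in test]
        y_pred = [predict(x) for _, x, _ in test]
        scores.append(f1_score(y_true, y_pred))
    return mean(scores) if scores else 0.0
```

Keeping the model behind a `fit` callback separates the windowing and averaging logic from the choice of classifier, so the same loop can be reused for each feature set and label combination.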

Files

Original bundle

Name: 504211520.pdf
Size: 359.97 KB
Format: Adobe Portable Document Format
