A dataset quality enhancement method for fine-grained just-in-time software defect prediction models

Fidandan, İrem

dc.contributor.advisor	Buzluca, Feza
dc.contributor.author	Fidandan, İrem
dc.contributor.authorID	504211520
dc.contributor.department	Computer Engineering
dc.date.accessioned	2025-01-14T07:30:30Z
dc.date.available	2025-01-14T07:30:30Z
dc.date.issued	2024-07-01
dc.description	Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
dc.description.abstract	Recently developed fine-grained just-in-time software defect prediction (JIT-SDP) models predict defect-inducingness for each changed file in a commit, unlike conventional JIT-SDP models, which predict it only for the commit as a whole. These models also cost-effectively reduce the risk of missing defect-inducing changes in effort-aware JIT-SDP by allowing developers to review only the defect-inducing files in a commit. Building machine learning models is a data-dependent process, so data quality is crucial: low data quality negatively affects the predictive performance, interpretability, and scalability of the models. The novelty of this thesis is a two-phase method that improves the quality characteristics of a dataset for fine-grained JIT-SDP models, namely uniqueness, validity, accuracy, and relevance, based on experience and observations in software development. In the first phase, miscalculated features are either deleted or corrected, depending on the conditions, to ensure uniqueness, validity, and accuracy. In the second phase, file changes in commits that have little or no impact on future defects are excluded from the dataset to ensure relevance. The proposed method is applied to the dataset of Trautsch et al. This dataset contains two different labels assigned automatically by the SZZ algorithm, which consists of two basic steps: identifying bug-fixing changes (either entire changes or only some blocks within them), and then, for each line deleted at a bug-fix location, backtracking to find the change that previously added it. The two labels differ because of the methods used to identify bug-fixing changes. Improvements in the predictive performance of fine-grained JIT-SDP models are then demonstrated with the proposed method in both within-project and cross-project settings. In the within-project setting, a time-sensitive validation approach is used.
This approach first creates three-month training instance groups and one-month test instance groups in ascending time order, then trains a separate model for each group and measures its predictive performance, and finally takes the arithmetic mean of the results to obtain an overall result. For both the within-project and cross-project settings, two types of datasets are used: with and without the proposed data quality improvements. Model training and evaluation are performed for each combination of feature sets (JIT metrics, static code metrics, PMD static analyzer metrics, and all of them together) and labels (Adhoc and ITS). In addition, CFS is applied to the dataset with data quality improvements to investigate whether the same or better predictive performance can be achieved with cleaner and more explainable models. Random Forest is used for training, with SMOTE to balance the dataset. Predictive performance is assessed by F1 score. In both the within-project and cross-project settings, the proposed data quality improvements yielded higher F1 scores than the baseline. Additionally, in the cross-project setting, CFS always increased F1 scores. Thus, the proposed data quality improvement method may help build better fine-grained JIT-SDP models.
dc.description.degree	M.Sc.
dc.identifier.uri	http://hdl.handle.net/11527/26196
dc.language.iso	en_US
dc.publisher	Graduate School
dc.sdg.type	Goal 7: Affordable and Clean Energy
dc.sdg.type	Goal 12: Responsible Consumption and Production
dc.subject	data
dc.subject	veri
dc.subject	data quality
dc.subject	veri kalitesi
dc.subject	datasets
dc.subject	veri setleri
dc.title	A dataset quality enhancement method for fine-grained just-in-time software defect prediction models
dc.title.alternative	İnce taneli tam zamanında yazılım hata tahmin modelleri için veri kalitesi iyileştirme yöntemi
dc.type	Master Thesis
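
The time-sensitive validation described in the abstract (three-month training groups, one-month test groups, one model per group, arithmetic mean of the scores) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names, the `(date, payload)` instance layout, the `fit_and_score` callback, and the one-month step between consecutive windows are all assumptions, since the abstract does not specify them.

```python
from datetime import date
from statistics import mean


def month_index(d: date) -> int:
    """Map a date to a running month number so windows compare as integers."""
    return d.year * 12 + (d.month - 1)


def time_sensitive_windows(instances, train_months=3, test_months=1):
    """Yield (train, test) lists in ascending time order.

    `instances` is a list of (date, payload) pairs (hypothetical layout).
    Sliding the window by one month is an assumption; the abstract does
    not state the stride.
    """
    ordered = sorted(instances, key=lambda item: item[0])
    if not ordered:
        return
    first = month_index(ordered[0][0])
    last = month_index(ordered[-1][0])
    start = first
    while start + train_months + test_months - 1 <= last:
        train_end = start + train_months      # exclusive upper bound
        test_end = train_end + test_months    # exclusive upper bound
        train = [x for x in ordered if start <= month_index(x[0]) < train_end]
        test = [x for x in ordered if train_end <= month_index(x[0]) < test_end]
        if train and test:
            yield train, test
        start += 1


def average_score(instances, fit_and_score):
    """Fit and score one model per window, then take the arithmetic mean.

    `fit_and_score(train, test)` is a hypothetical callback that would train
    a classifier on `train` and return, for example, its F1 score on `test`.
    """
    scores = [fit_and_score(train, test)
              for train, test in time_sensitive_windows(instances)]
    return mean(scores) if scores else None
```

In practice, `fit_and_score` would wrap the classifier and metric named in the abstract (Random Forest with SMOTE, scored by F1); those are left out here so that the sketch depends only on the standard library.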

Files

Original bundle

Name: 504211520.pdf
Size: 359.97 KB
Format: Adobe Portable Document Format

Collections

LEE - Computer Engineering - Master's