Publication:
Transformer and statistical models for LCSH assignment: a comparative study in digital libraries

dc.contributor.author: Usta, Gökhan
dc.contributor.authorID: 0000-0001-9206-8149
dc.contributor.department: Artificial Intelligence and Data Engineering (Yapay Zeka ve Veri Mühendisliği)
dc.date.accessioned: 2025-11-20T06:36:50Z
dc.date.available: 2025-11-20T06:36:50Z
dc.date.issued: 2025
dc.description.abstract: Purpose - This study examines the effectiveness of machine learning models and ensemble approaches for automating Library of Congress Subject Headings (LCSH) assignment to graduate theses and dissertations, with the goal of enhancing the efficiency, scalability and accuracy of library subject indexing in the digital age.
Design/methodology/approach - A comparative quasi-experimental framework assessed five machine learning models (DeBERTa-v3-base, all-mpnet-base-v2, FastText, Omikuji Bonsai, term frequency-inverse document frequency [TF-IDF]) and two ensemble strategies (hybrid: DeBERTa + MPNet; ensemble: FastText + Omikuji Bonsai + TF-IDF) on a dataset of 1,104,600 thesis and dissertation titles across 1,578 LCSH labels, integrating organic and synthetic data. Synthetic titles were generated using large language models and rigorously validated to mitigate bias and prevent dataset imbalance. Performance was evaluated using F1, recall@5, NDCG@5, MRR and computational efficiency metrics (RAM usage and prediction time). Paired t-tests were conducted to confirm the statistical significance of key performance differences.
Findings - Transformer-based models (DeBERTa-v3-base: F1 0.7348; all-mpnet-base-v2: F1 0.7277) excelled in accuracy, whereas statistical models (e.g. FastText: 0.36 MiB, 0.0006 s) offered superior efficiency. The hybrid model achieved the highest F1 (0.7413) and NDCG@5 (0.8130), and the ensemble model led in recall@5 (0.8824), demonstrating the value of model integration. Ablation results showed that synthetic data substantially improved the classification and ranking performance of the models. Synthetic data also improved dataset balance, enhancing model generalization.
Originality/value - This study provides a novel comparison of transformer-based and statistical machine learning models for LCSH assignment, validated through both ablation and statistical significance testing, pioneering the use of synthetic data and probability-weighted ensembles to improve accuracy and ranking. It offers actionable insights for library automation, bridging gaps in prior research focused on narrower model sets.
dc.identifier.citation: Usta, G. (2025). "Transformer and statistical models for LCSH assignment: a comparative study in digital libraries". The Electronic Library. https://doi.org/10.1108/EL-03-2025-0102
dc.identifier.uri: https://doi.org/10.1108/EL-03-2025-0102
dc.identifier.uri: http://hdl.handle.net/11527/27936
dc.language.iso: en_US
dc.publisher: Emerald Publishing
dc.relation.ispartof: The Electronic Library
dc.rights.license: CC BY-NC 4.0
dc.sdg.type: none
dc.subject: automatic subject indexing
dc.subject: Library of Congress subject headings
dc.subject: LCSH
dc.subject: machine learning
dc.subject: transformer models
dc.subject: ensemble approaches
dc.subject: synthetic data
dc.subject: library automation
dc.title: Transformer and statistical models for LCSH assignment: a comparative study in digital libraries
dc.type: Article
dspace.entity.type: Publication

Files

Original bundle

Name: EL-03-2025-0102.pdf
Size: 2.32 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.58 KB
Format: Item-specific license agreed upon to submission