Yayın: Transformer and statistical models for LCSH assignment: a comparative study in digital libraries
Yükleniyor...
Dosyalar
Tarih
item.page.authors
Dergi Başlığı
Dergi ISSN
Cilt Başlığı
Yayımcı
Emerald Publishing
Özet
Purpose -
This study aims to examine the effectiveness of machine learning models and ensemble approaches for automating Library of Congress Subject Headings (LCSH) assignment to graduate theses and dissertations, aiming to enhance the efficiency, scalability and accuracy of library subject indexing in the digital age.
Design/methodology/approach -
A comparative quasi-experimental framework assessed five machine learning models (DeBERTa-v3-base, all-mpnet-base-v2, FastText, Omikuji Bonsai, term frequency-inverse document frequency [TF-IDF]) and two ensemble strategies (hybrid: DeBERTa + MPNet; ensemble: FastText + Omikuji Bonsai + TF-IDF) on a dataset of 1,104,600 thesis and dissertation titles across 1,578 LCSH labels, integrating organic and synthetic data. Synthetic titles were generated using large language models and rigorously validated to mitigate bias and prevent dataset imbalance. The performance was evaluated using F1, recall@5, NDCG@5, MRR and computational efficiency metrics (RAM usage and prediction time). Paired t-tests were conducted to confirm statistical significance of key performance differences.
Findings -
Transformer-based models (DeBERTa-v3-base: F1 0.7348; all-mpnet-base-v2: F1 0.7277) excelled in accuracy, whereas statistical models (e.g. FastText: 0.36 MiB, 0.0006 s) offered superior efficiency. The hybrid model achieved the highest F1 (0.7413) and NDCG@5 (0.8130) and the ensemble model led in recall@5 (0.8824), demonstrating the value of model integration. Ablation results showed that synthetic data substantially improved classification and ranking performance of models. Synthetic data improved dataset balance, enhancing model generalization.
Originality/value -
This study provides a novel comparison of transformer-based and statistical machine learning models for LCSH assignment, validated through both ablation and statistical significance testing, pioneering the use of synthetic data and probability-weighted ensembles to improve accuracy and ranking. It offers actionable insights for library automation, bridging gaps in prior research focused on narrower model sets.
Açıklama
Konusu
automatic subject indexing, Library of Congress subject headings, LCSH, machine learning, transformer models, ensemble approaches, synthetic data, library automation
Alıntı
Usta, G. (2025). "Transformer and statistical models for LCSH assignment: a comparative study in digital libraries". The Electronic Library. https://doi.org/10.1108/EL-03-2025-0102
