Publication:
Transformer and statistical models for LCSH assignment: a comparative study in digital libraries

Publisher

Emerald Publishing

Abstract

Purpose - This study examines the effectiveness of machine learning models and ensemble approaches for automating Library of Congress Subject Headings (LCSH) assignment to graduate theses and dissertations, with the goal of enhancing the efficiency, scalability and accuracy of library subject indexing in the digital age.

Design/methodology/approach - A comparative quasi-experimental framework assessed five machine learning models (DeBERTa-v3-base, all-mpnet-base-v2, FastText, Omikuji Bonsai, term frequency-inverse document frequency [TF-IDF]) and two ensemble strategies (hybrid: DeBERTa + MPNet; ensemble: FastText + Omikuji Bonsai + TF-IDF) on a dataset of 1,104,600 thesis and dissertation titles across 1,578 LCSH labels, integrating organic and synthetic data. Synthetic titles were generated using large language models and rigorously validated to mitigate bias and prevent dataset imbalance. Performance was evaluated using F1, recall@5, NDCG@5, MRR and computational efficiency metrics (RAM usage and prediction time). Paired t-tests were conducted to confirm the statistical significance of key performance differences.

Findings - Transformer-based models (DeBERTa-v3-base: F1 0.7348; all-mpnet-base-v2: F1 0.7277) excelled in accuracy, whereas statistical models (e.g. FastText: 0.36 MiB, 0.0006 s) offered superior efficiency. The hybrid model achieved the highest F1 (0.7413) and NDCG@5 (0.8130), and the ensemble model led in recall@5 (0.8824), demonstrating the value of model integration. Ablation results showed that synthetic data substantially improved the models' classification and ranking performance. It also improved dataset balance, enhancing model generalization.

Originality/value - This study provides a novel comparison of transformer-based and statistical machine learning models for LCSH assignment, validated through both ablation and statistical significance testing, and pioneers the use of synthetic data and probability-weighted ensembles to improve accuracy and ranking. It offers actionable insights for library automation, bridging gaps in prior research that focused on narrower model sets.
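For readers unfamiliar with the ranking metrics named in the abstract, the following is a minimal sketch of recall@k, NDCG@k and reciprocal rank for a single title, assuming binary relevance (a predicted heading is either one of the assigned headings or it is not) and a ranked list without duplicates. The function names and the example headings are hypothetical; the abstract does not state the exact formulations used in the study, which may differ.

```python
import math

def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant labels found in the top-k predictions."""
    if not relevant:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=5):
    """Normalized discounted cumulative gain with binary relevance."""
    # Each hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, label in enumerate(ranked[:k])
        if label in relevant
    )
    # The ideal ranking places all relevant labels (up to k) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant label, or 0.0 if none is found."""
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical example: five predicted headings, one assigned heading.
ranked = ["Machine learning", "Libraries", "Cataloging", "Education", "Physics"]
relevant = {"Cataloging"}

print(recall_at_k(ranked, relevant))      # 1.0
print(ndcg_at_k(ranked, relevant))        # 0.5
print(reciprocal_rank(ranked, relevant))  # 0.333...
```

MRR is the mean of the reciprocal rank over all evaluated titles; recall@5 and NDCG@5 are likewise averaged per title in the usual formulation.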

Subject

automatic subject indexing, Library of Congress subject headings, LCSH, machine learning, transformer models, ensemble approaches, synthetic data, library automation

Citation

Usta, G. (2025). "Transformer and statistical models for LCSH assignment: a comparative study in digital libraries". The Electronic Library. https://doi.org/10.1108/EL-03-2025-0102
