Transformer and statistical models for LCSH assignment: a comparative study in digital libraries

Usta, Gökhan

Files

EL-03-2025-0102.pdf (2.32 MB)

Date

2025

Authors

Usta, Gökhan

Publisher

Emerald Publishing

Abstract

Purpose - This study examines the effectiveness of machine learning models and ensemble approaches for automating Library of Congress Subject Headings (LCSH) assignment to graduate theses and dissertations, with the goal of improving the efficiency, scalability and accuracy of library subject indexing in the digital age.
Design/methodology/approach - A comparative quasi-experimental framework assessed five machine learning models (DeBERTa-v3-base, all-mpnet-base-v2, FastText, Omikuji Bonsai and term frequency-inverse document frequency [TF-IDF]) and two ensemble strategies (hybrid: DeBERTa + MPNet; ensemble: FastText + Omikuji Bonsai + TF-IDF) on a dataset of 1,104,600 thesis and dissertation titles across 1,578 LCSH labels, integrating organic and synthetic data. Synthetic titles were generated using large language models and rigorously validated to mitigate bias and prevent dataset imbalance. Performance was evaluated using F1, recall@5, NDCG@5, MRR and computational efficiency metrics (RAM usage and prediction time). Paired t-tests were conducted to confirm the statistical significance of key performance differences.
Findings - Transformer-based models (DeBERTa-v3-base: F1 0.7348; all-mpnet-base-v2: F1 0.7277) excelled in accuracy, whereas statistical models (e.g. FastText: 0.36 MiB RAM, 0.0006 s prediction time) offered superior efficiency. The hybrid model achieved the highest F1 (0.7413) and NDCG@5 (0.8130), and the ensemble model led in recall@5 (0.8824), demonstrating the value of model integration. Ablation results showed that synthetic data substantially improved the models' classification and ranking performance. Synthetic data also improved dataset balance, enhancing model generalization.
Originality/value - This study provides a novel comparison of transformer-based and statistical machine learning models for LCSH assignment, validated through both ablation and statistical significance testing, pioneering the use of synthetic data and probability-weighted ensembles to improve accuracy and ranking. It offers actionable insights for library automation, bridging gaps in prior research focused on narrower model sets.
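
For readers unfamiliar with the ranking metrics named in the abstract, the following is a minimal sketch of how recall@5, NDCG@5 and MRR are commonly computed for one ranked list of predicted headings. It assumes binary relevance (a heading is either correct or not); the abstract does not state the exact formulas used in the study, and the function names and example headings are hypothetical.

```python
import math

def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant labels that appear in the top-k predictions."""
    if not relevant:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1/log2(2)
        for pos, label in enumerate(ranked[:k])
        if label in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant label, or 0 if none is found."""
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_ranked, all_relevant):
    """MRR: average reciprocal rank over all evaluated titles."""
    scores = [reciprocal_rank(r, g) for r, g in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: one title, five predicted headings, two correct ones.
ranked = ["Machine learning", "Libraries", "Subject headings",
          "Cataloging", "Neural networks"]
relevant = {"Subject headings", "Cataloging"}

print(recall_at_k(ranked, relevant))      # both correct headings are in the top 5
print(ndcg_at_k(ranked, relevant))        # penalized: they sit at ranks 3 and 4
print(reciprocal_rank(ranked, relevant))  # first correct heading is at rank 3
```

Recall@5 only asks whether the correct headings appear among the top five, while NDCG@5 and MRR also reward placing them near the top, which is why a model can lead on one metric and not the others.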

Subject

automatic subject indexing, Library of Congress subject headings, LCSH, machine learning, transformer models, ensemble approaches, synthetic data, library automation

Citation

Usta, G. (2025). "Transformer and statistical models for LCSH assignment: a comparative study in digital libraries". The Electronic Library. https://doi.org/10.1108/EL-03-2025-0102

URI

https://doi.org/10.1108/EL-03-2025-0102
http://hdl.handle.net/11527/27936

Collections

Artificial Intelligence and Data Engineering (Yapay Zeka ve Veri Mühendisliği)
