ISTANBUL TECHNICAL UNIVERSITY, GRADUATE SCHOOL

MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL

M.Sc. THESIS
Atalay ÇELİK (528211093)
Department of Data Engineering and Business Analytics
Big Data and Business Analytics Programme
Thesis Advisor: Asst. Prof. Mehmet Ali ERGÜN
JULY 2025

Atalay Çelik, an M.Sc. student of ITU Graduate School with student ID 528211093, successfully defended the thesis entitled “MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL”, which he prepared after fulfilling the requirements specified in the associated legislation, before the jury listed below.

Thesis Advisor: Asst. Prof. Mehmet Ali ERGÜN, Istanbul Technical University
Jury Members: Assoc. Prof. Dr. Ömer Faruk BEYCA, Istanbul Technical University
Asst. Prof. Merve ŞAHİN, Ibn Haldun University

Date of Submission: 30 May 2025
Date of Defense: 1 July 2025

To my dear wife and family,

FOREWORD

Because my main area of interest is LLMs, I have researched and worked extensively with transformer architectures. I have been curious about applying this architecture to fields other than LLMs, and time-series domains were one of the fields I wanted to explore.
Although healthcare and signal processing are not my areas of expertise, 12-lead ECG classification caught my attention as a possible application. Adapting to these new areas within a strict time period was challenging, and my lack of expertise in them slowed me down during the study. Despite these challenges, I am happy with the results. Working with a model architecture from the dataset level to the training level has been a valuable learning experience and a tough challenge.

I want to thank my advisor, who helped me with the topic selection and advised me when I struggled with specific tasks in my thesis. This thesis, which has now come to an end, became a part of my marriage. I especially want to thank my wife, who has been supportive and patient towards me and my thesis. I also want to thank my family, who have been encouraging me from afar.

May 2025
Atalay ÇELİK

TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET
INTRODUCTION
    Information About Electrocardiogram
    Diagnostic Challenges
    Machine Learning-Based Approaches
    Transformer-Based Models
    Purpose of the Thesis
LITERATURE REVIEW
    Datasets
    Traditional Machine Learning Approaches
    Deep Learning-Based Approaches
    Transformer Architecture: Foundation and Evolution
    Transformer Applications in ECG Classification
    Mixture-of-Experts Models
    Mixture-of-Experts for Time Series Classification
    Mixture-of-Experts Applications in ECG Analysis
    Mixture-of-Experts Transformer for ECG Classification
METHOD
    Dataset
    Preprocessing
        3.2.1 Outlier analysis
        3.2.2 Label encoding
    Feature Extraction
    Model Architecture
        3.4.1 Signal projection and early feature fusion
        3.4.2 Transformer
        3.4.3 Mixture-of-experts
    Training
        3.5.1 Evaluation metrics
            3.5.1.1 Macro and micro AUC
            3.5.1.2 Top-K accuracies
            3.5.1.3 Precision, recall and F1 score
            3.5.1.4 Exact match
            3.5.1.5 PhysioNet challenge metric
        3.5.2 Experimentation
RESULTS
    Results of the Proposed Model
    Comparisons
        4.2.1 BiLSTM
        4.2.2 CNN
        4.2.3 Transformer
        4.2.4 Comparison results
CONCLUSIONS AND RECOMMENDATIONS
REFERENCES
APPENDICES
    APPENDIX A: Dataset Information
CURRICULUM VITAE

ABBREVIATIONS

AI : Artificial Intelligence
ANN : Artificial Neural Network
AUC-ROC : Area Under the Curve - Receiver Operating Characteristic Curve
BCE : Binary Cross Entropy
BERT : Bidirectional Encoder Representations from Transformers
BiLSTM : Bi-Directional LSTM
CAD : Coronary Artery Disease
CLIP : Contrastive Language-Image Pre-Training
CNN : Convolutional Neural Network
CPSC : The China Physiological Signal Challenge
CVD : Cardiovascular Disease
DAE : Denoising Autoencoder
DNN : Deep Neural Network
ECG : Electrocardiogram
FFN : Feed Forward Neural Network
FIR : Finite Impulse Response
G12EC : Georgia 12-Lead ECG Challenge
GELU : Gaussian Error Linear Unit
GPT : Generative Pre-Trained Transformer
GPU : Graphics Processing Unit
GRU : Gated Recurrent Unit
IIR : Infinite Impulse Response
KL : Kullback-Leibler
k-NN : K-Nearest Neighbour
LDA : Linear Discriminant Analysis
LLM : Large Language Model
LSTM : Long Short-Term Memory
MAD : Mean Absolute Deviation
ML : Machine Learning
MoE : Mixture-of-Experts
NLP : Natural Language Processing
PCA : Principal Component Analysis
PPG : Photoplethysmography
PSD : Power Spectral Density
ReLU : Rectified Linear Unit
RF : Random Forest
RNN : Recurrent Neural Network
RoPE : Rotary Positional Embedding
SHAP : Shapley Additive Explanations
SNR : Signal-to-Noise Ratio
SVM : Support Vector Machine
ViT : Vision Transformer
WFDB : Waveform Database

SYMBOLS

d_k : Head Dimension
K_h : Key Matrix
mV : Millivolts
Q_h : Query Matrix
V_h : Value Matrix

LIST OF TABLES

Table 3.1 : Dataset information.
Table 3.2 : Merged diagnoses groups.
Table 3.3 : Statistics of datasets.
Table 3.4 : Outlier distribution for each dataset.
Table 4.1 : Evaluation results of proposed model for test and validation set.
Table 4.2 : Number of records for 5 most frequent diagnoses and F1 scores.
Table 4.3 : Number of records for 5 least frequent diagnoses and F1 scores.
Table 4.4 : Evaluation metrics for model comparisons.
Table A.1 : Class distribution.

LIST OF FIGURES

Figure 1.1 : Annotated ECG signal waveforms (Rawshani, n.d.-a).
Figure 3.1 : Co-occurrence heatmap of top 10 diagnoses.
Figure 3.2 : Outlier examples for different detection methods.
Figure 3.3 : External features extracted from ECG record with Neurokit2.
Figure 3.4 : Preprocessing pipeline and model architecture.
Figure 3.5 : Details of the transformer architecture, inputs and outputs.
Figure 3.6 : Internal structure of mixture-of-experts module.
Figure 3.7 : Comparison of total loss and F1 scores for model dimensions.
Figure 3.8 : Comparison of auxiliary loss and F1 scores for number of experts.
Figure 3.9 : Comparison of total loss and F1 scores for different projections.
Figure 3.10 : Comparison of total loss and F1 scores for learning rates.
Figure 3.11 : Dead neuron percentage plots for different encoder layers.
Figure A.1 : Example header file.
Figure A.2 : Distribution of number of diagnoses by dataset.

MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL

SUMMARY

The electrocardiogram (ECG) measures the electrical activity of the heart and is an important indicator for the detection of cardiac abnormalities. In general, ECG signals are measured from 10 different electrodes, and 12 different leads are derived from these measurements. This is done to capture activity from different angles of the heart, so each lead carries different information about cardiac rhythm. Some diagnoses can be made by analyzing these ECG records in detail and looking at specific characteristics of the signal, such as peaks and the transitions between peaks. Abnormal patterns in these signals can only be detected by experts in the domain, and even for them challenges remain in practice.

Automation of this process has interested researchers for a long time. Most early work focuses on peak detection and beat classification. Advances in machine learning and deep learning have brought these tasks to maturity, with implementations that achieve high accuracy and effectiveness. Automatic detection of the diagnosis is a more complex problem and is still commonly studied. Competitions such as the PhysioNet/CinC Challenge have targeted this domain for several years and have attracted considerable interest and produced strong results. The availability of ECG datasets has accelerated this research: several high-quality datasets are open-source and available for research purposes.

As the transformer architecture has become more prevalent through advances in large language models, new approaches to the problem have emerged. Several studies implement transformer-based models for the ECG classification task. The attention mechanism built into the model and its high computational capacity give it considerable potential for signal processing.
The use of these models for time-series data is still scarce and under development. Deep learning methods such as long short-term memory or gated recurrent units dominate and are commonly used. Although strong studies exist with the transformer architecture, they mostly focus on forecasting in fields such as finance and weather. Another branch of transformer-based models is the mixture-of-experts approach, in which multiple experts are introduced within the model and their activation is controlled by the incoming data. At the time of this study, the literature contains few implementations with these model characteristics.

This study aims to implement a mixture-of-experts transformer model that classifies ECG records into multiple diagnoses in a multi-label manner. Each record contains 10 seconds of ECG signal sampled at 500 Hz, with demographic features available. There are 26 different diagnosis labels in total, and each record has one or more labels attached to it. These labels are the prediction target. For this study, 81,926 labeled ECG records from six different datasets are used.

For preprocessing, a digital signal filtering approach with a finite impulse response (FIR) filter is used and each record is normalized. Different preprocessing configurations are tested and the best parameter set is chosen according to signal-to-noise ratio. For the outlier analysis, a triple voting system is used with three different methods: Z-score, principal component analysis and isolation forest. The records which receive 2 out of 3 votes are removed from the dataset. Another important step is to extract external features from the ECG records to feed into the model. In this study, several methods are used to extract features such as peaks and offset values.
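The two-out-of-three outlier vote described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the thresholds (`threshold=3.0`, `quantile=0.95`), the number of principal components and the per-record feature matrix are all hypothetical. The isolation forest vote is passed in as a precomputed boolean mask (it could come, for example, from scikit-learn's `IsolationForest`, whose `fit_predict` marks outliers with -1).

```python
import numpy as np


def zscore_flags(features, threshold=3.0):
    """Flag rows whose largest absolute per-feature z-score exceeds the threshold."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features
    z = np.abs((features - mean) / std)
    return z.max(axis=1) > threshold


def pca_flags(features, n_components=2, quantile=0.95):
    """Flag rows whose PCA reconstruction error lies above the given quantile."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    reconstructed = centered @ basis.T @ basis
    error = np.sum((centered - reconstructed) ** 2, axis=1)
    return error > np.quantile(error, quantile)


def majority_vote(*flag_arrays, min_votes=2):
    """Mark a record as an outlier when at least min_votes detectors flag it (assumed rule)."""
    votes = np.sum(np.stack(flag_arrays), axis=0)
    return votes >= min_votes


# Hypothetical usage: 200 records, 5 summary features each, one injected outlier.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
features[0] = 50.0
forest_mask = np.zeros(200, dtype=bool)  # placeholder for an isolation forest vote
forest_mask[0] = True
outliers = majority_vote(zscore_flags(features), pca_flags(features), forest_mask)
keep = features[~outliers]
```

Requiring agreement between detectors makes the removal more conservative than any single method, at the cost of keeping records that only one detector finds suspicious.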
The model takes as input the signal values from the 12 ECG channels and the extracted signal features. The signals and the features are passed through 1D convolutional layers to enhance time-dependent features, and are then concatenated with the demographic features. All of these features are projected into the model dimension. The main model block consists of standard encoder blocks with self-attention layers, pre-layer normalization and skip connections. After three standard encoder blocks, three encoder blocks with mixture-of-experts modules are used. The transformer attends to important information between time tokens. The mixture-of-experts modules have gating networks which route the time tokens to different experts depending on their characteristics.

The training setup is designed to allow experiments with different configurations. Each parameter is optimized on a uniform subset of the data. The main model is trained on all data with the optimized configuration for 20 epochs, using a learning rate of 3e-3 with cosine annealing. A batch size of 16 is used with 16 gradient accumulation steps, giving an effective batch size of 256. The training runs use dropout, warm-up steps and early stopping. The model dimension is set to 96 and the feed-forward dimension to 384; there are 6 encoder blocks, each with 6 attention heads. Each mixture-of-experts module is composed of 4 experts with top-1 routing.

Task-specific metrics are constructed for testing and evaluation. The model is trained on 80 % of the data, tuned on a 10 % validation set and tested on a 10 % test set. On the test set, covering 6 datasets and 26 diagnoses in a multi-label setting, the model achieves a 59.98 % macro F1 score, a 54.17 % exact match score, 55.33 % top-1, 75.22 % top-2 and 84.80 % top-3 accuracies, and a 95.74 % macro AUC-ROC. The model achieves a good result in a difficult ECG classification scenario.
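The top-1 routing used in the mixture-of-experts modules can be illustrated with a small NumPy sketch. The sizes (model dimension 96, feed-forward dimension 384, 4 experts) come from the text; the GELU activation, the scaling of the expert output by the gate probability, the random weights and all function names are assumptions made for illustration, and any load-balancing loss is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def moe_top1(tokens, gate_w, experts):
    """Route each token to its single highest-scoring expert.

    tokens:  (n_tokens, d_model) time-token embeddings
    gate_w:  (d_model, n_experts) gating weights
    experts: list of (w1, w2) feed-forward weight pairs
    """
    probs = softmax(tokens @ gate_w)   # (n_tokens, n_experts)
    choice = probs.argmax(axis=-1)     # index of the top-1 expert per token
    out = np.zeros_like(tokens)
    for idx, (w1, w2) in enumerate(experts):
        mask = choice == idx
        if mask.any():
            hidden = gelu(tokens[mask] @ w1)
            # Assumed design: weight the expert output by its gate probability.
            out[mask] = (hidden @ w2) * probs[mask, idx][:, None]
    return out, choice


# Hypothetical usage with the dimensions reported in the text.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 96, 384, 4
tokens = rng.normal(size=(10, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]
out, choice = moe_top1(tokens, gate_w, experts)
```

Because only one expert runs per token, the per-token compute stays close to that of a single feed-forward block even though the module holds four sets of expert weights.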
The task requires a high level of expertise and is a complex problem with different facets.

UZMANLARIN KARIŞIMI BAZLI DÖNÜŞTÜRÜCÜ MODELİ İLE 12 KANALLI EKG SİNYALİNİN ÇOK ETİKETLİ SINIFLANDIRILMASI

ÖZET

The electrocardiogram (ECG) measures the electrical activity of the heart and is an important tool for detecting heart disease. ECG data are collected from patients with 10 electrodes, and data for 12 leads are then derived from these measurements by calculation. These leads describe the electrical activity of the heart from different angles. ECG data are used extensively in the diagnosis of heart disease.

ECG signals consist of electrical signals whose characteristics, tied to the heartbeat, are well known. These signals behave differently in different leads. Wave types such as P, Q, R, S and T are the terms used to describe the characteristic parts of the signal. The positions of these waves, their signal values and their positions relative to one another carry important diagnostic information. However, to reach the correct diagnosis from data coming from 12 leads, the evaluator has to draw on different kinds of expertise and methods. Making diagnoses correctly requires a high level of literature knowledge and experience. Correct diagnoses can only be made by specialist physicians after examining these data in detail.

Automating the process of diagnosing heart disease from ECG data is a topic researchers have pursued with interest. Early work targeted tasks such as peak detection and beat classification on ECG data. With advances in methods such as machine learning and deep learning, these tasks have been studied extensively in the literature and have reached sufficient maturity with high accuracy. Automatic diagnosis, a more complex type of problem, is still under study and has not yet been solved with high performance. Studies in this area try to solve the problem with more advanced model types in addition to classical methods such as machine learning. Competitions such as CinC, organized by the PhysioNet platform, have motivated researchers in this area for several years, and progress has been made on the ECG classification problem through this work. Another motivation is the abundance of openly shared datasets. These datasets, shared for research purposes from different regions of the world, accelerate research.

Transformer-based models, which transformed the field of large language models, have brought new approaches to this area as well. Many studies try to solve the ECG classification problem using the self-attention and high computational capacity of transformer-based models. The use of transformer-based models for time-series datasets is limited, and models such as long short-term memory are used more heavily. For time series, transformer-based models are generally used for forecasting in areas such as finance and weather.

Mixture-of-experts models have been applied successfully in the large language model field in various implementations. In this method, the computation layers of the model are split into multiple experts, and a routing layer directs incoming inputs to experts according to their characteristics. The routing function is itself a layer learned by the model. To distribute inputs evenly among the experts and keep expert utilization balanced, additional loss terms are included in the model's overall loss function. At the time of this study, work using this type of model for ECG classification was very rare.

This study aims to predict, in a multi-label manner, the classes of ECG records carrying one or more diagnosis labels using a mixture-of-experts transformer model. The predictions use 10-second, 500 Hz, 12-lead ECG data together with demographic information such as the age and sex of the patient. There are 26 diagnoses in total, and every ECG record carries at least one. These diagnoses are the labels to be predicted. A total of 81,926 labeled records collected from six databases are used.

Digital signal processing methods were used for preprocessing. First, to give the data a standard length, each record was cropped and resampled to 10 seconds at 500 Hz. Different filtering methods were then tried to suppress electrical noise in the records. Butterworth, finite impulse response and infinite impulse response filters were compared at different frequency values on a subset according to signal-to-noise ratio. The finite impulse response filter and 500 Hz were chosen because they gave higher-quality results. In addition, a voting system with three methods was set up to detect outliers. Using Z-score, principal component analysis and isolation forest, the features of each record were examined and records exceeding certain thresholds were flagged. Records that received 2 of the 3 votes were removed from the dataset, yielding higher-quality data.

Another method often used in the literature is to feed the model additional external data besides the signal values. At this stage, established methods were used to detect points such as the peaks, onsets and offsets of the different wave types for each lead of the ECG records, and these were stored as additional features.

The model architecture was designed to learn this time-based data in detail. First, so that they can be fed to the model and their temporal features learned, the ECG signals from the 12 leads and the extracted features are passed through a one-dimensional convolutional network to obtain a higher-dimensional representation. This representation is then combined through projection layers and fed to the transformer, the main block of the model. Besides these two data streams, the demographic data are projected in a similar way and fed to the model alongside them. In the transformer block, half of the encoder layers use a standard feed-forward network so that basic features can be learned, and the other half use networks containing the mixture-of-experts module. The transformer aims to build a network of relations within the data through its attention mechanism and so to understand the data better. The expert structure aims to organize the model's learning layers by routing parts with different characteristics to different experts, which lets the model behave sparsely.

A system was set up to track the metrics and configurations used in training and to inspect and log the experiments. Within this system, experiments were run on a uniformly distributed subset of the data to find suitable settings. Basic parameters such as the learning rate, model embedding size and number of experts were determined through these experiments. The model's inner embedding dimension was set to 96 and the hidden size of all feed-forward networks to 384. The transformer uses 6 encoder blocks with 6 heads each, and the expert system uses 4 experts in total with a single active expert at a time.

With the selected configuration, the model was trained for 20 epochs on all data from the six datasets with a learning rate of 3e-3, decayed over training with cosine annealing. The batch size was 16 with 16 gradient accumulation steps, giving an effective batch size of 256. At the start of training, warm-up steps raise the learning rate linearly to the base learning rate. Early stopping is applied to keep the model from memorizing the data.

During training, evaluations were examined using metrics defined specifically for the classification task. The trained model was tested on a test set made up of 10 % of the data. The tests yielded a 59.98 % F1 score, 54.17 % exact match, 55.33 % top-1, 75.22 % top-2 and 84.80 % top-3 accuracy, and a 95.74 % AUC-ROC value. To interpret the model, sample records were analyzed using GradCAM and the attention layers.

In conclusion, the model performed well on a task as difficult as ECG diagnosis classification. Considering that in some cases even experts cannot yet interpret and evaluate these data objectively, it is understandable that the model's performance is lower than the rates typical of such applications. In addition, the use of 6 different datasets, differences in labeling standards between datasets, differences in the experience of the labeling experts and mislabeling are among the factors affecting data quality.

INTRODUCTION

Healthcare is one of the main application areas of artificial intelligence (AI), where it is used to diagnose, treat and research efficiently and effectively. These applications are motivated by the wide range of practical use cases in healthcare, from vision-based detection to genetic disease diagnosis. One of the main application areas is the diagnosis of chronic heart diseases. According to the World Health Organization (2024) and the Centers for Disease Control and Prevention (2024), heart disease remains the leading cause of death regardless of gender or racial and ethnic background. Given their substantial global impact on morbidity, mortality, disability, and healthcare costs, early diagnosis of cardiovascular diseases (CVDs), diabetes, and their risk factors is critical to implementing lifestyle modifications and providing timely treatment for affected individuals (Facciolà et al., 2022, p. 939).

Information About Electrocardiogram

The electrocardiogram (ECG) is a method for measuring electrical activity in the heart. The measurements are taken from multiple locations, recorded as voltages and then digitized. When plotted, these numerical values form waveforms called ECG records. These records help medical staff diagnose heart-related conditions. According to a detailed article from Cardiovascular Medicine, the sensors that measure these electrical signals are called electrodes, and they are placed at 10 different locations on the body. The measurements taken from the electrodes are then used to calculate 12 different values, which are called leads.
The signal of each lead is calculated by taking one electrode as the reference and one or more others as exploring electrodes. The 12 leads represent 12 different viewing angles of the heart's electrical activity. Leads are divided into two main groups, limb leads and chest leads. There are six limb leads, of which three are augmented. These are I, II, III, aVR, aVL and aVF; the leads starting with the letter ‘a’ are the augmented leads. Each augmented lead compares one limb electrode against the mean of the other two limb electrodes. Limb leads measure the electrical activity in the frontal plane. In contrast, chest leads measure the electrical difference in the horizontal plane. The six chest leads are called V1, V2, V3, V4, V5 and V6 (Rawshani, n.d.-b).

The heart's movements are triggered by the electrical signals it receives, which form a wave pattern for each cardiac cycle. The heartbeat is composed of several stages. In the atrial depolarization stage, the depolarization signal travels from the SA node through the atria, which forms the P wave. This minor peak is followed by a flat segment until the QRS complex begins; the span from the start of the P wave to the start of the QRS complex is called the PR interval. In the ventricular depolarization stage, the QRS complex of the ECG signal is formed. The Q, R and S waves are the first negative deflection, the first positive deflection and the following negative deflection, respectively. Together they correspond to the ventricular depolarization and contraction steps of the heart. Following the QRS complex, the ST segment covers the period in which both ventricles are fully depolarized. The T wave, which follows the ST segment, represents the repolarization of the contractile cells. In leads V2-V4, a U wave is also sometimes observed. These waves are depicted in Figure 1.1. The values of these waves change depending on the lead, since the leads represent different angles of electrical activity (Klabunde, 2023; Rawshani, n.d.-c).

Figure 1.1 : Annotated ECG signal waveforms (Rawshani, n.d.-a).
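The limb-lead relationships in this section can be written out explicitly. The sketch below uses the standard Einthoven and Goldberger definitions, which the text does not spell out; the electrode names (`ra`, `la`, `ll` for right arm, left arm and left leg) and the sample potentials are illustrative assumptions.

```python
def limb_leads(ra, la, ll):
    """Derive the six limb leads from three limb electrode potentials (in mV).

    Standard definitions (assumed here, not quoted from the thesis):
    bipolar leads are differences between two electrodes, and each
    augmented lead compares one electrode with the mean of the other two.
    """
    return {
        "I": la - ra,
        "II": ll - ra,
        "III": ll - la,
        "aVR": ra - (la + ll) / 2,
        "aVL": la - (ra + ll) / 2,
        "aVF": ll - (ra + la) / 2,
    }


# Hypothetical electrode potentials at a single time instant, in mV.
leads = limb_leads(ra=-0.2, la=0.3, ll=0.5)
for name, value in leads.items():
    print(f"{name}: {value:+.2f} mV")
```

Two consequences of these definitions are that lead II equals the sum of leads I and III, and that the three augmented leads sum to zero, so only two of the six limb leads are linearly independent.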
On some occasions, instead of a usual ECG measurement that ranges from 10 to 120 seconds, a Holter monitor is used. The Holter monitor is a portable ECG device that measures heart activity for 24 hours or more (Holter Monitor, n.d.).

Diagnostic Challenges

The 12-lead ECG record gives a general overview of heart activity and is commonly used to detect several different conditions. These diagnoses are made by medical staff trained in interpreting ECG signals. However, the inherent nature of the problem poses a significant challenge. The signal is measured in millivolts (mV) and is time-sensitive. On top of that, the number of possible diagnoses is high, and a 12-lead ECG signal carries a large amount of information. All of these factors make it difficult, even for experienced staff, to interpret the signal clearly and effectively. Even though there are established practices for understanding and interpreting ECG records, applying them is not simple. This also creates a problem of objectivity, where different clinical practitioners reach different results because of the complexity of the data. According to a study by Dhutia et al., these gaps between interpretations of the same ECG records are present not only for inexperienced physicians but also for experienced ones (2017, p. 7). Education in ECG diagnosis is also a challenge in itself: as described by Kashou et al., only a fraction of the time in medical education is allocated to ECG interpretation (2020, p. 125). Because ECG signals are measured at such a small scale, noise and variability can affect the diagnosis results dramatically.
Noise sources such as power line interference, electrode contact noise, patient motion artifacts, electrical activity in muscles, signal processing hardware or software noise, and quantization loss disturb and change the profile of the original electrical activity (Clifford, 2006, p. 69).

Machine Learning-Based Approaches

ECG records are signals received from multiple sensors that measure electrical voltage. Since these signals span a time interval, the problem becomes a signal processing problem. Signal processing has been around for a long time, mainly emerging from the field of communication (Shannon, 1948, p. 379). The methods in this field have evolved alongside the increase in computational power brought by advances in hardware. Furthermore, with the rise of supervised and unsupervised pattern recognition models, signal processing problems are also tackled with these new models (Cooper & Cooper, 1964, p. 416). Today, machine learning and related technologies are increasingly used in the medical field. From clinical decision support systems to the evaluation of public health records, healthcare has been a prominent application area of machine learning (ML) (Alanazi, 2022, p. 3). These applications are driven not only by the increase in computational power but also by the availability of large datasets and data collection systems, since ML performs better in the presence of diverse data sources (Mazumder, 2024, p. 8). Medical data in the form of signals need advanced feature extraction methods because of their complex nature. ECG records are one of the cases where complex behaviors in the data can be observed. Without enough information, it becomes difficult to interpret the data.
The problem of diagnosing a patient from their ECG record poses a great challenge that requires expertise in medicine, signal processing and machine learning or artificial intelligence-based methods. This interdisciplinary nature of the problem gives the opportunity to apply a collection of advanced techniques together. After the surge of ML-based applications, the field has moved to more complex and generalizable models, called deep learning-based approaches. These approaches can be scaled further to retain more information and to perform more computation within the model. For complex problems, this advantage becomes the main reason for the performance gain over traditional ML methods. Traditional machine learning approaches excel at interpreting and estimating structured data such as financial data, whereas deep learning-based approaches are much more useful for unstructured data such as image, speech and language (Sharifani & Amini, 2023, pp. 3900–3901). Deep learning-based approaches are particularly useful for discovering hidden patterns that require heavy computation. Since the model architectures scale deeper or wider, there is more room for computation than in the traditional approaches. As the work of Yu and Zhang shows, another difference is that in traditional methods feature extraction is performed outside the model, with methods like principal component analysis (PCA) or linear discriminant analysis (LDA). Deep learning-based approaches perform feature extraction as part of the architecture itself, as in recurrent neural networks (RNN), convolutional neural networks (CNN) or transformers (Yu & Zhang, 2022, pp. 217, 219). Introducing features from outside gives the model an external information feed and bias, whereas letting the model find the patterns in the data without intervention lets it generate representations of the data by itself.
This attribute of deep learning-based methods causes them to behave in a complex way, which makes them hard to interpret. A study in the medical field concludes that deep learning-based models become much more difficult to interpret as the scale of the model increases, even though that scale is necessary to increase accuracy. The work suggests using multiple modalities to interpret model behavior and designing interpretability methods for domain-specific use cases, for example medical diagnosis (Teng et al., 2022, p. 2351).

Transformer-Based Models

The development of deep learning not only introduced a new way of approaching complex problems but also brought new possibilities in terms of architectural breakthroughs. After the surge of models in the field of image processing and recognition, from basic CNNs to advanced architectures such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015), Inception (Szegedy et al., 2014), and ResNet (He et al., 2016), the abilities of the models expanded greatly. These developments drove application areas of image recognition such as autonomous cars. The same architectural breakthrough happened in natural language processing. Tasks such as machine translation and language understanding were approached mainly with specialized deep learning models, namely RNNs (Rumelhart et al., 1986) and long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997). An important work by Sutskever, Vinyals and Le introduced the encoder-decoder LSTM architecture for machine translation tasks, which achieved high accuracy compared to other methods. This work introduced the encoder-decoder architecture into natural language processing (2014). In the same period, another important method, the attention mechanism, was successfully applied to machine translation. Following these developments, Vaswani et al. combined methods such as residual connections from the ResNet architecture, the attention mechanism and the encoder-decoder architecture into a new model type called the transformer, which proved successful in language understanding (2017). This architectural breakthrough drew attention from other fields, from generative models to audio understanding, and acted as a catalyst for their development. The transformer architecture mainly excels at understanding sequences. Even though its most successful applications are in language understanding and generation, there are also use cases where time-dependent understanding is crucial. Weather forecasting (Kurth et al., 2023) and stock market prediction (Wang et al., 2022) are common areas where the architecture is used. Even though the original transformer architecture has stayed the same with minimal changes, a body of literature has evolved around it to fully exploit its potential with different configurations. One of these configurations is the mixture-of-experts approach proposed by Shazeer et al., where multiple experts are introduced into the model (2017).

Purpose of the Thesis

Reliable and quick diagnosis of heart-related conditions, independent of the medical staff and their experience, can have a significant impact. This is especially crucial in regions with limited access to experienced professionals for ECG interpretation. This research explores the challenge of diagnosing a patient with multiple conditions by using their 12-lead ECG recordings combined with demographic features such as age and gender. The proposed approach classifies ECG recordings in a multi-label setting by using a mixture-of-experts (MoE) time-series transformer architecture that is fed exogenous features about the ECG waves and demographic attributes. The implementation includes state-of-the-art architecture practices.
LITERATURE REVIEW

The ECG classification problem has attracted the attention of researchers for a long time. The availability of data in the field and the importance of the task have motivated studies with various technologies and methods. This literature review covers work that uses machine learning technologies, starting from traditional machine learning methods and moving to more advanced methods that include the transformer architecture.

Datasets

Several open-source datasets are available for ECG analysis and classification. These datasets help researchers build effective models on data with different distributions and characteristics such as race, gender and region. Some important and commonly used datasets are INCART (Goldberger et al., 2000), MIT-BIH (Moody & Mark, 2001), PTB (Bousseljot et al., 2009), The China Physiological Signal Challenge (CPSC) (Ng et al., 2018), CODE 15% (Ribeiro et al., 2019), Chapman-Shaoxing (Zheng, Zhang, et al., 2020), PTB-XL (Wagner et al., 2020), Ningbo (Zheng, Chu, et al., 2020) and SPH (Liu et al., 2022).

Traditional Machine Learning Approaches

Since discovering hidden features and patterns is harder for traditional machine learning methods, studies in this group include explicit feature extraction from ECG data. In one study, wavelets are used to extract features from the ECG records, and the extracted features are then used to train a support vector machine (SVM) to classify ECGs into six categories (Daamouche et al., 2012, pp. 343–344). Another study extracts features such as QRS complex positions, RR interval, frequency features, and QS power and trains an SVM classifier (Zidelmal et al., 2013, pp. 572–573). Li et al. proposed a five-level ECG signal quality classification method using an SVM trained on simulated data with added real ECG noise. The model is tested on both re-annotated and synthetic data (2014, pp. 437, 442). Diker et al.
compared SVM, artificial neural network (ANN) and k-Nearest Neighbour (k-NN) classifier models for classifying ECG signals using morphological features (2018). As seen in these studies, the feature extraction process has a great impact on the classification process. Several studies also included methods such as AdaBoost, Naïve Bayes, Gaussian Naïve Bayes, LDA, logistic regression and random forest (RF) (Celin & Vasanth, 2018; Hassaballah et al., 2023; Matin Malakouti, 2023; Pandey et al., 2020). As these studies show, a key limitation is the need to extract features outside the models, which often prevents learning the temporal dependencies inherent in the time-sequence nature of ECG data.

Deep Learning-Based Approaches

The transition from traditional machine learning to deep learning-based approaches changed how features are extracted and fed into the model. Early work on neural networks for ECG classification included more signal processing steps. In one work, the authors used temporal features, morphological features and wavelets to classify ECG beat types on the MIT-BIH dataset with a simple neural network with one hidden layer, following an intra-patient method in which the train and test sets are taken from different segments of the same patient (Das & Ari, 2014). Another work uses RNN, Gated Recurrent Unit (GRU) and LSTM models for a simple binary classification task on the MIT-BIH dataset (Singh et al., 2018). 1D CNN architectures are also an efficient way of learning local temporal information in time series data. Tan et al. used CNNs along with LSTMs to detect coronary artery disease (CAD) with high accuracy (2018). Another work, by Yildirim, uses multiple levels of wavelet decomposition to extract features from ECG data.
These features are used with an LSTM model to classify the record into five different categories successfully on the MIT-BIH dataset (Yildirim, 2018). Deep neural networks are also successfully applied to classify ECG records. In one study, the authors carefully gathered and annotated 91,232 ECG records and used a 34-layer deep neural network (DNN) architecture to classify the records into 12 different diagnoses without using any advanced preprocessing steps such as Fourier or wavelet transforms (Hannun et al., 2019, p. 68). Specially annotated datasets are used extensively in this field of research. The work by Ribeiro et al. used more than 2 million labeled ECG recordings, called CODE, to estimate the diagnosis in 6 classes with high F1 scores using an end-to-end DNN model based on the ResNet architecture. A 15% subset of the dataset was then openly published as the CODE-15% dataset (Ribeiro et al., 2019). Some studies focus on the explainability of the models, since the behavior of neural network models is harder to interpret than that of traditional methods. One work combines 1D CNN, bidirectional LSTM (BiLSTM) and 2D CNN components within a network on the PTB-XL, CODE-15% and Chapman Arrhythmia datasets. The model is then analyzed with SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) and Grad-CAM++ (Chattopadhyay et al., 2017) to understand where the model looks in the ECG records (Ayano et al., 2024). Another work uses an AlexNet-based network to classify the beats of the MIT-BIH dataset into 5 different beat types with different strategies such as early, intermediate and late fusion. The work uses scalograms and phasograms to extract features from the recordings. The results are analyzed to understand which parts of the scalograms and phasograms the model attends to when classifying the beat (Scarpiniti, 2024).

Transformer Architecture: Foundation and Evolution

The architectures evolved to a different phase after the breakthrough work of Vaswani et al.
which introduced the transformer architecture and carried the studies beyond traditional machine learning methods and earlier deep learning-based models (2017). The model combines multiple concepts under the same architecture, such as the encoder-decoder structure, self-attention mechanism, multi-head attention, positional encoding, layer normalization, residual connections and feed-forward networks. It combines these known methods in a way that creates a robust model in which the tokens of a sequence attend to each other and carry information through the sequence. Even though the transformer architecture has fundamentally remained the same, some modifications have been proposed and adopted as better alternatives to parts of the model. The first is the change from the simple positional embedding used in the original architecture to rotary positional embedding (RoPE), which integrates relative positional information directly into the self-attention mechanism through rotary transformations applied to the query and key vectors (Su et al., 2023). Another small change to the original architecture is the move from post-layer normalization to pre-layer normalization (Xiong et al., 2020). The transformer inspired a lot of work in the field of natural language processing (NLP) and created a new field of research around generative models. Even though the model was mainly used in NLP, its robust abilities also carried over to image processing and time-series work. Dosovitskiy et al. successfully applied the architecture to image recognition tasks, which created the branch of transformer architectures called vision transformers (ViT), specialized in understanding images.
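To illustrate the rotary positional embedding mentioned above, the following minimal sketch rotates consecutive pairs of a vector by position-dependent angles and then checks that the query-key dot product depends only on the relative offset between positions. It is a plain-Python illustration, not the implementation used in this thesis. The function names, the example vectors and the frequency base of 10000 are assumptions made for the example.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply a rotary positional embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by the angle
    position * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
query = [0.5, -1.0, 0.25, 2.0]
key = [1.5, 0.3, -0.7, 0.9]

# Both position pairs have the same offset (2), so the scores should match.
score_near = dot(rope(query, 3), rope(key, 1))
score_far = dot(rope(query, 10), rope(key, 8))
```

The two scores agree up to floating-point error, which is the sense in which the rotation encodes relative rather than absolute position inside the attention computation.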
Transformer Applications in ECG Classification

One of the first applications of the transformer architecture to beat classification on the MIT-BIH dataset uses the original transformer architecture and fuses the RR interval values into the model before the final linear layer. The ECG data is fed into the model by tokenizing it with a 1D CNN architecture (Yan et al., 2019). Other simple implementations also use the vanilla transformer to classify heartbeats (Akan et al., 2023). Classification of ECG recordings has been an attractive field of study among researchers. PhysioNet, which organizes the annual George B. Moody Challenges, organized a challenge called Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020 (Alday et al., 2021). In this competition, the first-placed team used a transformer-based model in which hand-crafted features are selected according to a random forest model’s feature importance values. Together with the age and sex features, 22 features in total are selected as “wide” features. The “deep” feature extraction process includes an embedding part with convolution layers, followed by the main transformer block where features are extracted from the raw ECG recording. These “wide” and “deep” features are then combined in a multi-label classification head to estimate the diagnoses (Natarajan et al., 2020). The same team submitted another transformer-based architecture, called the waveform transformer, to the 2021 PhysioNet/CinC challenge (Reyna et al., 2021). It splits multi-lead ECG recordings into equal-length segments and feeds the segments to the model together with their positional encoding, a method similar to the ViT embedding layer (Natarajan et al., 2021). In another study, the authors used a two-branch system where each branch had 1D CNNs to embed the data, which is then fed into the transformer model.
In one branch the raw ECG data is given and in the other branch the RR interval values are given to the model, and the two branches are then concatenated (Atiea & Adel, 2022, p. 358). Embedding the ECG recording with 1D CNN models has been used in another work. However, that work uses not only an encoder but also a decoder block, which receives learnable object queries as input. The work focuses on the beat classification task and uses the MIT-BIH dataset (Hu et al., 2022, p. 5). In a different study, ECG recordings are fed into the model with shifted time windows to save computational cost and let the model focus on local features in the recording. Moreover, the ECG recordings are split into patches to reduce the computation even more. Time windows of different lengths are then given to different transformer blocks and concatenated at the end (Cheng et al., 2023, p. 10). Training can also move from purely supervised to alternative schemes. According to one study, masking parts of the ECG data and training a transformer to reconstruct the signal helps the model classify the signal better through learned spatiotemporal representations (Hu et al., 2023). Similarly, a transformer architecture with CNN and denoising autoencoders (DAE) can be successfully applied to the heartbeat classification problem as an alternative to the usual supervised implementations (Xia et al., 2023). ViT-based models achieve strong results on ECG datasets thanks to their vision capabilities. For example, one study fine-tunes the DinoV2 model on images of 2-second ECG recordings from the CODE-15% dataset and reaches high scores on test sets (Singh et al., 2023). Another work utilizing the ViT structure uses recurrence plots, Gramian Angular Field and Markov Transition Field in combination for a simple binary classification of the ECG signal (Indrasiri et al., 2024). Capitalizing on domain knowledge is key in implementations.
One work uses separate embedding convolutional layers for limb and chest leads, and uses 2D convolutional layers to fuse the features learned from all leads and use them as the value part of self-attention (Ji et al., 2024). An interesting work by Fu et al. on using a generative pre-trained transformer (GPT) to classify ECG recordings shows the capability of an architecture that maps ECG records and their diagnosis text into the same latent space with a Contrastive Language-Image Pre-training (CLIP) like embedding (2024). In another work, a variation of the GPT model is used to train separate base models for ECG and photoplethysmography (PPG). The models predict the next time step autoregressively. The authors studied the interpretability of the models extensively to understand their behavior. Different self-attention heads specialize in important parts of the recordings such as R peaks, Q waves and P waves. The model is fine-tuned further for an atrial fibrillation classification task and then analyzed again to understand the attended tokens (Davies et al., 2024). Other works with generative structures forecast ECG records with different time series foundation models, which are then benchmarked for speed and performance (Ali et al., 2024). Special configurations of the transformer model can also be used to classify the diagnosis effectively. One work uses a bidirectional transformer with multi-scale convolutions and a special context-aware loss. It uses convolutions with different kernel sizes to embed the records, and the embeddings are further refined with a module that calibrates the channel-wise patterns. The calibrated feature map and its reversed form are fed into the bidirectional transformer for classification (El-Ghaish & Eldele, 2024).
In another study, the authors converted ECG recordings into images and used them as the primary training data, along with features extracted from raw signal values. They trained models based on 1D CNNs combined with GRUs or LSTMs, as well as 2D CNNs, ResNet, and Vision Transformers for arrhythmia classification, finding that the GRU and LSTM-based models outperformed both pretrained and scratch-trained ViT-based models (Apostol & Nutu, 2025). Foundation models for time series have been actively researched in recent years, such as Informer (Zhou et al., 2020), Reformer (Kitaev et al., 2020), FEDFormer (Zhou et al., 2022), PatchTST (Nie et al., 2022), Autoformer (Wu et al., 2021) and Crossformer (Zhang & Yan, 2023). An adaptation of this type of model to the medical field has been studied by Wang et al., where a foundation model for the medical field is trained with multi-channel patching and multi-granularity and afterwards tested on five different medical datasets, including an ECG dataset. Its performance on the PTB-XL dataset was not competitive with other time series foundation models (Wang et al., 2024).

Mixture-of-Experts Models

After the great success of transformer models in generative tasks, research in the large language model (LLM) field has moved to different architectures. Some work focused on alternative base architectures such as state-space models like Mamba (Dao & Gu, 2024). Other work focused on building upon the transformer model. One of the main architectures adopted by different LLM developers is the MoE model. MoE models are special architectures that introduce expert and router modules. The main study, by Shazeer et al., was published before the original transformer work and proposed a separate structure using LSTM modules with a router and experts.
The work proposed using a special module which routes each token with a trainable gating network to one or multiple experts, which are simple feed-forward layers. The module output is a sparse combination of the expert outputs. The number of experts to be activated is decided beforehand, and the activation rates are controlled by importance and load losses. The importance loss keeps the model from converging to a single expert, and the load loss ensures that the experts receive a similar number of tokens. The models trained with the MoE architecture achieved strong results in machine translation tasks (Shazeer et al., 2017, pp. 2, 5). MoE models were then merged with the transformer architecture. First, the training optimization of the architecture was studied by Lepikhin et al., who sharded the MoE experts across multiple graphics processing units (GPUs) to train a 600 billion parameter model (2020). Afterwards another work, departing from the original MoE work, proposed routing each token to only a single expert in an architecture called the Switch Transformer. This approach also introduced a capacity factor, which sets a fixed limit on the number of tokens each expert can process per batch. Tokens routed to an expert beyond this capacity limit are typically dropped, a mechanism designed to ensure balanced computational load and efficiency. MoE models allow only part of the parameters to be activated at inference, which lowers the time required to produce an output (Fedus et al., 2021). Zoph et al. refined the implementation further by introducing the z-loss to ensure the stability of training runs. The same work concluded that a capacity factor of 1.25 with top-2 expert routing is a successful configuration for sparse modeling in the LLM field (2022).
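The routing ideas summarized in this section can be sketched in a few lines of plain Python. The example below is a simplified illustration under stated assumptions, not the implementation of the cited works or of this thesis: tokens are scalars, experts are toy functions, router logits are supplied directly instead of coming from a trainable gating network, and no noise term is added. The importance measure uses the population variance, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Keep the k largest router logits and renormalise them with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = [0.0] * len(logits)
    for index, weight in zip(top, softmax([logits[i] for i in top])):
        gates[index] = weight
    return gates


def moe_output(token, logits, experts, k):
    """Sparse combination: only experts with a non-zero gate are evaluated."""
    gates = top_k_gate(logits, k)
    return sum(g * expert(token) for g, expert in zip(gates, experts) if g > 0.0)


def importance_cv_squared(batch_gates):
    """Squared coefficient of variation of per-expert gate mass over a batch."""
    n_experts = len(batch_gates[0])
    importance = [sum(g[e] for g in batch_gates) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    variance = sum((x - mean) ** 2 for x in importance) / n_experts
    return variance / mean ** 2


def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    """Token limit per expert: even share of the batch times the capacity factor."""
    return int(tokens_per_batch / n_experts * capacity_factor)


# Four toy experts and hypothetical router logits for one scalar token.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
gates = top_k_gate([2.0, 0.5, 2.0, -1.0], k=2)
output = moe_output(3.0, [2.0, 0.5, 2.0, -1.0], experts, k=2)
capacity = expert_capacity(tokens_per_batch=64, n_experts=8, capacity_factor=1.25)
```

In this sketch the importance measure is zero when every expert receives the same total gate weight and grows as routing collapses onto a few experts, which is why adding it to the training loss discourages single-expert solutions.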
Mixture-of-Experts for Time Series Classification

After the spread of MoE model implementations in the field of LLMs, such as Mixtral (Jiang et al., 2024), implementations in the time-series domain also became a relevant topic for researchers. One of the first studies on MoE-based models for time-series data focuses on forecasting tasks. These tasks range from weather forecasting to transportation. Three separate models are trained, the largest being a 2.4B parameter model with 1.1B activated parameters. The models are benchmarked against time-series forecasting foundation models and perform better on most datasets while remaining competitive in terms of activated parameter sizes (Shi et al., 2025).

Mixture-of-Experts Applications in ECG Analysis

Time-series based MoE models in the medical field have recently attracted the attention of researchers. The success of MoE models in the LLM field and experiments in the time-sequence domain have motivated research in the field. Han et al. contributed a broad study that focuses on a variable number of modalities and potentially missing inputs. The proposed architecture, FuseMoE, handles patient inputs such as vital signs, ECG signals, clinical notes and chest X-rays. The variability of the input types, which span time series, text and images, introduces several challenges that the authors address with a special MoE transformer architecture and gating function. The models are benchmarked on several datasets and perform better with the new gating function (2024, pp. 2–4). Another important work uses an MoE architecture on the CODE-15% dataset to classify ECG signals into 6 diagnoses. The important distinction of the study is that the gating network is pre-trained to classify the diagnoses into three distinct super categories, which helps the model distinguish related diagnoses. The MoE model achieved an 84.96% F1 score on the CODE-15% test set (Chaves et al., 2024).
Mixture-of-Experts Transformer for ECG Classification

Most work in the field of ECG classification approaches the problem with machine learning, deep learning, or transformer-based models. Classification tasks on ECG records usually involve classifying either beats or diagnoses. Beat classification requires simpler approaches and models, and the literature around it is mostly saturated with work that achieves high levels of accuracy. On the other hand, detecting diagnoses from ECG records requires more sophisticated approaches and models, since the problem is more complex and involves the whole ECG record. Also, the number of possible diagnoses exceeds the number of beat types. The literature around diagnosis detection is still developing, with different approaches being explored. The success of mixture-of-experts models in the field of LLMs has motivated researchers to apply them to different fields of work. However, for ECG classification tasks, there are currently only a few studies on MoE-based transformer models. In this work, a mixture-of-experts transformer model with external feature fusion is proposed for diagnosis classification of ECG records. To the best of the author's knowledge, this study focuses on an approach for which only limited research exists in the literature to date.

METHOD

In the present study, the problem of classifying diagnoses from ECG records is approached with a mixture-of-experts transformer model with external feature fusion. The work focuses on 26 different diagnosis types, annotated as multi-label targets, from six different datasets. The aim of the study is to classify an ECG record by using its 10-second window and demographic features such as age and gender.

Dataset

ECG classification is a common task that has been explored for a long time and has attracted interest among research communities.
The main encouragement and motivation for these studies has been the existence and prevalence of datasets in the domain. Unlike some fields where the absence of open-source datasets inhibits research efforts, in the healthcare field ECG records of patients from different countries and with different characteristics exist as open-source datasets. Since domain expertise is required to interpret, label and validate the records, datasets labeled by domain experts have immense value for researchers from different fields. One of the oldest and most used datasets in this field is the MIT-BIH Arrhythmia dataset, which is primarily used for beat classification (Moody & Mark, 2001). Another important dataset for the diagnosis classification task is the PTB dataset, used extensively in the early work in the field and still used today. Even though the number of samples in the dataset is small, it is one of the main resources that motivated studies (Bousseljot et al., 2009). Different parties and hospitals gathered such datasets for different purposes. The PhysioNet platform, which publishes different open-source datasets in the medical field, collected such datasets and formed a competition about the classification of ECG records. PhysioNet performs the collection, standardization, storage and provision of the datasets (Goldberger et al., 2000). The PhysioNet/Computing in Cardiology Challenge 2021 (Reyna et al., 2021) used multiple source databases, some open and some undisclosed for evaluation purposes. In this work, some of the available datasets are used for training and evaluation. All datasets are downloaded from the PhysioNet platform. The datasets used are the CPSC, CPSC-Extra (Ng et al., 2018), PTB-XL (Wagner et al., 2020), Georgia 12-lead ECG challenge (G12EC) (Alday et al., 2021), Chapman-Shaoxing (Zheng, Zhang, et al., 2020) and Ningbo (Zheng, Chu, et al., 2020) databases.
The competition also included the PTB and St. Petersburg INCART (Goldberger et al., 2000) datasets; however, because of their different sampling frequencies and low number of records, these datasets are excluded from the training and evaluation processes. The details of the datasets can be seen in Table 3.1. Each dataset contains only the ECG recordings, which vary in length, the demographic features of the patient, and the diagnosis labels. The headers and the records all follow the standard Waveform Database (WFDB) format (Moody, 2022). An example header can be seen in Figure A.1.

As mentioned before, the dataset labels are in multi-label format; a patient may have been diagnosed with multiple conditions at once. Different diagnoses can coexist in different combinations, which makes the problem a multi-label classification problem. In the WFDB headers, the labels are provided as standard SNOMED-CT (SNOMED International, n.d.) codes, each of which corresponds to a specific diagnosis.

Table 3.1 : Dataset information.

| Dataset Name | Number of Records | Average Duration (s) | Frequency (Hz) |
|---|---|---|---|
| PTB | 116 | 113.63 | 1000 |
| PTB-XL | 21593 | 10.00 | 500 |
| Chapman Shaoxing | 9708 | 10.00 | 500 |
| CPSC 2018 | 5274 | 15.40 | 500 |
| CPSC 2018 Extra | 1277 | 16.36 | 500 |
| Georgia | 9456 | 9.98 | 500 |
| Ningbo | 34468 | 10.00 | 500 |
| St. Petersburg INCART | 34 | 1800.00 | 257 |

The competition scores only a subset of the diagnoses. Diagnoses with a low number of occurrences across all datasets are excluded from the scoring, so that the focus is on diagnoses with a considerable number of samples. The original datasets contain a total of 103 diagnoses, of which only 30 are scored. Since some diagnoses have similar characteristics, the challenge organizers decided to merge them into a single diagnosis (Reyna & Sadr, 2021). The merged diagnoses can be seen in Table 3.2. After merging, 26 unique diagnoses remain.

Table 3.2 : Merged diagnoses groups.
| Merge Group | Original Label Name | SNOMED-CT Code |
|---|---|---|
| 1 | Complete Left Bundle Branch Block | 733534002 |
| 1 | Left Bundle Branch Block | 164909002 |
| 2 | Complete Right Bundle Branch Block | 713427006 |
| 2 | Right Bundle Branch Block | 59118001 |
| 3 | Premature Ventricular Contractions | 427172004 |
| 3 | Ventricular Premature Beats | 17338001 |
| 4 | Premature Atrial Contraction | 284470004 |
| 4 | Supraventricular Premature Beats | 63593006 |

Before the unscored diagnoses are filtered out, the datasets together include 88212 ECG records. After keeping only the 26 unique diagnoses, 6286 records are removed, leaving 81926 records with a total of 129,307 diagnosis labels. The distribution of labels is given in Table A.1. Each record has the age and gender of its patient in the header file. These demographic statistics and the number of distinct labels can be seen in Table 3.3 for each dataset.

Table 3.3 : Statistics of datasets.

| Dataset | Average Age | Number of Males | Number of Females | Distinct Number of Diagnoses |
|---|---|---|---|---|
| PTB | 45.82 | 85 | 31 | 5 |
| PTB-XL | 59.47 | 11229 | 10364 | 22 |
| Chapman-Shaoxing | 60.36 | 5485 | 4223 | 19 |
| CPSC 2018 | 61.96 | 3011 | 2263 | 6 |
| CPSC 2018 Extra | 66.72 | 751 | 526 | 20 |
| G12EC | 60.67 | 5094 | 4362 | 23 |
| Ningbo | 58.08 | 19479 | 14975 | 25 |
| St. Petersburg INCART | 58.26 | 20 | 14 | 9 |

Of these databases, PTB and St. Petersburg INCART are excluded, leaving 81776 ECG records. Of all the records, 60.7 % have a single label and 25.64 % have two labels. The distribution of the number of diagnoses by dataset can be seen in Figure A.2. The most frequent diagnosis across all datasets is sinus rhythm, which corresponds to a normal ECG record. However, a record labeled with sinus rhythm can still carry other labels. Some diagnoses occur together more often than others, whereas some rarely coexist. The co-occurrence of diagnoses can be seen in Figure 3.1.

Figure 3.1 : Co-occurrence heatmap of top 10 diagnoses.

Preprocessing

Clean training data generally improves the performance of a neural network.
Since the model learns from the patterns in the data, any inconsistencies in the data can lead to a flawed learned representation and cause the model to misinterpret the records. In the context of this study, the preprocessing of the ECG records is an important step in preparing the dataset for training. Filtering and standardization of the records, detection of outlier records, and preparation of the data for training are analyzed extensively.

After the dataset is retrieved from the PhysioNet database, the records are read with the scipy.io.loadmat function in Python, as instructed in the competition manual (Alday et al., 2021; Reyna et al., 2021). As described before, records differ in length and sampling frequency, which results in a different number of samples per record. The datasets selected for this study include only records sampled at 500 Hz; however, the length of the records varies. The ECG signals are collected from 12 different channels, as described in Section 1.1. These channels, called leads, form a 12-lead ECG signal.

To obtain a consistent input size, all records are padded or cropped. Since all of the records are sampled at 500 Hz, no resampling is performed. Most records in the dataset are 10 seconds long. Some records are longer or shorter, which results in a different number of samples per record. Therefore, a record shorter than 10 seconds is padded with zeros, and from a record longer than 10 seconds a randomly positioned 10-second window is cropped. This operation is performed on each channel, resulting in 5000 sample points in the time domain per record for all 12 signal channels.

Standardizing the number of sample points is crucial for feeding the records to the model. The more important part, however, is to obtain a stable record that is cleansed of noise from different sources.
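The padding and cropping step described above can be sketched as follows. This is a minimal illustration and not the thesis code: the function name `pad_or_crop`, padding at the end of the record, and sharing one random window across all leads are assumptions.

```python
import numpy as np

TARGET_LEN = 5000  # 10 s at 500 Hz, as stated in the text


def pad_or_crop(record, target_len=TARGET_LEN, rng=None):
    """Return a (leads, target_len) array from a (leads, n_samples) record.

    Shorter records are zero-padded at the end (the padding side is an
    assumption). Longer records are cropped with one random window that is
    shared by all leads, so the leads stay aligned in time.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_leads, n_samples = record.shape
    if n_samples < target_len:
        out = np.zeros((n_leads, target_len), dtype=record.dtype)
        out[:, :n_samples] = record
        return out
    if n_samples > target_len:
        start = int(rng.integers(0, n_samples - target_len + 1))
        return record[:, start:start + target_len]
    return record
```

Passing a seeded generator, for example `pad_or_crop(record, rng=np.random.default_rng(0))`, makes the random crop reproducible between runs.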
In the signal processing domain, noise removal is performed with signal filtering methods that are used extensively for frequency-based signals. As listed before, various sources can cause such noise in the signal (Clifford, 2006, p. 69). In the context of this study, denoising the ECG signals is a critical part of the preprocessing pipeline. Filters such as Butterworth, Infinite Impulse Response (IIR), and Finite Impulse Response (FIR) are commonly used in this step. The Butterworth filter is valued for its smooth frequency response and its ability to suppress high-frequency noise without introducing significant distortion (Butterworth, 1930). IIR filters, known for their efficiency and analog-based design, offer a practical solution for real-time filtering tasks. FIR filters, with their linear phase properties, are used in cases where phase integrity is essential. These filtering methods help ensure that the model is trained on cleaner and more consistent data, reducing the risk of misinterpretation caused by noise artifacts.

To select the best filtering and sampling strategy, different configurations were compared by their signal-to-noise ratios (SNR) on a subset of 100 diverse ECG records. For filtering, both FIR and Butterworth filters were tested with a range of parameters, including different cutoff frequencies, filter orders, and numbers of taps. The performance of these filters was assessed primarily with the SNR, which was calculated from power spectral density estimates of the signals. Regarding the sampling frequency, the effects of multiple rates (125 Hz, 250 Hz, 500 Hz, and 1000 Hz) were analyzed. Performance was judged on SNR, considering the trade-off with data size and computational load.
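As a rough illustration of the comparison described above, the sketch below applies an FIR band-pass filter with SciPy and estimates an SNR from a Welch power spectral density. The cutoff frequencies, the number of taps, the zero-phase application with `filtfilt`, and the definition of SNR as in-band power over out-of-band power are all assumptions made for the example; the thesis does not specify them at this point.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, welch

FS = 500  # Hz, sampling frequency of the selected datasets


def fir_bandpass(record, low_hz=0.5, high_hz=40.0, numtaps=101, fs=FS):
    """Zero-phase FIR band-pass on every lead of a (leads, samples) array.

    The cutoffs and numtaps are placeholder values, not the tuned ones.
    """
    taps = firwin(numtaps, [low_hz, high_hz], pass_zero=False, fs=fs)
    return filtfilt(taps, [1.0], record, axis=-1)


def band_snr_db(lead, band=(0.5, 40.0), fs=FS):
    """SNR in dB: Welch PSD power inside `band` over the power outside it.

    Treating everything outside the band as noise is an assumption.
    """
    freqs, psd = welch(lead, fs=fs, nperseg=1024)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return 10.0 * np.log10(psd[in_band].sum() / psd[~in_band].sum())
```

A candidate configuration could then be scored by averaging `band_snr_db` over the leads of each record in the evaluation subset, before and after filtering.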
In terms of filtering, while both FIR and Butterworth filters removed noise effectively, the FIR filter was ultimately selected for this application. The chosen FIR configuration, derived from the analysis that maximized SNR, was found suitable for preserving the critical morphological features of the ECG waveforms while attenuating unwanted noise. Based on an assessment of the trade-off between signal fidelity and data volume, a sampling frequency of 500 Hz was selected. This frequency provides a sufficiently detailed representation of the ECG signal for the intended tasks while keeping the computational load manageable. The final preprocessing pipeline therefore uses FIR filtering and a 500 Hz sampling frequency. The final FIR parameters, namely the number of taps and the cutoff values, are decided during training by checking different evaluation metrics and the literature, as described further below.

As a standard preprocessing step, each channel of a record is normalized separately with Z-score normalization. The normalized signal for the i-th lead at time t is calculated as in Equation 3.1, where μ_i and σ_i are the mean and standard deviation of the i-th lead and ε is a small positive value that prevents division by zero.

$$Z_{i,t} = \frac{X_{i,t} - \mu_i}{\sigma_i + \varepsilon} \tag{3.1}$$

As for the demographic features, the age values are normalized by dividing them by 100, assuming a maximum age of 100. The gender values are encoded as 0 for female and 1 for male. To handle missing age values, the mean age per dataset and gender is calculated and the missing values are replaced with these means. Some anomalies are handled with this approach.
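The per-lead normalization of Equation 3.1 and the demographic encoding can be sketched as below. The value of ε, the validity range used to detect implausible ages, and the helper names are assumptions for illustration; `fallback_mean_age` stands for the mean age of the record's dataset and gender, computed elsewhere.

```python
import numpy as np

EPS = 1e-8  # small positive constant; the exact value is an assumption


def zscore_per_lead(record, eps=EPS):
    """Apply Equation 3.1 to a (leads, samples) array, one lead at a time."""
    mu = record.mean(axis=1, keepdims=True)
    sigma = record.std(axis=1, keepdims=True)
    return (record - mu) / (sigma + eps)


def encode_demographics(age, sex, fallback_mean_age):
    """Return [age / 100, sex] with female -> 0 and male -> 1.

    Missing or implausible ages are replaced by `fallback_mean_age`.
    The accepted range (0, 120] is an assumed rule, not taken from the text.
    """
    age = float("nan") if age is None else float(age)
    if np.isnan(age) or age <= 0 or age > 120:
        age = float(fallback_mean_age)
    sex_code = 1.0 if str(sex).strip().lower().startswith("m") else 0.0
    return np.array([age / 100.0, sex_code])
```

With these rules, an age value of 300, 0, -1 or NaN is replaced by the group mean before scaling.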
The age anomalies handled in this way include 201 records in the PTB-XL dataset with an age value of 300, 211 records in the Ningbo dataset with an age value of zero, 206 records in various datasets with "NaN" as the age value, and 4 records in the CPSC dataset with an age value of -1. As a result, the age values of a total of 622 records are replaced with mean age values depending on their dataset and gender.

3.2.1 Outlier analysis

Another important step in increasing the quality of the dataset is to detect outlier records that contain measurement anomalies. These outliers can cause the model to learn wrong representations. Although the outliers still have labels, their characteristics do not reflect the usual appearance of the diagnosis. A multi-method approach is used to detect them: three different methods are applied, and a majority vote over the methods decides which records are eliminated. Some outlier examples can be seen in Figure 3.2.

First, all records are preprocessed with the parameters decided as described before. Two feature sets are then computed for each record: statistical features and frequency-based information. The statistical features are the mean, median, standard deviation, minimum, maximum, and the 25 % and 75 % percentiles for every lead. For the frequency-based information, the Power Spectral Density (PSD) of the record is calculated per lead with Welch's method. These features are fed into three distinct methods: the Z-score method, PCA and Isolation Forest.

Figure 3.2 : Outlier examples for different detection methods.

For the Z-score method, instead of the mean and standard deviation of the features, the median and the median absolute deviation (MAD) are used for the Z-score calculation of every feature. The MAD is calculated as the median of the absolute differences between each feature value and its respective median.
The modified Z-score, denoted as Z*, is calculated as in Equations 3.2-3.4, where x_{i,k} is the value of the k-th feature for the i-th record and N is the number of records. This calculation is based on the work of Iglewicz and Hoaglin (1993, p. 11), so that the Z-score is not affected by a few outliers. Since ECG records have periodic peaks that are much higher than the average value of the record, the modified Z-score is found to be a better alternative in this study.

$$\tilde{x}_k = \operatorname{median}\left(x_{1,k}, x_{2,k}, \ldots, x_{N,k}\right) \tag{3.2}$$

$$\mathrm{MAD}_k = \operatorname{median}\left(|x_{1,k} - \tilde{x}_k|, |x_{2,k} - \tilde{x}_k|, \ldots, |x_{N,k} - \tilde{x}_k|\right) \tag{3.3}$$

$$Z^{*}_{i,k} = 0.6745 \, \frac{|x_{i,k} - \tilde{x}_k|}{\mathrm{MAD}_k} \tag{3.4}$$

A record is flagged as an outlier by this method if more than 25 % of its features have a modified Z-score higher than the threshold of 3.

For the PCA score, the principal components of each feature set are calculated, and the number of components is chosen so that they explain at least 95 % of the variance of the features. The record features are then transformed and reconstructed with these components, and the reconstruction error is calculated as the mean squared error. If the Z-score of the error is higher than the threshold of 3, the record is flagged as an outlier.

Isolation Forest is an ensemble learning algorithm that provides an effective method for anomaly detection by isolating observations. The core principle is that outliers are few and different and are therefore more susceptible to isolation. The algorithm constructs a forest of random trees. For each tree, the data is partitioned by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Abnormal instances, being distinct, tend to be isolated on shorter paths within these trees, closer to the root. The expected proportion of outliers, called the contamination factor, is calculated according to the size of the subgroup being analyzed, with a maximum of 10 %, and the number of trees is set to 100.
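The detection rules above can be sketched with NumPy as follows. This is an illustration under assumptions, not the thesis implementation: the features are standardized before PCA, the reconstruction error is averaged per record, and the Isolation Forest votes are taken as a precomputed boolean array (they could come, for example, from scikit-learn's `IsolationForest` with 100 trees).

```python
import numpy as np


def modified_zscore_flags(features, z_thresh=3.0, frac_thresh=0.25):
    """Flag rows where more than `frac_thresh` of the features have a
    modified Z-score (Equations 3.2-3.4) above `z_thresh`."""
    med = np.median(features, axis=0)
    mad = np.median(np.abs(features - med), axis=0)
    mad = np.where(mad == 0, np.finfo(float).eps, mad)  # zero-MAD guard (assumed)
    z_star = 0.6745 * np.abs(features - med) / mad
    return (z_star > z_thresh).mean(axis=1) > frac_thresh


def pca_error_flags(features, var_target=0.95, z_thresh=3.0):
    """Flag rows whose PCA reconstruction error has a Z-score above `z_thresh`."""
    std = features.std(axis=0)
    x = (features - features.mean(axis=0)) / np.where(std == 0, 1.0, std)
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_target)) + 1  # smallest k reaching target
    recon = x @ vt[:k].T @ vt[:k]
    err = ((x - recon) ** 2).mean(axis=1)
    return (err - err.mean()) / (err.std() + 1e-12) > z_thresh


def majority_vote(*flag_arrays, min_votes=2):
    """Keep a flag only if at least `min_votes` methods agree."""
    return np.sum(flag_arrays, axis=0) >= min_votes
```

A record would then be removed when `majority_vote(z_flags, pca_flags, forest_flags)` is true for it, where `forest_flags` is the hypothetical Isolation Forest output.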
Since the characteristics of each diagnosis are different, the features need to be analyzed within each diagnosis group, to make sure that records are not eliminated because of the characteristics of their diagnosis rather than because of an abnormality. However, diagnosis groups with too few examples for the methods to be effective are pooled into a single group. The minimum number of examples for a diagnosis to be analyzed as its own group is 20. After outliers are detected with all of the methods over all the datasets, records that receive at least 2 out of 3 votes are eliminated from the dataset. In total 929 records are flagged as outliers and removed from the scored datasets. The outlier distribution per dataset can be seen in Table 3.4.

Table 3.4 : Outlier distribution for each dataset.

| Dataset | # of Normal Records | # of Outlier Records | Outlier Percentage |
|---|---|---|---|
| chapman_shaoxing | 10143 | 102 | 1.00 % |
| cpsc_2018 | 6778 | 93 | 1.35 % |
| cpsc_2018_extra | 3387 | 65 | 1.88 % |
| georgia | 10235 | 107 | 1.03 % |
| ningbo | 34501 | 385 | 1.10 % |
| ptb-xl | 21685 | 141 | 0.65 % |
| Grand Total | 86729 | 893 | 1.02 % |

The number of outliers discarded by the outlier removal process corresponds to nearly 1 % of all the records. The discard ratio is small enough not to affect the training process.

3.2.2 Label encoding

As the target of the model, the diagnosis information has to be provided to the model. A simple encoding approach is used to turn the labels of every record into a vector of 26 elements, representing the 26 diagnosis types. Since this is a multi-label problem, every target vector contains one or more positive labels. The positive labels are indicated with 1 in the target vector.

Feature Extraction

As stated in Section 2, most studies capitalize on extracting features from the raw records to enhance the performance of the models.
Even though the models capture the semantic relationships between the parts of a record or of any time-series sequence, providing external expert information guides the model toward the important parts of the record and thereby increases its performance. For ECG records, if every single time point is regarded as a token, some tokens are important indicators, namely the places an expert looks at to diagnose the patient. As conveyed earlier, ECG records have distinguishable points that are common to all records. These are the peaks and dips in the record called the P, Q, R, S and T points. Feeding these tokens directly to the model has been a primary method of using external information.

In this work, these tokens are detected with an ECG feature extraction library called NeuroKit2 (Makowski et al., 2021). This library extracts the useful features of the ECG record, which are the peaks, onsets and offsets of the points described above. The library is not limited to ECG and covers most biosignal processing and analysis tasks. For each record and each channel of the record, the following features are extracted separately: the onset, peak and offset of point P; the peak of point Q; the onset, peak and offset of point R; the peak of point S; and the onset, peak and offset of point T. These amount to 11 different features. Each feature is not necessarily a single point and can occur at multiple points along the record, as seen in Figure 3.3. R-peaks are detected with the default settings of the NeuroKit2 library, which is argued to be an efficient method (Brammer, 2020). In the experiments of this study, the method detected R-peaks successfully. The other feature points are also detected with the default arguments, which use the discrete wavelet transform (Martínez et al., 2004).
In the experiments for this study, the discrete wavelet transform method performed better than the alternatives, namely the continuous wavelet transform and the peak-based method.

Figure 3.3 : External features extracted from ECG record with NeuroKit2.

These features are computed for all channels and time tokens in the record; hence, each of the 5000 time tokens of a channel has 11 feature indicators. The feature vector consists of boolean values indicating the presence or absence of a feature point. Because these features are identical across all experimental runs, the NeuroKit2 features are precomputed for all of the datasets and stored in sparse matrix format to decrease training time.

Model Architecture

As the main objective of the study, a mixture-of-experts transformer model is used to classify the diagnoses. The model and training methods are adapted from the standard MoE architecture and from practices in important studies (Fedus et al., 2021; Shazeer et al., 2017). Each section of the model is described separately below. In general terms, the input of the model consists of the record itself with 12 channels for the 12 leads, the 11 features precomputed for each channel and time point, the demographic features, and the CLS token. The model output is the multi-label prediction logit vector for all 26 diagnosis classes. These logits are then converted to class predictions with optimized thresholds, which will be explained further. The preprocessing steps, feature extraction and overall details of the model architecture can be seen in Figure 3.4.

Figure 3.4 : Preprocessing pipeline and model architecture.
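The precomputed boolean feature input mentioned above can be illustrated for one channel as follows. The feature names, the dictionary of sample indices (for example produced by a delineation tool) and the coordinate-list storage are hypothetical choices for this sketch; the text only fixes 11 features, 5000 time tokens and a sparse storage format.

```python
import numpy as np

# Illustrative names; the text only fixes the number of features (11).
FEATURE_NAMES = [
    "P_onset", "P_peak", "P_offset", "Q_peak",
    "R_onset", "R_peak", "R_offset", "S_peak",
    "T_onset", "T_peak", "T_offset",
]


def build_feature_mask(indices_by_feature, n_samples=5000):
    """Build an (11, n_samples) boolean mask from sample indices per feature."""
    mask = np.zeros((len(FEATURE_NAMES), n_samples), dtype=bool)
    for row, name in enumerate(FEATURE_NAMES):
        idx = np.asarray(indices_by_feature.get(name, []), dtype=int)
        idx = idx[(idx >= 0) & (idx < n_samples)]  # drop out-of-range indices
        mask[row, idx] = True
    return mask


def to_sparse(mask):
    """Store only the coordinates of the True entries (assumed format)."""
    return np.argwhere(mask).astype(np.int32), mask.shape


def from_sparse(coords, shape):
    """Rebuild the dense boolean mask from stored coordinates."""
    mask = np.zeros(shape, dtype=bool)
    mask[coords[:, 0], coords[:, 1]] = True
    return mask
```

Because only a handful of the 5000 positions are active per feature, storing coordinates is far smaller than storing the dense mask, which is the motivation for precomputing and caching the features.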
3.4.1 Signal projection and early feature fusion

To incorporate the information from the external features of the ECG records, an early fusion method is used that fuses the features directly with the records. To enrich the information channels of both the signal input and the external features, multiple 1D CNN layers are included in the model. This multi-layered CNN architecture allows the model to use different kernel sizes to learn fine-grained information in the time domain and to increase the number of channels in which the information is carried. For both the signal and the external features, the 1D CNNs are three-layer structures. The ECG signal layers have consecutive kernel sizes of 51, 31 and 11, and channel sizes of 32, 64 and 128. The external feature layers have consecutive kernel sizes of 201, 51 and 21, and channel sizes of 16, 32 and 64. The kernel sizes for the external features are kept large because the input is a sparse array that is active only at peaks, onsets and offsets. The channel size is larger for the signal branch since the signal inherently carries more information. Padding is applied so that these layers do not alter the time dimension. Each 1D CNN layer is followed by a batch normalization layer and a leaky Rectified Linear Unit (ReLU) activation function.

To merge the two branches, the signal information and the external features are concatenated. This concatenation creates a new signal matrix with 5000 time tokens and 192 feature dimensions, 128 from the ECG signal branch and 64 from the external feature branch. After the concatenation, a projection layer scales the signal input to the model dimension. The projection layer takes the concatenated matrix as input and, with a linear layer, outputs a matrix with the shape of model dimension by sequence length. This projection aligns the signal features with the common model dimension used throughout the model.
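The shape bookkeeping of the fusion step can be checked with a small NumPy sketch. It assumes a stride of 1 and symmetric zero padding of (k - 1) / 2 for an odd kernel size k, which is one common way to keep the time dimension unchanged, and it uses the model dimension of 96 reported later in this section. The random arrays stand in for the CNN outputs and the function names are illustrative.

```python
import numpy as np

SEQ_LEN, D_MODEL = 5000, 96


def same_padding(kernel_size):
    """Zero padding per side that preserves length (odd kernel, stride 1)."""
    return (kernel_size - 1) // 2


def conv_out_len(n, kernel_size, padding):
    """Output length of a stride-1 1D convolution."""
    return n + 2 * padding - kernel_size + 1


def fuse_and_project(signal_feat, ext_feat, weight, bias):
    """Concatenate both branches on the feature axis and project to d_model."""
    fused = np.concatenate([signal_feat, ext_feat], axis=-1)  # (5000, 192)
    return fused @ weight + bias                              # (5000, 96)


rng = np.random.default_rng(0)
signal_feat = rng.normal(size=(SEQ_LEN, 128))  # stand-in for signal CNN output
ext_feat = rng.normal(size=(SEQ_LEN, 64))      # stand-in for feature CNN output
weight = rng.normal(size=(192, D_MODEL)) * 0.02
bias = np.zeros(D_MODEL)
projected = fuse_and_project(signal_feat, ext_feat, weight, bias)
```

With these settings every kernel size listed in the text (51, 31, 11, 201, 51, 21) leaves the 5000 time tokens unchanged, and `projected` has shape (5000, 96).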
The signal features are then concatenated with the output of a similar projection layer, which projects the demographic features of the patient to the model dimension as a single vector along the time axis. Finally, a CLS token is added to the input by concatenating it along the time axis. The input therefore has a feature dimension of 96, equal to the model dimension, and a time dimension of 5002. The CLS token carries the information for the classification decision of the whole model. Conveying this information with a single token allows the model to compress the information about the classification decision into a single vector instead of spreading it over all the time sequence tokens. As a result, all signal information and external features are fused and fed into the model.

3.4.2 Transformer

The core model architecture is similar to the original transformer architecture, with small changes that adopt modern implementations of some methods. The original model uses an encoder-decoder structure, since the translation task requires a generative output. However, in tasks such as embedding or general feature extraction from any type of information, using only the encoder block is sufficient. A well-known example of such encoder-only models is Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018).

The transformer model is composed of 6 layers, 3 of which are normal encoder layers and the rest of which are mixture-of-experts encoder blocks. The normal encoder blocks take the input described above and normalize it with pre-layer normalization. Pre-layer normalization applies the layer normalization before the self-attention block, instead of the post-layer normalization used in the original work, and is commonly used to improve the performance and stability of training (Xiong et al., 2020).
The layer normalization follows the implementation of Ba, Kiros and Hinton (2016). Detailed depictions of the methods used can be seen in Figure 3.5.

After the input is normalized, it is forwarded to a multi-head self-attention layer, where each token attends to the other tokens and the amount of information that flows to each token is computed from query, key and value vectors. Each token first computes its attention score with respect to every token by taking the dot product of its own query vector with the key of every token in the sequence, including itself. This computation results in a vector of attention logits, which is then scaled by dividing by the square root of the dimension of the key vector. The scaling stabilizes the gradients during training. A softmax function transforms these attention scores into probability values. These probabilities represent how much each token attends to the other tokens in the time sequence, which is important for understanding the relationships in the data.

Figure 3.5 : Details of the transformer architecture, inputs and outputs.

Since this model has two extra tokens, the demographic feature token and the CLS token, these also attend to each time token and are attended to. This information flow is used within the model to update the classification prediction and to inject the demographic feature information.

In the original transformer work, fixed sinusoidal absolute positional embeddings are used to embed the positional information of the tokens. However, different authors have suggested and implemented variations of positional embeddings to make the representation of positions more stable. Absolute embedding is considered insufficient in long-context scenarios. Rotary Position Embedding (RoPE), which has been widely adopted as a standard, is used in this work to embed the positional information. This method alters the query and key vectors by rotating them before the attention scores are calculated.
Rotating the query and key vectors provides position information between the tokens relative to each other without introducing learnable parameters. Unlike absolute positional embeddings, RoPE is reported to degrade less in long contexts, and it encodes position with periodic functions, which is a potential benefit in this task with its high number of tokens and periodic signal characteristics.

After the positional information is embedded and the attention scores are calculated, a softmax function is applied to these scores to obtain attention probabilities. These probabilities are then matrix-multiplied with the value vectors of the tokens, which results in a weighted sum of the value vectors with the attention probabilities as weights. The mathematical operations of RoPE and self-attention are given in Equation 3.5, where Q'_h and K'_h represent the query and key matrices after RoPE is applied. V_h represents the value matrix, which is not altered. The head dimension is denoted by d_k, which is the model dimension divided by the number of heads. The RoPE implementation follows the study of Su et al. (2023).

$$\mathrm{Head}_h = \operatorname{softmax}\left(\frac{Q'_h \, (K'_h)^{T}}{\sqrt{d_k}}\right) V_h \tag{3.5}$$

This operation is done in multiple branches called heads, which allow the model to learn information in multiple representation spaces. Each head has its own learnable linear projection matrices to compute the query, key and value vectors, so the heads carry information in different ways and attend to different parts of the time sequence. After the multi-head self-attention block, a residual connection carries the original input information forward without change. This helps the model retain access to the original input. After the residual connection, another layer normalization is applied, followed by a feed-forward neural network (FFN), which is a two-layer network that maps from the model dimension to a hidden dimension and back to the model dimension. The hidden dimension, as a common practice, is 4 times the model dimension.
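A single attention head with RoPE, as in Equation 3.5, can be sketched in NumPy as below. The rotation of consecutive feature pairs and the frequency base of 10000 follow the common RoPE formulation and are assumptions about this model's configuration; how the CLS and demographic tokens are positioned is not covered here.

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, d) by an angle that
    grows linearly with the token position; d must be even."""
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)              # (d/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_head(q, k, v):
    """Equation 3.5: softmax(Q' K'^T / sqrt(d_k)) V, with Q', K' after RoPE."""
    d_k = q.shape[-1]
    scores = rope(q) @ rope(k).T / np.sqrt(d_k)
    return softmax(scores) @ v
```

Because each pair is only rotated, the vector norms are preserved, and the score between positions m and n depends only on their offset m - n, which is the relative-position property referred to in the text.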
After the first FFN layer a leaky ReLU activation is applied. A second residual connection is then formed from the output of the first addition step. This series of operations forms a block called an encoder. There are 3 consecutive encoder blocks. Increasing the depth increases the amount of computation and the number of parameters of the model, which helps the model store more information.

3.4.3 Mixture-of-experts

For the remaining 3 encoder blocks, a special encoder block is used that has the same architecture but replaces the FFN with a different module, called the mixture-of-experts module. An MoE module is composed of several experts and a router that routes the incoming tokens to one or more experts. The MoE approach has been used in transformer architectures and has been found beneficial for performance in language models. The layer normalizations, residual connections and self-attention blocks are the same as in the other encoder blocks.

Figure 3.6 : Internal structure of mixture-of-experts module.

In the MoE module, the incoming tokens are first routed by a router gate. The router uses a simple learnable linear layer that projects from the model dimension to the number of experts to calculate the router logits. For every token, the top-k experts with the highest routing logits are then marked. To route the tokens to the experts, dispatch tensors are created from these markings. Each dispatch tensor acts as a mask that ensures the correct assignment of tokens to experts. The MoE module is depicted in Figure 3.6.

In this study, top-1 routing is used; however, other routing options were also tested. For top-k routing options other than top-1, the routing logits normalized with a softmax function are used to form the dispatch tensor, which then acts as gating weights for each expert and each token. Each expert is a simple FFN with the same dimensions as in the normal encoder blocks.
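The top-1 routing described above can be sketched in NumPy as follows. It is a simplified illustration: no expert capacity limit is applied, the expert output is not scaled by a gate weight (following the description that gating weights are used for routing options other than top-1), and the leaky ReLU slope of 0.01 and all function names are assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """Leaky ReLU; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)


def top1_route(tokens, router_w):
    """Return router logits, chosen expert per token and a one-hot dispatch mask."""
    logits = tokens @ router_w                     # (n_tokens, n_experts)
    expert_idx = logits.argmax(axis=-1)            # top-1 expert per token
    dispatch = np.eye(router_w.shape[1], dtype=bool)[expert_idx]
    return logits, expert_idx, dispatch


def moe_forward(tokens, router_w, experts):
    """Send each token through its selected expert FFN.

    `experts` is a list of (w1, b1, w2, b2) tuples, one two-layer FFN each.
    """
    _, expert_idx, dispatch = top1_route(tokens, router_w)
    out = np.zeros_like(tokens, dtype=float)
    for e, (w1, b1, w2, b2) in enumerate(experts):
        selected = dispatch[:, e]                  # mask of tokens for expert e
        if selected.any():
            hidden = leaky_relu(tokens[selected] @ w1 + b1)
            out[selected] = hidden @ w2 + b2
    return out, expert_idx
```

Each row of the dispatch mask contains exactly one true entry, so every token is processed by a single expert and the remaining experts stay inactive for that token.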
Intuitively, the experts act as independent FFNs that are sparsely activated, which gives the model a dropout-like structure across the different FFN modules.

To encourage tokens to be routed uniformly to the experts, an auxiliary loss is included in the loss function. It has two components: the load balancing loss and the z-loss. The load balancing loss is responsible for creating an equally distributed load across all experts. During training, the distribution of tokens over the experts in each batch is compared with a uniform distribution, and the Kullback-Leibler (KL) divergence between them is calculated. The greater the discrepancy between the uniform distribution and the model's expert distribution, the greater this loss becomes. This loss component was first introduced by Shazeer et al. (2017); unlike their work, however, this study uses the KL divergence instead of the coefficient of variation. The other component is the z-loss, introduced by Zoph et al. (2022) to regularize the routing logits. The z-loss is calculated by taking the log of the sum of exponentials of each token's router logits, squaring it, and averaging over the tokens. This component acts as a regularization loss that penalizes large router logits and keeps them small. It helps prevent the model from assigning an extreme routing probability to a single expert and causing an expert collapse. Each of the two loss components is multiplied by its own coefficient and added to the model loss. These coefficients are tuned so that the model does not depend excessively on the routing losses and keeps its focus on the classification loss, while the token distribution among the experts stays close to equal.

Within each step of the model a dropout layer is used for regularization and to prevent overfitting (Srivastava et al., 2014). Dropout is applied after each self-attention layer, after each FFN layer, and to the input of each encoder block.
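The two auxiliary terms can be sketched as below. The z-loss follows the definition in Zoph et al. (2022). The load-balancing term is written as the KL divergence from an equal (uniform) load, which is an interpretation of the description above, and the direction of the divergence, the use of hard token counts and the coefficient values are assumptions; a trainable version would typically use the soft router probabilities so that gradients can flow.

```python
import numpy as np


def router_z_loss(logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(lse ** 2))


def load_balance_kl(expert_idx, n_experts, eps=1e-9):
    """KL(observed expert load || uniform load) for one batch of tokens."""
    counts = np.bincount(expert_idx, minlength=n_experts).astype(float)
    load = counts / counts.sum()
    uniform = 1.0 / n_experts
    return float(np.sum(load * np.log((load + eps) / uniform)))


def total_loss(cls_loss, logits, expert_idx, n_experts,
               z_coef=1e-3, lb_coef=1e-2):
    """Classification loss plus weighted auxiliary terms (placeholder weights)."""
    return (cls_loss
            + z_coef * router_z_loss(logits)
            + lb_coef * load_balance_kl(expert_idx, n_experts))
```

With perfectly balanced routing the KL term is approximately zero, and it reaches log(number of experts) when every token is sent to the same expert, which is the collapse case the auxiliary loss is meant to discourage.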
Three consecutive MoE encoder blocks are formed, and the output of the last encoder block is taken as the output of the transformer block. The transformer block produces an output embedding whose shape is the number of time tokens plus two by the model dimension. The last element along the first dimension is the CLS token, which represents the classification result. A linear layer maps this token from the model dimension to the number of classes. The logits produced by this linear layer are the final predictions of the model for each class, and they are used to calculate the classification loss by comparison with the ground-truth diagnosis labels.

In total the model has 1,642,046 parameters; however, for any MoE model, the number of active parameters depends on how many experts are active for each token. Since only a single expert per MoE layer is active in the inference phase, an active parameter count is also reported. Only the MoE expert layers are sparse; common layers such as the 1D CNN layers and the normal transformer encoder blocks are always active. In the inference phase, the model has 973,070 active parameters, which corresponds to 59.3 % of the parameters of the whole model.

Training

Training of the networks that are large in terms of number of parameters requires great attention to stability issues while making sure the model is learni