ISTANBUL TECHNICAL UNIVERSITY  GRADUATE SCHOOL 

M.Sc. THESIS 

JULY 2025 

MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL  
USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL 

Atalay ÇELİK 

Department of Data Engineering and Business Analytics 
 

Big Data and Business Analytics Programme 
 


ISTANBUL TECHNICAL UNIVERSITY  GRADUATE SCHOOL 

MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL  
USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL 

M.Sc. THESIS 

ATALAY ÇELİK 
 (528211093) 

Thesis Advisor: Asst. Prof. Mehmet Ali ERGÜN 


Veri Mühendisliği ve İş Analitiği Anabilim Dalı 
 

Büyük Veri ve İş Analitiği Programı 

 
TEMMUZ 2025 

İSTANBUL TEKNİK ÜNİVERSİTESİ  LİSANSÜSTÜ EĞİTİM ENSTİTÜSÜ 

UZMANLARIN KARIŞIMI BAZLI DÖNÜŞTÜRÜCÜ MODELİ İLE 12 
KANALLI EKG SİNYALİNİN ÇOK ETİKETLİ SINIFLANDIRILMASI 

 
YÜKSEK LİSANS TEZİ 

Atalay ÇELİK 
(528211093) 

Tez Danışmanı: Dr. Öğr. Üyesi Mehmet Ali ERGÜN 
 


 
Thesis Advisor : Asst. Prof. Mehmet Ali ERGÜN
                 Istanbul Technical University

Jury Members : Assoc. Prof. Dr. Ömer Faruk BEYCA
               Istanbul Technical University

               Asst. Prof. Merve ŞAHİN
               Ibn Haldun University

Atalay Çelik, an M.Sc. student of the ITU Graduate School with student ID 528211093,
successfully defended the thesis entitled “MULTI-LABEL CLASSIFICATION OF 12-LEAD
ECG SIGNAL USING A MIXTURE-OF-EXPERTS TRANSFORMER MODEL”, which he prepared
after fulfilling the requirements specified in the associated legislation, before the
jury whose signatures are below.
 

Date of Submission : 30 May 2025 
Date of Defense : 1 July 2025 
 


 
To my dear wife and family, 

 

FOREWORD 

Because large language models (LLMs) are my main area of interest, I have researched
and worked extensively with transformer architectures. I have been curious about
applying this architecture to fields other than LLMs, and time-series domains were
one of the fields I wanted to explore.

Although healthcare and signal processing are not my areas of expertise, 12-lead ECG
classification caught my attention as a possible application. Adapting to these new
areas within a strict time frame has been challenging, and my lack of expertise in
them slowed me down during the study. Despite these challenges, I am happy with the
results. Working on a model architecture from the dataset level to the training level
has been a valuable learning experience and a tough challenge.

I want to thank my advisor, who helped me with the topic selection and advised me
when I struggled with specific tasks in my thesis.

The thesis, which has now come to an end, became a part of my marriage. I especially
want to thank my wife, who has been supportive and patient towards me and my thesis.
I also want to thank my family, who have been encouraging me from afar.

 
May 2025 
 

Atalay ÇELİK 
 


TABLE OF CONTENTS 

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET

1. INTRODUCTION
    1.1 Information About Electrocardiogram
    1.2 Diagnostic Challenges
    1.3 Machine Learning-Based Approaches
    1.4 Transformer-Based Models
    1.5 Purpose of the Thesis

2. LITERATURE REVIEW
    2.1 Datasets
    2.2 Traditional Machine Learning Approaches
    2.3 Deep Learning-Based Approaches
    2.4 Transformer Architecture: Foundation and Evolution
    2.5 Transformer Applications in ECG Classification
    2.6 Mixture-of-Experts Models
    2.7 Mixture-of-Experts for Time Series Classification
    2.8 Mixture-of-Experts Applications in ECG Analysis
    2.9 Mixture-of-Experts Transformer for ECG Classification

3. METHOD
    3.1 Dataset
    3.2 Preprocessing
        3.2.1 Outlier analysis
        3.2.2 Label encoding
    3.3 Feature Extraction
    3.4 Model Architecture
        3.4.1 Signal projection and early feature fusion
        3.4.2 Transformer
        3.4.3 Mixture-of-expert
    3.5 Training
        3.5.1 Evaluation metrics
            3.5.1.1 Macro and micro AUC
            3.5.1.2 Top-K accuracies
            3.5.1.3 Precision, recall and F1 score
            3.5.1.4 Exact match
            3.5.1.5 PhysioNet challenge metric
        3.5.2 Experimentation

4. RESULTS
    4.1 Results of the Proposed Model
    4.2 Comparisons
        4.2.1 BiLSTM
        4.2.2 CNN
        4.2.3 Transformer
        4.2.4 Comparison results

5. CONCLUSIONS AND RECOMMENDATIONS
REFERENCES
APPENDICES
    APPENDIX A: Dataset Information
CURRICULUM VITAE
 
 

ABBREVIATIONS 

AI : Artificial Intelligence 
ANN : Artificial Neural Network 
AUC-ROC : Area Under The Curve - Receiver-Operating Characteristic Curve 
BCE : Binary Cross Entropy 
BERT : Bidirectional Encoder Representations From Transformers 
BiLSTM : Bi-Directional LSTM 
CAD : Coronary Artery Disease 
CLIP : Contrastive Language-Image Pre-Training 
CNN : Convolutional Neural Networks 
CPSC : The China Physiological Signal Challenge 
CVD : Cardiovascular Disease 
DAE : Denoising Autoencoders 
DNN : Deep Neural Network 
ECG : Electrocardiogram 
FFN : Feed Forward Neural Network 
FIR : Finite Impulse Response 
G12EC : Georgia 12-Lead ECG Challenge 
GELU : Gaussian Error Linear Unit 
GPT : Generative Pre-Trained Transformer 
GPU : Graphics Processing Units 
GRU : Gated Recurrent Unit 
IIR : Infinite Impulse Response 
KL : Kullback-Leibler 
k-NN : K-Nearest Neighbour 
LDA : Linear Discriminant Analysis 
LLM : Large Language Model 
LSTM : Long Short Term Memory 
MAD : Mean Absolute Deviation 
ML : Machine Learning 
MoE : Mixture-Of-Experts 
NLP : Natural Language Processing 
PCA : Principal Component Analysis 
PPG : Photoplethysmography 
PSD : Power Spectral Density 
ReLU : Rectified Linear Unit 
RF : Random Forest 
RNN : Recurrent Neural Networks 
RoPE : Rotary Positional Embedding 
SHAP : Shapley Additive Explanations 
SNR : Signal-To-Noise Ratios 
SVM : Support Vector Machine 
ViT : Vision Transformers 
WFDB : Waveform Database 



SYMBOLS 

d_k : Head Dimension
K_h : Key Matrix
mV  : Millivolts
Q_h : Query Matrix
V_h : Value Matrix

  

LIST OF TABLES 

Table 3.1 : Dataset information.
Table 3.2 : Merged diagnoses groups.
Table 3.3 : Statistics of datasets.
Table 3.4 : Outlier distribution for each dataset.
Table 4.1 : Evaluation results of proposed model for test and validation set.
Table 4.2 : Number of records for 5 most frequent diagnoses and F1 scores.
Table 4.3 : Number of records for 5 least frequent diagnoses and F1 scores.
Table 4.4 : Evaluation metrics for model comparisons.
Table A.1 : Class distribution.
 
  

LIST OF FIGURES 

Figure 1.1 : Annotated ECG signal waveforms (Rawshani, n.d.-a).
Figure 3.1 : Co-occurrence heatmap of top 10 diagnoses.
Figure 3.2 : Outlier examples for different detection methods.
Figure 3.3 : External features extracted from ECG record with Neurokit2.
Figure 3.4 : Preprocessing pipeline and model architecture.
Figure 3.5 : Details of the transformer architecture, inputs and outputs.
Figure 3.6 : Internal structure of mixture-of-experts module.
Figure 3.7 : Comparison of total loss and F1 scores for model dimensions.
Figure 3.8 : Comparison of auxiliary loss and F1 scores for number of experts.
Figure 3.9 : Comparison of total loss and F1 scores for different projections.
Figure 3.10 : Comparison of total loss and F1 scores for learning rates.
Figure 3.11 : Dead neuron percentage plots for different encoder layers.
Figure A.1 : Example header file.
Figure A.2 : Distribution of number of diagnoses by dataset.
  
 

MULTI-LABEL CLASSIFICATION OF 12-LEAD ECG SIGNAL USING A 
MIXTURE-OF-EXPERTS TRANSFORMER MODEL  

SUMMARY 

The electrocardiogram (ECG) measures the electrical activity of the heart and is an
important indicator for the detection of cardiac abnormalities. In general, ECG signals
are measured with 10 electrodes, and 12 different leads are derived from these
measurements. Each lead views the heart from a different angle and therefore captures
different information about the cardiac rhythm. Some diagnoses can be made by
analyzing these ECG records in detail, looking at specific characteristics of the signal
such as the peaks and the transitions between peaks. Abnormal patterns in these signals
can only be detected by experts in the domain, and even for them interpretation
remains challenging in practice.

Automating this process has interested researchers for a long time. Most early work
focuses on peak detection and beat classification. Advances in machine learning and
deep learning have brought these tasks to maturity, with implementations that achieve
high accuracy and effectiveness. Automatic diagnosis is a more complex problem and is
still being actively studied. Competitions such as the PhysioNet CinC challenge have
targeted this problem for several years and have attracted great interest and produced
strong results. The availability of ECG datasets has accelerated this research, and
several high-quality datasets are openly available for research purposes.

As the transformer architecture has become prevalent through advances in large
language models, new approaches to the problem have emerged. Several studies
implement transformer-based models for the ECG classification task. The attention
mechanism built into the model and its high computational capacity offer great
potential for signal processing.

The use of these models for time-series data is still scarce and under development.
Deep learning methods such as long short-term memory and gated recurrent units remain
dominant. Even though strong studies have been carried out with the transformer
architecture, they mostly focus on forecasting in fields such as finance and weather.

Another branch of transformer-based models is the mixture-of-experts approach, in
which multiple experts are introduced within the model and their activation is
controlled based on the incoming data. At the time of this study, implementations of
this kind of model are scarce in the literature.

This study aims to implement a mixture-of-experts based transformer model that
classifies ECG records into multiple diagnoses in a multi-label manner. Each record
contains 10 seconds of ECG signal sampled at 500 Hz, together with demographic
features. There are 26 different diagnosis labels in total, and each record has one or
more labels attached to it. These labels are the prediction targets. In this study,
81926 labeled ECG records from six different datasets are used.
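The record dimensions and the multi-label target described above can be illustrated with a minimal sketch. The constants come from the summary; the function name `multi_hot` and the use of integer class indices are illustrative assumptions, since the actual label encoding is described later in the thesis and may differ.

```python
# Constants stated in the summary.
SAMPLING_RATE_HZ = 500
DURATION_S = 10
NUM_LEADS = 12
NUM_CLASSES = 26

# Each lead holds 500 Hz * 10 s = 5000 samples; a record has 12 such leads.
samples_per_lead = SAMPLING_RATE_HZ * DURATION_S
values_per_record = samples_per_lead * NUM_LEADS


def multi_hot(label_indices, num_classes=NUM_CLASSES):
    """Encode a set of diagnosis indices as a 0/1 vector (hypothetical helper)."""
    vector = [0] * num_classes
    for index in label_indices:
        if not 0 <= index < num_classes:
            raise ValueError(f"label index {index} is out of range")
        vector[index] = 1
    return vector


# A record carrying two diagnoses (the indices are arbitrary placeholders).
target = multi_hot([0, 3])
```

Because a record can carry several diagnoses at once, the target is a vector with one independent 0/1 entry per class rather than a single class index.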



For preprocessing, a digital signal filtering approach based on finite impulse response
filtering is used and each record is normalized. Different preprocessing configurations
are tested and the best parameter set is chosen according to the signal-to-noise ratio.
For the outlier analysis, a voting system is used with three different methods: Z-score,
principal component analysis and isolation forest. Records that receive 2 out of 3
votes are removed from the dataset.
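The voting rule can be illustrated with a short sketch. It assumes that each detector has already produced one boolean flag per record. The helper names `zscore_flags` and `vote_outliers`, the threshold of 3.0, and the reading of the rule as "at least 2 of 3 votes" are illustrative assumptions rather than the thesis implementation, and the PCA and isolation forest flags are simply given as inputs here.

```python
from statistics import mean, pstdev


def zscore_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold (assumed value)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def vote_outliers(*detector_flags, min_votes=2):
    """Mark a record as an outlier when at least `min_votes` detectors flag it."""
    n_records = len(detector_flags[0])
    if any(len(flags) != n_records for flags in detector_flags):
        raise ValueError("all detectors must flag the same number of records")
    return [sum(votes) >= min_votes for votes in zip(*detector_flags)]


# Toy flags for four records from three detectors (placeholder values).
z_flags = [False, True, True, False]
pca_flags = [False, True, False, False]
iforest_flags = [True, True, False, False]

removed = vote_outliers(z_flags, pca_flags, iforest_flags)
# removed == [False, True, False, False]
```

Requiring agreement between detectors keeps a record that only one method considers unusual, which reduces the risk of discarding rare but valid recordings.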

Another important step is to extract external features from the ECG records to feed
into the model. In this study, several methods are used to extract features such as
wave peaks and offsets.

The model is constructed by feeding in the signal values from the 12 ECG leads and the
extracted signal features. These features are then concatenated with the demographic
features. The signals and the features are passed through 1D convolutional layers to
enhance time-dependent features, and all of them are projected into the model
dimension. The main model block consists of standard encoder blocks with
self-attention layers, pre-layer normalization and skip connections. After three
standard encoder blocks, three encoder blocks with mixture-of-experts modules are
used. The transformer attends to important information across time tokens. The
mixture-of-experts modules have gating networks that route the time tokens to
different experts depending on their characteristics.
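Top-1 routing can be sketched for a single time token as follows. This is a simplified illustration, not the thesis code: the linear gate, the scaling of the expert output by the gate probability, the function names and the toy experts are all assumptions, and the auxiliary load-balancing loss is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, gate_weights):
    """Return (expert index, gate probability) for one token.

    gate_weights holds one weight vector per expert (assumed linear gate).
    """
    logits = [sum(w * x for w, x in zip(weights, token)) for weights in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_forward(token, gate_weights, experts):
    """Send the token to its single selected expert and scale by the gate value."""
    index, prob = route_top1(token, gate_weights)
    expert_output = experts[index](token)
    return [prob * y for y in expert_output], index


# Four toy experts that scale their input by 1, 2, 3 and 4 (placeholders).
experts = [lambda t, k=k: [k * v for v in t] for k in range(1, 5)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [0.2, 0.9]

output, chosen = moe_forward(token, gate_weights, experts)
```

Only the chosen expert is evaluated for each token, so the compute per token stays close to that of a single feed-forward network even though the module holds four sets of expert parameters.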

The training setup is designed to allow experiments with different configurations.
Each parameter is optimized on a uniformly sampled subset of the data. The main model
is trained on all data with the optimized configuration for 20 epochs, using a learning
rate of 3e-3 with cosine annealing. A batch size of 16 is used with 16 gradient
accumulation steps, giving an effective batch size of 256. The training runs use
dropout, warm-up steps and early stopping. The model dimension is set to 96 and the
feed-forward dimension to 384; there are 6 encoder blocks with 6 attention heads each.
Each mixture-of-experts module is composed of 4 experts with top-1 routing.
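The batch-size arithmetic and the learning-rate schedule can be sketched as below. The base rate of 3e-3, the batch size and the accumulation steps are taken from the summary; the linear shape of the warm-up, the warm-up length, the total number of steps and the minimum rate of zero are placeholder assumptions.

```python
import math

BATCH_SIZE = 16
ACCUMULATION_STEPS = 16
# Gradients from 16 mini-batches of 16 records are summed before each update.
effective_batch_size = BATCH_SIZE * ACCUMULATION_STEPS


def learning_rate(step, total_steps, warmup_steps, base_lr=3e-3, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule length, used only for illustration.
TOTAL_STEPS = 1000
WARMUP_STEPS = 100
schedule = [learning_rate(s, TOTAL_STEPS, WARMUP_STEPS) for s in range(TOTAL_STEPS + 1)]
```

Gradient accumulation gives the optimizer the statistics of a 256-record batch while only 16 records need to fit in memory at a time.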

For testing and evaluation, metrics specific to the task are defined. The model is
trained on 80 % of the data, tuned on a 10 % validation set and tested on a 10 % test
set. On the test set, covering 6 datasets and 26 diagnoses in a multi-label setting,
the model achieves a 59.98 % macro F1 score, a 54.17 % exact match score, 55.33 %
top-1, 75.22 % top-2 and 84.80 % top-3 accuracies, and a 95.74 % macro AUC-ROC.
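Two of the reported metrics can be illustrated on toy data. The sketch below uses the common definitions of exact match (a record counts only if every label is predicted correctly) and macro F1 (the unweighted mean of per-class F1 scores); the function names and the toy labels are illustrative, and the thesis may compute these metrics with different thresholds or conventions.

```python
def exact_match(y_true, y_pred):
    """Fraction of records whose full label vector is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores for multi-hot label vectors."""
    n_classes = len(y_true[0])
    scores = []
    for j in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denominator = 2 * tp + fp + fn
        # A class with no true and no predicted positives scores 0 here (assumption).
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / n_classes


# Three records and three classes (placeholder values).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]

em = exact_match(y_true, y_pred)  # 1 of 3 records matches on every label
f1 = macro_f1(y_true, y_pred)     # per-class F1 of 1, 2/3 and 2/3
```

Exact match is the stricter of the two, since a single wrong label makes the whole record count as incorrect.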

The model achieves strong results on a difficult ECG classification scenario. The task
is a complex, multi-faceted problem that requires a high level of expertise.

  

UZMANLARIN KARIŞIMI BAZLI DÖNÜŞTÜRÜCÜ MODELİ İLE 12 
KANALLI EKG SİNYALİNİN ÇOK ETİKETLİ SINIFLANDIRILMASI 

ÖZET 

The electrocardiogram (ECG) measures the electrical activity of the heart and is an
important tool in the detection of heart diseases. ECG data are collected from patients
with 10 different electrodes, and data for 12 different leads are then derived from
these measurements through calculation. These leads describe the electrical activity of
the heart from different angles. ECG data are used extensively in the diagnosis of
heart diseases. ECG signals consist of electrical signals whose characteristics are
known and depend on the heartbeat. These signals behave differently in different
leads. Wave types such as P, Q, R, S and T are terms used to describe the
characteristic parts of the signals. The positions of these waves, their signal values
and their positions relative to one another contain important information for
diagnosis. However, to reach the correct diagnosis from data coming from 12 different
leads, the evaluator has to draw on different areas of expertise and methods. Making
correct diagnoses requires a high level of knowledge of the literature and experience.
Correct diagnoses can only be made by specialist doctors after a detailed examination
of these data.

Automating the process of diagnosing heart diseases from ECG data is a topic that
researchers have studied with interest. Early work on the topic addressed tasks such as
detecting peaks in ECG data and classifying beats. With advances in methods such as
machine learning and deep learning, however, these topics have been studied frequently
in the literature and have reached sufficient maturity with high accuracy rates.
Automatic diagnosis, a more complex type of problem, is still an active area that has
not been solved with high performance. Studies in this area try to solve the problem
using more advanced model types in addition to classical methods such as machine
learning. Competitions such as CinC, organized by the PhysioNet platform, have
motivated researchers in this area for several years. Within the scope of these
studies, progress has been made on the ECG signal classification problem. Another
factor that motivates work in this area is the abundance of datasets shared as open
source. These datasets, shared for research purposes from different regions of the
world, accelerate research.

New approaches are being applied as transformer-based models, which transformed the
field of large language models, are also applied in this area. Many studies try to
solve the ECG classification problem by using the self-attention and high
computational capability of transformer-based models.

The use of transformer-based models for time-series datasets is limited, and models
such as long short-term memory are used more intensively. For time series,
transformer-based models are generally used for forecasting in areas such as finance
and weather forecasting.



Mixture-of-experts models are a method that has been applied successfully in the field
of large language models with various implementations. In this method, the computation
layers in the model are divided into multiple experts, and incoming inputs are routed
to the experts according to their characteristics through a routing layer. The routing
function is also a layer learned by the model. To distribute inputs evenly among the
experts and give the experts equal utilization rates, additional loss functions are
included in the model's overall loss function. During the period in which this study
was carried out, studies using this type of model in the field of ECG classification
were very rare.

This study aims to predict the classes, possibly more than one, of ECG records that
have one or more diagnosis labels by using a mixture-of-experts based transformer
model. These predictions are made using 10-second, 12-lead ECG data sampled at 500 Hz
together with demographic information, such as age and sex, of the patient from whom
the data were collected. There are 26 different diagnoses in total, and each ECG record
has at least one diagnosis. The labels to be predicted are these diagnoses. In total,
81926 different labeled records collected from six different databases are used in the
study.

Digital signal processing methods were used for the preprocessing steps. First, so
that the data have a standard length, each record was cropped and resampled to be 10
seconds long at 500 Hz. Then, different filtering methods were tried in order to
suppress electrically induced noise in these records. Butterworth, finite impulse
response and infinite impulse response filtering methods and different frequency
values were compared on a subset according to their signal-to-noise ratios. The finite
impulse response filter and 500 Hz were chosen because they gave higher-quality
results.

In addition, a voting system with three different methods was set up to detect
outliers. Using the Z-score, principal component analysis and isolation forest
methods, the features of each record were examined and records exceeding certain
threshold values were flagged. Records that received 2 out of 3 votes were removed from
the dataset. In this way, higher-quality data content was obtained.

Another method frequently used in the literature is to feed additional external data
to the model besides the signal values. At this stage, known methods were used to
detect points such as the peak, onset and offset of different wave types for each lead
of the ECG records, and these were saved as additional features.

The model architecture was designed to be complex so that this time-based data could be
learned in detail. First, so that they could be fed to the model as input and their
temporal features could be learned, the ECG signal data from the 12 leads and the
extracted features were passed through a one-dimensional convolutional network to
obtain a data representation of increased dimension. This representation was then
merged through projection layers and fed as input to the transformer, the main layer of
the model. In addition to these two data channels, the demographic data are fed to the
model together with them through a similar projection. In the transformer block, so
that the basic features can be learned, half of the encoder layers were built with a
standard feed-forward neural network and the other half with networks containing the
mixture-of-experts module. The transformer-based model aims to build a network of
relationships within the data by using an attention mechanism and to understand the
data better. The expert structure aims to organize the learning layers of the model by
routing parts with different characteristics to different experts. In this way, the
model can exhibit sparse behavior.

For the training stage, a system was set up in general terms to track the metrics and
configurations to be used and to examine and record the experiments. Within this
system, experiments were carried out on a homogeneously distributed subset of the data
so that optimal settings could be found. In these experiments, some basic parameters
such as the learning rate, the model embedding sizes and the number of experts were
determined. The inner embedding dimension of the model was set to 96 and the hidden
layer size of all neural networks to 384. The transformer model uses 6 encoder blocks
with 6 heads in each block, and the expert system uses 4 experts in total with a single
expert active at a time. Using the optimal parameter configuration that was
determined, training was carried out for 20 epochs with a learning rate of 3e-3 on all
the data collected from six different datasets. The learning rate was decreased
throughout training with the cosine annealing method. The training batch size was set
to 16 and gradient accumulation to 16, which makes the effective batch size 256. At the
start of training, the learning rate increases linearly up to the base learning rate
during the warm-up steps. Early stopping is applied to prevent the model from
memorizing the data.

During training, the evaluations were examined using the special metrics defined for
the classification task. The trained model was tested on the test set formed from 10 %
of all the data. The tests yielded a 59.98 % F1 score, 54.17 % exact match, 55.33 %
top-1, 75.22 % top-2 and 84.80 % top-3 accuracy, and a 95.74 % AUC-ROC value. To
analyze the model's interpretability, sample data were analyzed using GradCAM and the
attention layers.

As a result of the study, the model showed good performance on a difficult task such as
ECG diagnosis classification. Considering that the interpretation and evaluation of
these data still cannot be assessed objectively even among experts in some cases, it is
understandable why the model's performance level is lower than the success rates in
such applications. In addition, the use of 6 different datasets, the differences in
the labeling standards of the datasets, the differences in the experience levels of
the labeling experts, and incorrect labels are some of the factors that affect data
quality.

 

1. INTRODUCTION

Healthcare is one of the main application areas of artificial intelligence (AI), where
it is used to diagnose, treat and research diseases efficiently and effectively. These
applications are motivated by the wide range of practical use cases in healthcare,
from vision-based detection to the diagnosis of genetic diseases. One of the main
application areas is the diagnosis of chronic heart diseases. According to the World
Health Organization (2024) and the Centers for Disease Control and Prevention (2024),
heart disease remains the leading cause of death regardless of gender or racial and
ethnic background. Given their substantial global impact on morbidity, mortality,
disability, and healthcare costs, early diagnosis of cardiovascular diseases (CVDs),
diabetes, and their risk factors is critical to implementing lifestyle modifications
and providing timely treatment for affected individuals (Facciolà et al., 2022, p. 939).

1.1 Information About Electrocardiogram

The electrocardiogram (ECG) is a method for measuring the electrical activity of the
heart. The measurements are taken from multiple locations, recorded as voltages and
then digitized. When plotted, these numerical values form waveforms called ECG
records, which help medical staff diagnose heart-related conditions. Based on the
detailed article in Cardiovascular Medicine, the sensors that measure these electrical
signals are called electrodes and are placed at 10 different locations on the body.
The measurements taken from the electrodes are then used to calculate 12 different
signals, which are called leads. Each lead is calculated by taking one electrode as
the reference and one or more others as exploring electrodes. The 12 leads represent
12 different viewing angles of the heart's electrical activity. Leads are divided into
two main groups, limb leads and chest leads. There are six limb leads, three of which
are augmented: I, II, III, aVR, aVL and aVF, where the leads starting with the letter
‘a’ are the augmented leads. Each augmented lead can be expressed as half the sum or
difference of two standard limb leads; for example, aVF = (II + III)/2. Limb leads
measure the electrical activity in the frontal plane. In contrast, chest leads measure
the electrical difference in the horizontal plane. The six chest leads are called V1,
V2, V3, V4, V5 and V6 (Rawshani, n.d.-b).
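The relationships between the limb leads can be made concrete with a short sketch. It uses the standard Einthoven and Goldberger relations, III = II - I, aVR = -(I + II)/2, aVL = (I - III)/2 and aVF = (II + III)/2. The function name and the sample values are illustrative and are not taken from the thesis.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Derive leads III, aVR, aVL and aVF from sampled leads I and II (in mV)."""
    # Einthoven's law: lead III is the difference between leads II and I.
    lead_iii = [ii - i for i, ii in zip(lead_i, lead_ii)]
    # Goldberger's augmented leads expressed through the standard limb leads.
    avr = [-(i + ii) / 2 for i, ii in zip(lead_i, lead_ii)]
    avl = [(i - iii) / 2 for i, iii in zip(lead_i, lead_iii)]
    avf = [(ii + iii) / 2 for ii, iii in zip(lead_ii, lead_iii)]
    return {"III": lead_iii, "aVR": avr, "aVL": avl, "aVF": avf}


# Two samples of leads I and II (placeholder values in mV).
lead_i = [0.5, 1.0]
lead_ii = [1.0, 0.5]

derived = derive_limb_leads(lead_i, lead_ii)
# derived["III"] == [0.5, -0.5]
# derived["aVF"] == [0.75, 0.0]
```

Since four of the six limb leads follow from leads I and II, the limb leads carry two independent signals, while the six chest leads are measured separately.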

The heart's contractions are triggered by electrical signals, which appear in the ECG
as a wave pattern for each cardiac cycle. A heartbeat is composed of several stages.
In the atrial depolarization stage, the depolarization signal travels from the
sinoatrial (SA) node through the atria, which forms the P wave. This minor peak is
followed by a flat segment until the QRS complex begins; the span from the onset of
the P wave to the onset of the QRS complex is called the PR interval. In the
ventricular depolarization stage, the QRS complex is formed. The Q, R and S waves are
the first negative deflection, the first positive deflection and the following
negative deflection, respectively. Together they correspond to the depolarization and
contraction of the ventricles. Following the QRS complex, the ST segment covers the
period in which both ventricles are fully depolarized. The T wave follows the ST
segment and represents the repolarization of the contractile cells. In leads V2-V4, a
U wave is also sometimes observed. These waves are depicted in Figure 1.1. The shapes
and amplitudes of the waves differ between leads, since each lead represents a
different angle of electrical activity (Klabunde, 2023; Rawshani, n.d.-c).

 
Figure 1.1: Annotated ECG signal waveforms (Rawshani, n.d.-a).

On some occasions, a Holter monitor is used instead of a standard ECG measurement, which lasts from 10

seconds to 120 seconds. The Holter monitor is a

portable ECG device that records heart activity for 24 hours or more (Holter

Monitor, n.d.). 



 Diagnostic Challenges 

The 12-lead ECG record gives a general overview of the heart activity and is

commonly used to detect a range of conditions. These diagnoses are made by

medical staff who are trained in interpreting ECG signals. However,

the inherent nature of the problem poses a significant challenge. The signal

amplitudes are on the order of millivolts (mV) and the signal is time-sensitive. On top of that,

the number of possible diagnoses to consider is large, and there is a large amount of

information in a 12-lead ECG signal. All of these factors make it difficult,

even for experienced staff, to interpret the signal clearly and effectively. Even though

there are established practices for understanding and interpreting ECG records, there is no

simple way of applying them.

This also raises a problem of objectivity, where different clinical

practitioners reach different results because of the complexity of the data. According

to a study by Dhutia et al., these gaps between interpretations of the same

ECG records are present not only among inexperienced physicians but also among

experienced physicians (2017, p. 7). Education in ECG diagnosis is

also a challenge in itself: as described in the work of Kashou et al., only a

fraction of the time in medical education is allocated to ECG interpretation

(2020, p. 125). Because ECG signals are measured at such a small scale,

noise and variability can affect the diagnosis results dramatically.

Noise sources such as power line interference, electrode contact noise, patient motion

artifacts, electrical activity in muscles, signal processing hardware or software noise, and

quantization loss disturb and change the profile of the original electrical

activity (Clifford, 2006, p. 69). 

 Machine Learning-Based Approaches 

ECG records are signals received from multiple sensors that measure

electrical voltage. Since these signals span a time interval, the problem becomes

a signal processing problem. Signal processing has been around for a long time, mainly

emerging from the field of communication (Shannon, 1948, p. 379). The methods in

this field have developed and evolved as advances in hardware have increased the available

computational power. Furthermore, with the rise of



supervised and unsupervised pattern recognition models, the signal processing 

problems are also tackled with these new models (Cooper & Cooper, 1964, p. 416). 

Today, machine learning and related technologies are increasingly used in the medical

field. From clinical decision support systems to the evaluation of public health records,

healthcare has been a prominent application area of machine learning (ML)

(Alanazi, 2022, p. 3). Not only the increase in computational power, but also the

availability of large datasets and data collection systems drives these applications,

since ML performs better in the presence of diverse data sources (Mazumder, 2024, p.

8). Medical data in the form of signals requires advanced feature extraction

methods because of its complex nature. ECG records are one of those cases where

complex behaviors in the data can be observed. Without enough information, it 

becomes difficult to interpret the data. 

Diagnosing a patient from their ECG record poses a great challenge

that requires expertise in medicine, signal processing and machine

learning or artificial intelligence-based methods. This interdisciplinary nature of the

problem gives an opportunity to apply a collection of advanced techniques together.

After the surge of ML-based applications, the field has moved to more complex and

generalizable models, the deep learning-based approaches. These approaches

can be scaled further to retain more information and perform more computation

within the model. For complex problems this advantage becomes the main

reason for the performance gain over traditional ML methods. Traditional machine

learning approaches excel at interpreting and estimating structured data such as

financial data, whereas deep learning-based approaches are much more useful for

unstructured data such as images, speech and language (Sharifani & Amini, 2023, pp.

3900–3901). Deep learning-based approaches are particularly useful for discovering

hidden patterns that require substantial computation to find. Since the model

architectures scale deeper or wider, there is more room for computation than in

traditional approaches. As can be seen in the work of Yu and Zhang, another

difference is that, in traditional methods, feature extraction is done outside of the model

with methods like principal component analysis (PCA) or linear discriminant analysis

(LDA). Deep learning-based approaches perform the feature

extraction as part of the architecture itself, as in recurrent neural networks

(RNN), convolutional neural networks (CNN) or transformers (Yu & Zhang, 2022, pp.

217, 219). The introduction of features from outside gives the model an external

information feed and bias, whereas letting the model figure out the patterns in the data 

without intervention lets the model generate representations of data by itself. This

attribute of deep learning-based methods causes them to behave in a complex way,

which makes them hard to interpret. A study in the medical field concludes that deep

learning-based models become much more difficult to interpret as the scale of the

model increases, which is necessary for increasing the accuracy of the models. The

work suggests using multiple modalities to interpret the model behavior and designing

interpretability methods for domain-specific use cases, for example for medical

diagnosis (Teng et al., 2022, p. 2351). 

 Transformer-Based Models 

The development of deep learning not only introduced a new way of approaching

complex problems but also brought new possibilities in terms of architectural

breakthroughs. With the surge of models in the field of image processing and

recognition, from basic CNNs to advanced architectures such as AlexNet (Krizhevsky

et al., 2012), VGG (Simonyan & Zisserman, 2015), Inception (Szegedy et al., 2014), 

and ResNet (He et al., 2016), the abilities of the models expanded greatly. These 

developments drove the application areas of image recognition such as autonomous 

cars.  

The same architectural breakthrough happened for natural language processing. Tasks

such as machine translation and language understanding were approached mainly with

specialized deep learning models, namely RNN (Rumelhart et al., 1986) and long short-term

memory (LSTM) (Hochreiter & Schmidhuber, 1997).

An important work by Sutskever, Vinyals and Le introduced the encoder-decoder 

LSTM architecture for machine translation tasks, which achieved high accuracy 

compared to other methods. This work introduced the encoder-decoder architecture 

into natural language processing (2014). In the same time period, another important

method, the attention mechanism, was successfully applied to machine translation.

After these developments in the field, the work of Vaswani et al.

combined methods such as residual connections from the ResNet architecture, the

attention mechanism and the encoder-decoder architecture into a new model type called the

transformer, which emerged as a successful implementation in language understanding

(2017). This architectural breakthrough drew attention in other fields, from

generative models to audio understanding, and acted as a catalyst for

their development.

The transformer architecture excels mainly at understanding sequences.

Even though its most successful applications are language understanding

and generation, there are also use cases where time-dependent understanding

is crucial. Weather forecasting (Kurth et al., 2023) and stock market

prediction (Wang et al., 2022) are common areas where the architecture is being used.

Even though the original transformer architecture has stayed the same with minimal

changes, a body of literature has evolved around it to fully capitalize on its potential with

different configurations. One of these configurations is the mixture-of-experts

approach proposed by Shazeer et al. where multiple experts are introduced into the 

model (2017). 

 Purpose of the Thesis 

Reliable and quick diagnosis of heart-related conditions, independent of the medical

staff and their experience, can have a significant impact. This is especially crucial in

regions with limited access to experienced professionals for ECG interpretation. This 

research explores the challenge of diagnosing a patient with multiple conditions by 

using their 12-lead ECG recordings combined with demographic features such as age 

and gender.  

The proposed approach classifies the ECG recordings with multiple labels

by using a mixture-of-experts (MoE) time-series transformer architecture that is also fed

exogenous feature information about the ECG waves and demographic attributes. The

implementation includes state-of-the-art architecture practices.



 LITERATURE REVIEW 

The ECG classification problem has attracted the attention of researchers for a long

time. The availability of data in the field and the importance of the task have

motivated studies with various technologies and methods. This literature

review covers work that uses machine learning technologies,

starting from the traditional machine learning methods and moving to the more advanced methods

which include the transformer architecture.

 Datasets 

For ECG analysis and classification, several open-source datasets are available.

These datasets help researchers build effective models on data with different

distributions and characteristics such as race, gender and region. Some important

datasets which are used commonly are INCART (Goldberger et al., 2000), MIT-BIH 

(Moody & Mark, 2001), PTB (Bousseljot et al., 2009), The China Physiological Signal 

Challenge (CPSC) (Ng et al., 2018), CODE 15% (Ribeiro et al., 2019), Chapman-

Shaoxing (Zheng, Zhang, et al., 2020), PTB-XL (Wagner et al., 2020), Ningbo (Zheng, 

Chu, et al., 2020), SPH (Liu et al., 2022).  

 Traditional Machine Learning Approaches 

Since discovering hidden features and patterns is harder for traditional machine

learning methods, studies apply feature extraction methods to the ECG data. In one

study, wavelets are used to extract features from the ECG records, then the extracted

features are used to train a support vector machine (SVM) to classify ECGs into six

categories (Daamouche et al., 2012, pp. 343–344). Another study uses features

such as QRS complex positions, RR interval, frequency features, and QS power to

train an SVM classifier (Zidelmal et al., 2013, pp. 572–573). Li et al. proposed a

five-level ECG signal quality classification method using an SVM trained on simulated

data with added real ECG noise. The model is tested on both re-annotated and synthetic 

data (2014, pp. 437, 442). Diker et al. compared SVM, artificial neural network

(ANN) and k-nearest neighbour (k-NN) classifier models for classifying ECG signals using

morphological features (2018). As seen in these studies, the feature

extraction process has a great impact on the classification results.

Several studies also included methods such as AdaBoost, Naïve Bayes, Gaussian Naïve

Bayes, LDA, logistic regression and random forest (RF) (Celin & Vasanth, 2018;

Hassaballah et al., 2023; Matin Malakouti, 2023; Pandey et al., 2020). 

As these studies show, a key limitation is the need to extract features externally from 

the models, which often prevents learning the temporal dependencies inherent in the 

time-sequence nature of ECG data. 

 Deep Learning-Based Approaches 

The transition from traditional machine learning approaches to deep learning-based

approaches changed how the features are extracted and fed into the model. Early work

on neural networks for ECG classification included more signal processing steps. In

one work, the authors used temporal and morphological features and wavelets to classify

ECG beat types on the MIT-BIH dataset with a simple neural network

with one hidden layer. They used the intra-patient method, where the train and test

sets are taken from different segments of the same patient (Das & Ari, 2014). Another work uses RNN, gated recurrent

unit (GRU) and LSTM models for a simple binary classification task on the MIT-BIH

dataset (Singh et al., 2018). 

Applying 1D CNN architectures to time series data is also an efficient way of learning

local temporal information. Tan et al. used CNNs along

with LSTMs to detect coronary artery disease (CAD) with

high accuracy (2018). Another work by Yildirim uses multiple levels of

decomposition with wavelets to extract features from ECG data. These features are 

used with an LSTM model to classify the record into five different categories 

successfully on the MIT-BIH dataset (Yildirim, 2018). 

Deep neural networks are also successfully applied to classify ECG records. In one 

study, the authors carefully gathered and annotated 91,232 ECG records and used a 

34-layer deep neural network (DNN) architecture to classify the records into 12 

different diagnoses without using any advanced preprocessing steps such as Fourier

or wavelet transforms (Hannun et al., 2019, p. 68). Specially annotated datasets are used



extensively in this field of research. The work by Ribeiro et al. used a dataset of more than 2

million labeled ECG recordings, called CODE, to predict diagnoses in 6 classes

with high F1 scores, using an end-to-end DNN model based on the ResNet architecture.

A 15% subset of the dataset was then openly published as the CODE-15% dataset

(Ribeiro et al., 2019). Some studies focus on the explainability of the models, since

the behavior of models consisting of neural networks is harder to interpret than that of traditional

methods. One work uses 1D CNN, bidirectional LSTM (BiLSTM) and 2D CNN in

combination within a network on PTB-XL, CODE-15% and Chapman Arrhythmia 

datasets. The model is then analyzed with SHapley Additive exPlanations (SHAP) 

(Lundberg & Lee, 2017) and Grad-CAM++ (Chattopadhyay et al., 2017) to understand 

where the model looks at the ECG records (Ayano et al., 2024). Another work uses an

AlexNet-based network to classify the beats in the MIT-BIH dataset into 5 different

beat types with different strategies such as early, intermediate and late fusion. The

work uses scalograms and phasograms to extract features from the recordings. The

results are analyzed to understand which parts of the scalograms and phasograms

the model attends to when classifying the beat (Scarpiniti, 2024).

 Transformer Architecture: Foundation and Evolution 

Model architectures entered a new phase after the breakthrough work of Vaswani

et al., which introduced the transformer architecture and carried studies beyond

traditional machine learning methods and earlier deep learning-based models (2017). The

model combines multiple concepts under the same architecture such as encoder-

decoder structure, self-attention mechanism, multi-head attention, positional encoding, 

layer normalizations, residual connections and feed forward networks. The model 

combines these known methods in a way that creates a robust model that allows time-

sequence tokens to attend to each other and carry information through the sequence. 
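How tokens attend to each other can be shown with a minimal sketch of scaled dot-product self-attention, the core operation of the architecture. The sketch is written in plain Python for readability and makes simplifying assumptions: a single attention head, no learned query, key and value projections, and hypothetical two-dimensional tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of token vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical 2-dimensional tokens; queries, keys and values are taken
# directly from the tokens here, without the learned projections of a real model.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output token is a weighted average of all tokens in the sequence, which is how information is carried across time steps.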

Even though the transformer architecture has fundamentally remained the same, some

tweaks have been proposed and adopted as better alternatives to some parts of the

model. The first advancement has been the change in the positional encoding from the

sinusoidal positional encoding used in the original architecture to rotary positional

embedding (RoPE) which integrates relative positional information directly into the 

self-attention mechanism through rotary transformations applied to the query and key 



vectors (Su et al., 2023). Another small change to the original architecture is the 

change of post-layer normalization to pre-layer normalization (Xiong et al., 2020). 
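The rotary transformation behind RoPE can be sketched in a few lines. The following is a minimal illustration, assuming the commonly used frequency base of 10000 and an even vector dimension; the query and key values are hypothetical. It rotates each consecutive pair of dimensions by an angle proportional to the token position, so that the dot product between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    dim = len(vector)  # assumed to be even
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
query = [0.5, -1.0, 2.0, 0.3]
key = [1.5, 0.2, -0.7, 1.0]

# The same relative offset (2) at two different absolute positions.
score_a = dot(rope(query, 3), rope(key, 1))
score_b = dot(rope(query, 10), rope(key, 8))
```

Both scores agree up to floating point error, which illustrates why the attention scores become a function of relative rather than absolute position.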

The transformer inspired a lot of work in the field of natural language processing (NLP)

and created a new field of research around generative models. Even though the

model has mainly been used in the field of NLP, its robust abilities have also

carried over into image processing and time-series based work. Dosovitskiy

et al. successfully applied the architecture to image recognition tasks, which

created the branch of transformer architectures called vision transformers (ViT),

which specialize in understanding images.

 Transformer Applications in ECG Classification 

One of the first applications of the transformer architecture to beat classification on

the MIT-BIH dataset uses the original transformer architecture, with the RR interval

values fused into the model before the final linear layer. The ECG data is fed into

the model by tokenizing it with a 1D CNN architecture (Yan et al., 2019). Other simple 

implementations are also present that use the vanilla transformer to classify the 

heartbeats (Akan et al., 2023). Classification of ECG recordings has been an attractive 

field of study among researchers. PhysioNet, which organizes the annual George B.

Moody Challenges, organized a challenge called Classification of 12-lead ECGs: the

PhysioNet/Computing in Cardiology Challenge 2020 (Alday et al., 2021). In this

competition, the first-placed team used a transformer-based model in which

hand-crafted features are selected according to a random forest model’s feature

importance values. Together with the age and sex features, 22 features in total are selected as

“wide” features. The “deep” feature extraction process includes an embedding part 

with convolution layers, and then the main transformer block where features about the 

raw ECG recording are extracted. These “wide” and “deep” features are then combined 

with a multi-label classification head to estimate the diagnoses (Natarajan et al., 2020). 

The same team submitted another transformer-based architecture, called the waveform

transformer, to the 2021 PhysioNet/CinC challenge (Reyna et al., 2021). It feeds

multi-lead ECG recordings to the model by splitting them into equal-length segments

and combining the segments with their positional

encoding, a method similar to the ViT embedding layer implementation (Natarajan

et al., 2021). In another study, the authors used a two-branch system where each branch 



had 1D CNNs to embed the data, which then is fed into the transformer model. In one 

branch the raw ECG data is given and in the other branch the RR interval values are 

given to the model, which are then concatenated (Atiea & Adel, 2022, p. 358). 

Embedding the ECG recording information with 1D CNN models has been used in 

another work. However, the work uses not only an encoder but also a decoder block 

which gets an input of object queries that are learnable parameters. The work focuses 

on the beat classification task and uses MIT-BIH dataset (Hu et al., 2022, p. 5). In a 

different study, ECG recordings are fed into the model with shifted time windows to 

save computational cost and let the model focus on local features in the recording. 

Moreover, the ECG recordings are split into patches to reduce the computation even 

more. The different length time windows are then given to different transformer blocks 

and concatenated in the end (Cheng et al., 2023, p. 10). 
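The segment-based tokenization mentioned above for the waveform transformer and the patch-based models can be sketched as follows. This is one possible reading of the idea and not the implementation of any cited work: the function name, the sampling rate and the segment length are assumptions chosen for illustration, the position index stands in for a positional encoding, and the linear projection that would normally follow is omitted.

```python
def segment_recording(recording, segment_length):
    """Split a multi-lead recording into equal-length segments (tokens).

    recording: list of leads, each a list of samples of equal length.
    Returns a list of (position, token) pairs, where each token concatenates
    the samples of all leads inside one time segment.
    """
    n_samples = len(recording[0])
    n_segments = n_samples // segment_length  # a trailing remainder is dropped
    tokens = []
    for position in range(n_segments):
        start = position * segment_length
        token = []
        for lead in recording:
            token.extend(lead[start:start + segment_length])
        tokens.append((position, token))
    return tokens


# Hypothetical 12-lead, 10-second recording sampled at 500 Hz (all zeros here).
recording = [[0.0] * 5000 for _ in range(12)]
tokens = segment_recording(recording, segment_length=50)
```

With these assumed sizes the 5000 samples per lead become 100 tokens of 600 values each, which shortens the sequence the attention mechanism has to process.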

Training schemes can also vary from fully supervised learning to alternative setups. According to one

study, masking parts of ECG data and training a transformer to reconstruct the signal

helps the model classify the signal better with learned spatiotemporal representations

(Hu et al., 2023). Similarly, a transformer architecture with CNN and denoising

autoencoders (DAE) can be successfully applied to the heartbeat classification

problem as an alternative to the usual supervised implementations (Xia et al., 2023).

ViT-based models achieve strong results on ECG datasets thanks to their vision capabilities.

For example, one study fine-tunes the DinoV2 model on images of 2-second

ECG recordings from the CODE-15% dataset and reaches high scores on the test

sets (Singh et al., 2023). Another work utilizing the ViT structure uses recurrence

plots, Gramian Angular Field and Markov Transition Field in combination to solve

a simple binary classification problem on the ECG signal (Indrasiri et al., 2024).

Capitalizing on domain knowledge is key in implementations. One work uses separate

embedding convolutional layers for limb and chest leads, and uses 2D

convolutional layers to fuse the features learned from all leads and use them as the value

part of self-attention (Ji et al., 2024). An interesting work by Fu et al. on using a generative pre-trained

transformer (GPT) to classify ECG recordings demonstrates

the capability of an architecture that maps the ECG records and their diagnosis text

data to the same latent space with a Contrastive Language-Image Pre-training (CLIP)-like

embedding (2024). In another work, a variation of the GPT model is used to train a base model for

ECG and photoplethysmography (PPG) separately. The models predict

the next time step autoregressively. The authors studied the interpretability of the models

extensively to understand their behavior. Different self-attention heads

specialize in important parts of the recordings such as R peaks, Q and P waves. The

model is fine-tuned further for an atrial fibrillation classification task, then the model

is analyzed again to understand the attended tokens (Davies et al., 2024). Another work

on generative structures forecasts ECG records with different time series

foundational models and benchmarks them for their speed and performance (Ali et al.,

2024).  

Special configurations of the transformer model can be used for this task to effectively

classify the diagnosis. One study uses a bidirectional transformer with multi-scale

convolutions as an architecture with a special context-aware loss. The work uses

convolutions with different kernel sizes for embedding the records, then the

embeddings are further refined with a module that calibrates the channel-wise patterns.

The calibrated feature map and its reversed form are fed into the bidirectional

transformer for classification (El-Ghaish & Eldele, 2024). In another study, the authors 

converted ECG recordings into images and used them as the primary training data, 

along with features extracted from raw signal values. They trained models based on 

1D CNNs combined with GRUs or LSTMs, as well as 2D CNNs, ResNet, and Vision 

Transformers for arrhythmia classification, finding that the GRU and LSTM-based 

models outperformed both pretrained and scratch-trained ViT-based models (Apostol 

& Nutu, 2025). 

Foundational models for time series have been widely researched in recent years, such

as Informer (Zhou et al., 2020), Reformer (Kitaev et al., 2020), FEDFormer (Zhou et

al., 2022), PatchTST (Nie et al., 2022), Autoformer (Wu et al., 2021) and Crossformer

(Zhang & Yan, 2023). An adaptation of this type of model to the medical field has been

studied by Wang et al., where a foundational model for the medical field is trained

with multi-channel patching and multi-granularity and afterwards tested on five different

medical datasets including an ECG dataset. Its performance on the PTB-XL dataset

did not match that of other foundational time series models (Wang et

al., 2024). 



 Mixture-of-Experts Models 

After the great success of transformer models in the domain of generative tasks,

research in the large language model (LLM) field has moved to different architectures.

Some work focused on alternative base architectures such as state-space models

like Mamba (Dao & Gu, 2024). On the other hand, some work focused on

building upon the transformer model. One of the main architectures

adopted by different LLM developers is the MoE model. MoE models are special

architectures that introduce expert and router modules. The main study by Shazeer et al.

was published before the original transformer work as a separate structure using

LSTM modules with a router and experts. The work proposed a special module

which routes each token with a trainable gating network to one or multiple experts,

which are simple feed-forward layers. The module output is a sparse combination of

the expert outputs. The number of experts to be activated is decided beforehand, and the

activation rates are controlled by importance and load losses. The importance loss

keeps the model from converging to the use of only a single expert, and the load loss ensures

the experts receive a similar number of tokens. The trained models with MoE

architecture achieved high results in machine translation tasks (Shazeer et al., 2017, 

pp. 2, 5). 
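The routing idea can be made concrete with a minimal sketch of top-k gating and of an importance loss. It is a simplified illustration under stated assumptions, not the implementation of the cited work: the router logits are given directly instead of being produced by a trained gating network, the noise term and the load loss are omitted, and the experts are hypothetical element-wise functions. The importance loss is written here as the squared coefficient of variation of the per-expert gate sums over a batch.

```python
import math


def top_k_gates(router_logits, k):
    """Softmax over the k largest router logits; all other gates are zero."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    selected = ranked[:k]
    peak = max(router_logits[i] for i in selected)
    exps = {i: math.exp(router_logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0
            for i in range(len(router_logits))]


def moe_layer(token, router_logits, experts, k):
    """Sparse combination: only experts with a non-zero gate are evaluated."""
    gates = top_k_gates(router_logits, k)
    output = [0.0] * len(token)
    for gate, expert in zip(gates, experts):
        if gate > 0.0:
            expert_out = expert(token)
            output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, gates


def importance_loss(batch_gates):
    """Squared coefficient of variation of the per-expert gate sums."""
    n_experts = len(batch_gates[0])
    importance = [sum(g[e] for g in batch_gates) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    variance = sum((x - mean) ** 2 for x in importance) / n_experts
    return variance / mean ** 2


# Hypothetical experts: simple element-wise functions instead of trained layers.
experts = [lambda t: [2 * x for x in t],
           lambda t: [x + 1 for x in t],
           lambda t: [-x for x in t]]
output, gates = moe_layer([1.0, 2.0], router_logits=[2.0, 2.0, -1.0],
                          experts=experts, k=2)
```

In this example the third expert is never evaluated, which is the source of the computational savings. The importance loss is zero when all experts receive the same total gate value and grows as the router concentrates on a few experts.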

MoE models were then merged with the transformer implementation. First, the training

optimization of the architecture was studied by Lepikhin et al., who sharded the MoE experts

across multiple graphics processing units (GPU) to train a 600 billion parameter model

(2020). Afterwards another work, departing from the original MoE work, proposed

routing each token to only a single expert, with an architecture called the Switch

Transformer. This approach also introduced a capacity factor, which sets a fixed limit

on the number of tokens each expert can process per batch. Tokens routed to an expert 

beyond this capacity limit are typically dropped, a mechanism designed to ensure 

balanced computational load and efficiency. The MoE models allowed the use of 

partial activation of the parameters in the inference to lower the required time to get 

output (Fedus et al., 2021). Zoph et al. refined the implementation further by 

introducing the z-loss to ensure the stability of the training runs. The same work 

concluded that using a capacity factor of 1.25 and top-2 expert routing is a successful

implementation for sparse modeling in the LLM field (2022). 
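The capacity factor and the z-loss can be illustrated with a short sketch. It assumes that the expert capacity is the number of tokens per batch divided by the number of experts and multiplied by the capacity factor, and that the z-loss is the mean squared log-sum-exp of the router logits. Rounding the capacity down, the function names and the example logits are assumptions made for illustration.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum number of tokens a single expert may process in one batch."""
    return int(tokens_per_batch / num_experts * capacity_factor)


def route_top1(batch_logits, capacity):
    """Send each token to its highest-scoring expert; drop it if the expert is full."""
    num_experts = len(batch_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in batch_logits:
        expert = max(range(num_experts), key=lambda i: logits[i])
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
        else:
            assignments.append(None)  # token is dropped by the MoE layer
    return assignments


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits, penalizing large logits."""
    total = 0.0
    for logits in batch_logits:
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp ** 2
    return total / len(batch_logits)


# Hypothetical batch of 4 tokens and 2 experts; three tokens prefer expert 0.
batch_logits = [[3.0, 0.0], [2.0, 1.0], [4.0, 0.5], [0.0, 1.0]]
capacity = expert_capacity(tokens_per_batch=4, num_experts=2, capacity_factor=1.0)
assignments = route_top1(batch_logits, capacity)
```

Here each expert can take two tokens, so the third token that prefers expert 0 is dropped. A capacity factor above 1.0 leaves headroom for such imbalance at the cost of extra computation and memory.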



 Mixture-of-Experts for Time Series Classification 

After the spread of MoE model implementations in the field of LLMs such as Mixtral 

(Jiang et al., 2024), the implementations in the time-series domain also became a 

relevant topic for researchers. One of the first studies on MoE based models on time-

series data focuses on forecasting tasks. These tasks range from weather forecasting to 

transportation. Three separate models are trained, the largest being a 2.4B

parameter model with 1.1B activated parameters. The models are benchmarked against

time-series forecasting foundation models and perform better on most datasets while

remaining competitive in terms of activated parameter count (Shi et al., 2025).

 Mixture-of-Experts Applications in ECG Analysis 

Time-series based MoE models in the medical field have recently attracted the attention of

researchers. The success of MoE models in the LLM field and

experiments in the time-sequence domain motivated the research in the field. Han

et al. contributed a broad study that focuses on a variable number of modalities and

potentially missing inputs. The proposed architecture, FuseMoE, handles inputs of a

patient such as vital signs, ECG signals, clinical notes and chest X-rays. The

variability in the input type, which can be time series, text or image, introduces several

challenges, which the authors address with a special MoE transformer architecture

and gating function. The models are benchmarked on several datasets and perform

better with the new gating function (2024, pp. 2–4).

Another important work uses an MoE architecture on the CODE-15% dataset to classify

the ECG signals into 6 diagnoses. The important distinction of the study is pre-training

the gating network to classify the diagnoses into three distinct super categories to help

the model distinguish related diagnoses. The MoE model achieved an 84.96% F1

score on the CODE-15% test set (Chaves et al., 2024).

 Mixture-of-Experts Transformer for ECG Classification 

Most work in the field of ECG classification approaches the problem with machine 

learning, deep learning, or transformer-based models. Classification tasks on ECG

records usually involve classifying either beats or diagnoses. Beat classification

tasks can be handled with simpler approaches and models, and the literature around them is



mostly saturated with work that achieves high levels of accuracy. On the other hand,

detecting diagnoses from ECG records requires more sophisticated

approaches and models, since the problem is more complex and involves the whole

ECG record. Also, the number of possible diagnoses exceeds the number of beat types. The

literature around diagnosis detection is still in development with different approaches 

being explored.  

The success of mixture-of-experts models in the field of LLMs motivated researchers

to apply them to different fields of work. However, for ECG classification tasks,

there are currently only a few studies on MoE-based transformer model

implementations.  

In this work, a mixture-of-experts transformer model for diagnosis classification of

ECG records is proposed with external feature fusion. To the best of the author's 

knowledge, this study focuses on an approach for which only limited research exists 

in the literature to date.



 METHOD 

In the present study, the introduced method approaches the problem of classifying diagnoses from ECG

records with a mixture-of-experts transformer model with external

feature fusion. The work focuses on 26 different diagnosis types, labeled in a multi-label

fashion, from six different datasets. The aim of the study is to classify an ECG record

by using a 10-second window of it and demographic features such as age and gender.
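Since a single record can carry several diagnoses at once, the task is a multi-label rather than a multi-class problem. The difference can be illustrated with a minimal sketch that uses an independent sigmoid output per label together with a binary cross-entropy loss. This is a common formulation shown only as an assumption; it is not necessarily the output layer or loss used in this study, and the logits below are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Each label is decided independently, so several can be active at once."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def binary_cross_entropy(logits, targets):
    """Mean per-label binary cross-entropy for one record."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


# Hypothetical logits for 4 of the diagnosis labels of one ECG record.
logits = [2.0, -1.5, 0.3, -3.0]
targets = [1, 0, 1, 0]  # multi-hot target: two diagnoses are present
predicted = predict_labels(logits)
loss = binary_cross_entropy(logits, targets)
```

Unlike a softmax output, the per-label probabilities do not have to sum to one, so the presence of one diagnosis does not suppress another.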

3.1 Dataset

ECG classification has been studied for a long time and has attracted sustained interest
from research communities. A main driver of these studies has been the availability of
datasets in the domain. Unlike fields where the absence of open-source datasets inhibits
research, ECG records of patients from different countries and with different
characteristics are available as open-source datasets. Since interpreting, labeling, and
validating the records requires domain expertise, datasets labeled by domain experts are
of immense value to researchers from other fields.

One of the oldest and most widely used datasets in this field is the MIT-BIH Arrhythmia
dataset, which is primarily used for beat classification (Moody & Mark, 2001). Another
important dataset, for the diagnosis classification task, is the PTB dataset, which has
been used extensively since the early work in the field and is still used today. Although
the number of samples in the dataset is small, it is one of the main resources that
motivated later studies (Bousseljot et al., 2009). Different parties and hospitals have
gathered such datasets for different purposes. PhysioNet, a platform that publishes
open-source datasets in the medical field, collected several of these datasets and
organized a competition on the classification of ECG records. PhysioNet handles the
collection, standardization, storage, and provision of the datasets (Goldberger et al.,
2000). The PhysioNet/Computing in Cardiology Challenge 2021 (Reyna et al., 2021) used
multiple source databases, some open and some undisclosed for evaluation purposes. In this
work, a subset of the openly available datasets is used for training and evaluation. All
datasets are downloaded from the PhysioNet platform. The datasets used are the CPSC and
CPSC-Extra (Ng et al., 2018), PTB-XL (Wagner et al., 2020), Georgia 12-lead ECG Challenge
(G12EC) (Alday et al., 2021), Chapman-Shaoxing (Zheng, Zhang, et al., 2020), and Ningbo
(Zheng, Chu, et al., 2020) databases. The competition also included the PTB and St.
Petersburg INCART (Goldberger et al., 2000) datasets; however, because of their different
sampling frequencies and low number of records, these datasets are excluded from the
training and evaluation processes. The details of the datasets can be seen in Table 3.1.

Each dataset entry consists of an ECG record of varying length, the demographic features
of the patient, and the diagnosis labels. The headers and the records are all standard and
follow the Waveform Database (WFDB) format (Moody, 2022). An example header can be seen in
Figure A.1.

As mentioned before, the dataset labels are in multi-label format; that is, a patient may
have been diagnosed with multiple labels at once. Different diagnoses can coexist in
different combinations, which makes the task a multi-label classification problem. In the
WFDB headers, the labels are provided as standard SNOMED-CT (SNOMED International, n.d.)
codes, each of which corresponds to a specific diagnosis.

Table 3.1 : Dataset information. 

Dataset Name             Number of Records   Average Duration (s)   Frequency (Hz)
PTB                      116                 113.63                 1000
PTB-XL                   21593               10.00                  500
Chapman-Shaoxing         9708                10.00                  500
CPSC 2018                5274                15.40                  500
CPSC 2018 Extra          1277                16.36                  500
Georgia (G12EC)          9456                9.98                   500
Ningbo                   34468               10.00                  500
St. Petersburg INCART    34                  1800.00                257

The competition scores only a subset of the diagnoses. Diagnoses with a low number of
occurrences across all the datasets are excluded from the scoring, so that the focus is on
diagnoses with a considerable number of samples. A total of 103 diagnoses are present in
the original datasets, of which only 30 are scored. Since some diagnoses have similar
characteristics, the challenge organizers decided to merge them into single diagnoses
(Reyna & Sadr, 2021). The merged diagnoses can be seen in Table 3.2. After merging, 26
unique diagnoses remain.

Table 3.2 : Merged diagnoses groups. 

Merge Group   Original Label Name                     SNOMED-CT Code
1             Complete Left Bundle Branch Block       733534002
1             Left Bundle Branch Block                164909002
2             Complete Right Bundle Branch Block      713427006
2             Right Bundle Branch Block               59118001
3             Premature Ventricular Contractions      427172004
3             Ventricular Premature Beats             17338001
4             Premature Atrial Contraction            284470004
4             Supraventricular Premature Beats        63593006

Before the unscored diagnoses are filtered out, the datasets include 88212 ECG records in
total. After keeping only the 26 unique diagnoses, 6286 records are removed, leaving 81926
records with a total of 129307 diagnosis labels. The distribution of labels is given in
Table A.1. Each record has the age and gender of its patient in the header file. These
demographic statistics and the number of distinct labels can be seen in Table 3.3 for each
dataset.

Table 3.3 : Statistics of datasets. 

Dataset                  Average Age   Number of Males   Number of Females   Number of Distinct Diagnoses
PTB                      45.82         85                31                  5
PTB-XL                   59.47         11229             10364               22
Chapman-Shaoxing         60.36         5485              4223                19
CPSC 2018                61.96         3011              2263                6
CPSC 2018 Extra          66.72         751               526                 20
Georgia (G12EC)          60.67         5094              4362                23
Ningbo                   58.08         19479             14975               25
St. Petersburg INCART    58.26         20                14                  9

Of these databases, PTB and St. Petersburg INCART are excluded, leaving 81776 ECG records.
Of all the records, 60.7% have a single label and 25.64% have two labels. The distribution
of the number of diagnoses per record by dataset can be seen in Figure A.2.

The most frequent diagnosis across all datasets is sinus rhythm, which corresponds to a
normal ECG record. A record labeled as sinus rhythm can nevertheless carry other labels as
well. Some diagnoses frequently occur together, whereas others rarely coexist. The
co-occurrence of diagnoses can be seen in Figure 3.1.

 
Figure 3.1 : Co-occurrence heatmap of top 10 diagnoses.

3.2 Preprocessing

Training a neural network on clean data improves model performance. Since the model learns
from the patterns in the data, inconsistencies in the data can lead to flawed learned
representations and cause the model to misinterpret its inputs.

In this study, the preprocessing of the ECG records is an important step in preparing the
dataset for training. The filtering and standardization of records, the detection of
outlier records, and the preparation of the data for training are analyzed extensively.

First, after the datasets are retrieved from the PhysioNet database, the records are read
with the scipy.io.loadmat function in Python, as instructed in the competition manual
(Alday et al., 2021; Reyna et al., 2021). As described beforehand, each record has its own
length and sampling frequency, resulting in a different number of samples. The datasets
selected for this study only include records sampled at 500 Hz; however, the length of the
records varies. The ECG signals are collected from 12 different channels, as described in
Section 1.1. These channels, called leads, together form a 12-lead ECG signal.
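The reading step can be sketched as follows. This is a minimal illustration, assuming that the signal matrix is stored under the `val` key of the `.mat` file, as in the challenge's example code; the helper name `load_record` is hypothetical.

```python
import numpy as np
from scipy.io import loadmat


def load_record(mat_path):
    """Read one ECG record as a float array of shape (n_leads, n_samples)."""
    contents = loadmat(mat_path)
    # Assumption: the samples are stored under the 'val' key of the file.
    return np.asarray(contents["val"], dtype=np.float64)
```

For a 10-second, 12-lead record sampled at 500 Hz, the returned array would have shape (12, 5000).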

To obtain a consistent input size, all records are padded or cropped. Since all of the
records are sampled at 500 Hz, no resampling is needed. Most records in the dataset are 10
seconds long. Some records are longer or shorter, which results in a different number of
samples per record. Therefore, a record shorter than 10 seconds is padded with zeros, and
a record longer than 10 seconds is shortened by cropping a randomly positioned 10-second
window. This operation is applied to every channel, resulting in 5000 sample points in the
time domain for each of the 12 channels of a record.
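The padding and cropping rule can be sketched as below. The function name `pad_or_crop` is hypothetical, and placing the zeros at the end of a short record is an assumption, since the text does not state where the padding is added.

```python
import numpy as np


def pad_or_crop(record, target_len=5000, rng=None):
    """Return a (n_leads, target_len) copy of a (n_leads, n_samples) record."""
    rng = np.random.default_rng() if rng is None else rng
    n_leads, n_samples = record.shape
    if n_samples >= target_len:
        # Crop the same randomly positioned window from every lead.
        start = int(rng.integers(0, n_samples - target_len + 1))
        return record[:, start:start + target_len].copy()
    # Assumption: zeros are appended after the last sample.
    out = np.zeros((n_leads, target_len), dtype=record.dtype)
    out[:, :n_samples] = record
    return out
```

Using one window position for all leads keeps the 12 channels aligned in time.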

Standardizing the number of sample points is crucial for feeding the records to the model.
More important, however, is obtaining a stable record that is cleansed of noise from
different sources. In the signal processing domain, this is done with filtering methods
that are used extensively for frequency-based signals. As listed before, various sources
can cause such noise in the signal (Clifford, 2006, p. 69). In this study, denoising the
ECG signals is a critical part of the preprocessing pipeline. Infinite Impulse Response
(IIR) filters, such as the Butterworth filter, and Finite Impulse Response (FIR) filters
are commonly used in this step. The Butterworth filter is applied for its smooth response
and ability to suppress high-frequency noise without introducing significant distortion
(Butterworth, 1930). IIR filters, known for their efficiency and analog-based design,
offer a practical solution for real-time filtering tasks. FIR filters, with their linear
phase properties, are used where phase integrity is essential. These filtering methods
help ensure that the model is trained on cleaner and more consistent data, reducing the
risk of misinterpretation caused by noise artifacts.

To select the best filtering and sampling strategy, different configurations were compared
by their signal-to-noise ratios (SNR) on a subset of 100 diverse ECG records. For
filtering, both FIR and Butterworth filters were tested with a range of parameters,
including different cutoff frequencies, filter orders, and numbers of taps. The
performance of these filters was assessed primarily with the SNR, which was calculated
from power spectral density estimates of the signals. For the sampling frequency, the
effects of multiple rates, namely 125 Hz, 250 Hz, 500 Hz, and 1000 Hz, were analyzed.
Performance was again judged on SNR, considering the trade-off with data size and
computational load.
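One possible way of estimating an SNR from a power spectral density is sketched below. The text does not define which frequencies count as signal, so the 0.5 to 40 Hz band, the Welch segment length, and the function name `band_snr_db` are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import welch


def band_snr_db(x, fs=500.0, band=(0.5, 40.0), nperseg=1024):
    """Ratio (in dB) of PSD power inside `band` to PSD power outside it."""
    freqs, psd = welch(x, fs=fs, nperseg=nperseg)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    signal_power = psd[in_band].sum()   # assumed "signal" band
    noise_power = psd[~in_band].sum()   # everything else is treated as noise
    return 10.0 * np.log10(signal_power / noise_power)
```

Under this definition, a filter configuration that removes more out-of-band power yields a higher SNR for the same record.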

While both FIR and Butterworth filters removed noise effectively, the FIR filter was
ultimately selected for this application. The chosen FIR configuration, which maximized
the SNR in the analysis, preserves the critical morphological features of the ECG
waveforms while attenuating unwanted noise. Based on an assessment of the trade-off
between signal fidelity and data volume, a sampling frequency of 500 Hz was selected. This
frequency provides a sufficiently detailed representation of the ECG signal for the
intended tasks while keeping the computational load manageable. The final preprocessing
pipeline therefore uses FIR filtering and a 500 Hz sampling frequency. The final FIR
parameters, namely the number of taps and the cutoff values, are decided in the training
process by checking different evaluation metrics and various literature studies, which
will be described further.
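A minimal sketch of FIR band-pass filtering with SciPy is given below. The number of taps and the cutoff frequencies are placeholder values, not the parameters selected in this study, and the zero-phase `filtfilt` application is an assumption.

```python
import numpy as np
from scipy.signal import firwin, filtfilt


def fir_bandpass(record, fs=500.0, numtaps=101, low_hz=0.5, high_hz=40.0):
    """Apply a linear-phase FIR band-pass filter along the time axis."""
    # Placeholder parameters; the tuned values are not reproduced here.
    taps = firwin(numtaps, [low_hz, high_hz], pass_zero=False, fs=fs)
    # Forward-backward filtering (zero phase) is an assumption of this sketch.
    return filtfilt(taps, [1.0], record, axis=-1)
```

The function accepts a (12, 5000) record and filters every lead independently.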

As a standard preprocessing step, the records are normalized for each channel separately
using Z-score normalization. The normalized signal is calculated as in Equation 3.1 for
the i-th lead at time t, where μ_i and σ_i are the mean and standard deviation of the i-th
lead and ε is a small positive value.

\[ Z_{i,t} = \frac{X_{i,t} - \mu_i}{\sigma_i + \varepsilon} \tag{3.1} \]
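Equation 3.1 translates directly into a few lines of NumPy. The value of ε used here is an assumed placeholder.

```python
import numpy as np


def zscore_per_lead(record, eps=1e-8):
    """Normalize each lead (row) of a (n_leads, n_samples) record separately."""
    mu = record.mean(axis=1, keepdims=True)
    sigma = record.std(axis=1, keepdims=True)
    # eps keeps the division finite for flat (zero-variance) leads.
    return (record - mu) / (sigma + eps)
```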

The demographic features are normalized as well. The age values are divided by 100, under
the assumption that the maximum age is 100. The gender values are replaced with 0 for
female and 1 for male. To handle missing age values, the mean age per dataset and gender
is calculated and used as a replacement. Anomalous age values are handled in the same way.
These anomalies include 201 records in the PTB-XL dataset with an age value of 300, 211
records in the Ningbo dataset with an age value of zero, 206 records in various datasets
with “NaN” as their age value, and 4 records in the CPSC dataset with an age value of -1.
As a result, the age values of a total of 622 records are replaced with the mean age of
their dataset and gender.
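The imputation and encoding described above can be sketched with pandas as follows. The column names (`dataset`, `sex`, `age`), the `Female`/`Male` strings, and the function name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

INVALID_AGES = [300, 0, -1]  # anomalous age values reported in the text


def encode_demographics(df, max_age=100.0):
    """Add normalized age and binary gender columns to a copy of df."""
    out = df.copy()
    age = out["age"].astype(float)
    age = age.mask(age.isin(INVALID_AGES))  # anomalies become NaN
    # Mean of the valid ages per (dataset, gender) group, aligned to each row.
    group_mean = age.groupby([out["dataset"], out["sex"]]).transform("mean")
    out["age_norm"] = age.fillna(group_mean) / max_age
    out["sex_code"] = out["sex"].map({"Female": 0, "Male": 1})
    return out
```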

3.2.1 Outlier analysis 

Another important step in increasing the quality of the dataset is detecting outlier
records with measurement anomalies. Such outliers can cause the model to learn wrong
representations. Although the outliers still have labels, their characteristics do not
reflect the usual presentation of the diagnosis. A multi-method approach is used to detect
them: three different methods are applied, and a majority vote among the methods decides
which records are eliminated. Some outlier examples can be seen in Figure 3.2.

First, all records are preprocessed with the chosen parameters as described before. Two
feature sets are then computed for each record: statistical features and frequency-based
features. The statistical features are the mean, median, standard deviation, minimum,
maximum, and the 25th and 75th percentiles of every lead. For the frequency-based
features, the Power Spectral Density (PSD) of the record is calculated per lead with
Welch’s method. These features are fed into three distinct methods: the Z-score method,
PCA, and Isolation Forest.
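A sketch of how such a feature vector could be assembled is given below. The Welch segment length and the way the two feature sets are concatenated into one vector are assumptions; the text does not specify them.

```python
import numpy as np
from scipy.signal import welch


def record_features(record, fs=500.0, nperseg=512):
    """Concatenate per-lead statistics and per-lead Welch PSD values."""
    stats = np.stack([
        record.mean(axis=1),
        np.median(record, axis=1),
        record.std(axis=1),
        record.min(axis=1),
        record.max(axis=1),
        np.percentile(record, 25, axis=1),
        np.percentile(record, 75, axis=1),
    ], axis=1)                                    # (n_leads, 7)
    _, psd = welch(record, fs=fs, nperseg=nperseg, axis=-1)
    return np.concatenate([stats.ravel(), psd.ravel()])
```

With 12 leads and `nperseg=512`, the vector holds 12 × 7 statistical values followed by 12 × 257 PSD values.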

Figure 3.2 : Outlier examples for different detection methods. 



For the Z-score method, the median and the median absolute deviation (MAD) of each feature
are used in the Z-score calculation instead of the mean and standard deviation. The MAD is
calculated as the median of the absolute differences between each feature value and the
median of that feature. The modified Z-score, denoted Z*, is calculated as in Equations
3.2-3.4, where x_{i,k} is the value of the k-th feature for the i-th record and N is the
number of records. This calculation follows Iglewicz and Hoaglin (1993, p. 11), so that
the Z-score is not affected by a few outliers. Since ECG records have periodic peaks that
are higher than the average value of the record, the modified Z-score is found to be a
better alternative in this study.

\[ \tilde{x}_k = \operatorname{median}\left(x_{1,k}, x_{2,k}, \ldots, x_{N,k}\right) \tag{3.2} \]

\[ \mathrm{MAD}_k = \operatorname{median}\left(|x_{1,k} - \tilde{x}_k|, |x_{2,k} - \tilde{x}_k|, \ldots, |x_{N,k} - \tilde{x}_k|\right) \tag{3.3} \]

\[ Z^{*}_{i,k} = 0.6745 \, \frac{|x_{i,k} - \tilde{x}_k|}{\mathrm{MAD}_k} \tag{3.4} \]

A record is flagged as an outlier by this method if more than 25% of its features have a
modified Z-score higher than the threshold of 3.
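The modified Z-score rule of Equations 3.2-3.4 can be sketched as follows for a feature matrix with one row per record. The small constant added to the MAD is an assumption to avoid division by zero for constant features.

```python
import numpy as np


def modified_zscore_flags(X, z_thresh=3.0, frac_thresh=0.25, eps=1e-12):
    """Flag rows of X (n_records, n_features) using median/MAD based Z-scores."""
    med = np.median(X, axis=0)                    # Equation 3.2
    mad = np.median(np.abs(X - med), axis=0)      # Equation 3.3
    z = 0.6745 * np.abs(X - med) / (mad + eps)    # Equation 3.4
    # A record is flagged when more than 25% of its features exceed the threshold.
    return (z > z_thresh).mean(axis=1) > frac_thresh
```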

For the PCA method, principal components are computed for each feature set, and the number
of components is chosen as the smallest number for which the components explain at least
95% of the variance of the features. The record features are then transformed with these
components and reconstructed from them, and the reconstruction error is calculated with
the mean squared error. If the Z-score of a record's error is higher than the threshold of
3, the record is flagged as an outlier.
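A minimal sketch of the reconstruction-error test with scikit-learn is shown below. Computing one mean squared error per record and using an ordinary Z-score on these errors are assumptions of the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_outlier_flags(X, explained=0.95, z_thresh=3.0):
    """Flag rows of X whose PCA reconstruction error is unusually large."""
    # A float n_components keeps enough components to reach that variance ratio.
    pca = PCA(n_components=explained, svd_solver="full")
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    err = np.mean((X - X_rec) ** 2, axis=1)       # one error per record
    z = (err - err.mean()) / (err.std() + 1e-12)
    return z > z_thresh
```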

Isolation Forest is an ensemble learning algorithm which provides an effective method 

for anomaly detection by isolating observations. The core principle is that outliers are 

few and different and are therefore more susceptible to isolation. The algorithm 

constructs a forest of random trees. For each tree, data is partitioned by randomly 

selecting a feature and then randomly selecting a split value between the minimum and 

maximum values of that feature. Abnormal instances, being distinct, tend to be isolated 

in shorter paths within these trees, closer to the root. The proportion of expected
outliers, called the contamination factor, is calculated according to the size of the
subgroup being analyzed, with a maximum of 10%, and the number of trees is set to 100.
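With scikit-learn, the Isolation Forest step could look like the sketch below. How the contamination factor is derived from the subgroup size is not specified in the text, so it is passed in as an argument here and only the 10% cap is applied.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_flags(X, contamination, max_contamination=0.10, seed=0):
    """Flag rows of X that the Isolation Forest labels as anomalies."""
    contamination = min(float(contamination), max_contamination)  # 10% cap
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=seed
    )
    # fit_predict returns -1 for outliers and 1 for inliers.
    return forest.fit_predict(X) == -1
```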

Since each diagnosis has different characteristics, the features are evaluated within each
diagnosis group, so that records are not eliminated merely because of the typical
characteristics of their diagnosis rather than because of an abnormality. Diagnosis groups
with too few examples for the methods to be effective are pooled and treated as a single
group. The minimum number of examples for a diagnosis to form its own group is 20.

After outliers are detected with all of the methods over all the datasets, records that
receive at least 2 out of 3 votes are eliminated from the dataset. In total, 929 records
are flagged as outliers and removed from the scored datasets. The outlier distribution per
dataset can be seen in Table 3.4.

Table 3.4 : Outlier distribution for each dataset. 

Dataset             # of Normal Records   # of Outlier Records   Outlier Percentage
Chapman-Shaoxing    10143                 102                    1.00%
CPSC 2018           6778                  93                     1.35%
CPSC 2018 Extra     3387                  65                     1.88%
Georgia (G12EC)     10235                 107                    1.03%
Ningbo              34501                 385                    1.10%
PTB-XL              21685                 141                    0.65%
Grand Total         86729                 893                    1.02%

The outliers discarded by the outlier removal process correspond to nearly 1% of all the
records. This ratio is small enough not to affect the training process.
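The majority vote over the three detection methods amounts to counting flags per record, as in this short sketch (the function name is hypothetical).

```python
import numpy as np


def majority_vote(*method_flags, min_votes=2):
    """Combine boolean flag arrays; a record is removed with >= min_votes votes."""
    votes = np.sum(np.stack(method_flags).astype(int), axis=0)
    return votes >= min_votes
```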

3.2.2 Label encoding 

As the target of the model, the diagnosis information has to be provided to the model. A
simple encoding approach turns the labels of each record into a vector of 26 elements,
representing the 26 diagnosis types. Since this is a multi-label problem, each target
vector contains one or more positive labels. The positive labels are indicated with 1 in
the target vector.
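The encoding can be sketched as a multi-hot vector built from the SNOMED-CT codes of a record. The merge map below is built from Table 3.2; which code of each pair represents the merged class is an assumption, and the class list in the usage note is truncated to four classes for brevity.

```python
import numpy as np

# Merge pairs from Table 3.2. Assumption: the second code of each pair is kept.
MERGE_MAP = {
    "733534002": "164909002",   # complete LBBB -> LBBB
    "713427006": "59118001",    # complete RBBB -> RBBB
    "427172004": "17338001",    # PVC -> ventricular premature beats
    "284470004": "63593006",    # PAC -> supraventricular premature beats
}


def encode_labels(codes, classes):
    """Return a multi-hot target vector with one element per scored class."""
    index = {c: i for i, c in enumerate(classes)}
    target = np.zeros(len(classes), dtype=np.float32)
    for code in codes:
        code = MERGE_MAP.get(code, code)
        if code in index:            # unscored diagnoses are ignored
            target[index[code]] = 1.0
    return target
```

For example, with `classes = ["164909002", "59118001", "17338001", "63593006"]`, a record labeled `["733534002", "59118001"]` is encoded as `[1, 1, 0, 0]`.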



3.3 Feature Extraction

As stated in Section 2, most studies capitalize on features extracted from the raw records
to enhance model performance. Even though the models can capture the relationships between
the parts of a record, or of any time-series sequence, providing external expert
information guides the model toward the important parts of the record and thereby
increases its performance.

For ECG records, if every time point is treated as a token, some tokens are important
indicators, corresponding to the places an expert examines to diagnose the patient. As
conveyed earlier, ECG records have distinguishable points that are common to all records.
These are the peaks and dips in the record called the P, Q, R, S, and T points. Feeding
these tokens directly to the model has been a primary method of using external
information. In this work, these tokens are detected with NeuroKit2, a feature extraction
library for ECG records (Makowski et al., 2021). The library extracts the useful features
of the ECG record, namely the peaks, onsets, and offsets of the points described above. It
is used not only for ECG but for most biosignal processing and analysis. For each record
and each channel, the following features are extracted separately: the onset, peak, and
offset of point P; the peak of point Q; the onset, peak, and offset of point R; the peak
of point S; and the onset, peak, and offset of point T. This amounts to 11 features in
total.

A feature does not necessarily correspond to a single point; it can consist of multiple
points spanning a specific time section, as seen in Figure 3.3. R-peaks are detected with
the default settings of the NeuroKit2 library, which are reported to be efficient
(Brammer, 2020). In the experiments of this study, the method detected R-peaks
successfully. The other feature points are also detected with the default arguments, which
use the discrete wavelet transform (Martínez et al., 2004). In the experiments of this
study, this method performed better than the alternatives, namely the continuous wavelet
transform and the peak-based method.



 
Figure 3.3 : External features extracted from ECG record with NeuroKit2.

These features are computed for all the channels and time tokens of the record; hence,
each of the 5000 time tokens has 11 feature values per channel. The feature vector
consists of boolean values indicating the presence or absence of a feature point. Since
these features are common to all the experimental runs, the NeuroKit2 features are
precomputed for all of the datasets and stored in sparse matrix format beforehand to
decrease the training times.
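The sparse storage of the boolean feature masks could be sketched with SciPy as follows. The row layout (one row per lead and feature pair) and the input format of `(lead, feature, sample)` triples are assumptions for illustration, not the layout used in this study.

```python
import numpy as np
from scipy import sparse


def events_to_sparse(events, n_leads=12, n_features=11, n_samples=5000):
    """Build a boolean CSR mask from (lead, feature, sample_index) triples."""
    lead, feat, t = np.asarray(events, dtype=int).reshape(-1, 3).T
    rows = lead * n_features + feat            # assumed row layout
    data = np.ones(len(rows), dtype=bool)
    shape = (n_leads * n_features, n_samples)
    return sparse.csr_matrix((data, (rows, t)), shape=shape)
```

A matrix built this way can be written to disk with `scipy.sparse.save_npz` and read back with `scipy.sparse.load_npz`, so that the delineation does not have to be repeated for every training run.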

3.4 Model Architecture

As the main objective of the study, a mixture-of-experts transformer model is used to
classify the diagnoses. The model and training methods are adapted from the standard MoE
architecture and the practices of important studies (Fedus et al., 2021; Shazeer et al.,
2017). Each part of the model is described separately below. In the general flow, the
input of the model consists of the record itself with its 12 channels (one per lead), the
11 features precomputed for each channel and time point, the demographic features, and the
CLS token. The model output is the multi-label prediction logit vector for all 26
diagnosis classes. These logits are then converted into class predictions with optimized
thresholds, which will be explained further. The preprocessing steps, feature extraction,
and overall details of the model architecture can be seen in Figure 3.4.



 
Figure 3.4 : Preprocessing pipeline and model architecture. 

3.4.1 Signal projection and early feature fusion 

First, an early fusion method is used to fuse the external features of the ECG records
directly with the records. To enrich the information from both the signal input and the
external features, multiple 1D CNN layers are included in the model. This multi-layer CNN
architecture allows the model to use different kernel sizes to learn fine-grained
information in the time domain and to increase the number of channels that carry the
information. For both the signal and the external features, the 1D CNNs are three-layer
structures. The ECG signal layers have consecutive kernel sizes of 51, 31, and 11 and
channel sizes of 32, 64, and 128. The external feature layers have consecutive kernel
sizes of 201, 51, and 21 and channel sizes of 16, 32, and 64. The kernel sizes for the
external features are kept large because the input is a sparse array that is active only
at peaks, onsets, and offsets. The channel size is larger for the signal since it
inherently carries more information. These layers use padding so that the time dimension
is not altered. Each 1D CNN layer is followed by a batch normalization layer and a leaky
Rectified Linear Unit (ReLU) activation function. The two branches are merged by
concatenating the signal information and the external features. This concatenation creates
a new signal matrix with 5000 time tokens and 192 feature dimensions, 128 from the ECG
signal branch and 64 from the extracted features.
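The shape bookkeeping of the two CNN branches can be illustrated with a NumPy-only sketch. It uses random stand-in weights, omits batch normalization and biases, and assumes that the 11 feature masks of the 12 leads are stacked into 132 input channels; none of these details are taken from the actual implementation.

```python
import numpy as np


def conv1d_same(x, w):
    """Stride-1 cross-correlation of x (C_in, T) with w (C_out, C_in, K), K odd."""
    k = w.shape[-1]
    pad = (k - 1) // 2                       # "same" padding keeps T unchanged
    xp = np.pad(x, ((0, 0), (pad, pad)))
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)
    return np.tensordot(w, windows, axes=([1, 2], [0, 2]))   # (C_out, T)


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def cnn_branch(x, kernel_sizes, channels, rng):
    """Three conv layers with leaky ReLU; weights are random placeholders."""
    for k, c_out in zip(kernel_sizes, channels):
        w = rng.normal(0.0, 0.02, size=(c_out, x.shape[0], k))
        x = leaky_relu(conv1d_same(x, w))    # batch normalization omitted
    return x


def early_fusion(signal, features, rng):
    """Concatenate both branches into a (T, 192) token matrix."""
    sig = cnn_branch(signal, (51, 31, 11), (32, 64, 128), rng)
    ext = cnn_branch(features, (201, 51, 21), (16, 32, 64), rng)
    return np.concatenate([sig, ext], axis=0).T
```

Because every kernel size is odd and the padding is (K - 1) / 2 on each side, the time dimension is preserved through all layers.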

After this concatenation, a projection layer is used to scale the fused input to the model
dimension. The projection layer takes the concatenated matrix as input and, using a linear
layer, outputs a matrix of shape model dimension by sequence length. This projection
aligns the signal features to the common model dimension used within the model.

The signal features are then concatenated along the time axis with the output of a similar
projection layer, which projects the demographic features of the patient to the same model
dimension as a single vector. Finally, a CLS token is included in the input by
concatenating it along the time axis. The input therefore has a feature dimension of 96,
equal to the model dimension, and a time dimension of 5002. The CLS token carries the
information for the classification decision of the whole model. Using a single token for
this purpose lets the model compress the information about the classification decision
into one vector instead of spreading it over all the time-sequence tokens. As a result,
all signal information and external features are fused and fed into the model.
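The assembly of the final input sequence can be sketched as below. The order of the three parts along the time axis (CLS token, demographic token, time tokens) is an assumption, biases are omitted, and all weights would be learnable parameters in the actual model.

```python
import numpy as np


def assemble_tokens(fused, demographics, w_signal, w_demo, cls_token):
    """Project inputs to the model dimension and stack them along the time axis."""
    signal_tokens = fused @ w_signal                  # (T, d_model)
    demo_token = (demographics @ w_demo)[None, :]     # (1, d_model)
    # Assumed order: CLS token, demographic token, then the time tokens.
    return np.concatenate([cls_token[None, :], demo_token, signal_tokens], axis=0)
```

With a (5000, 192) fused matrix, two demographic features, and a model dimension of 96, the result has shape (5002, 96).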

3.4.2 Transformer 

The core model architecture is similar to the original transformer architecture, with
small changes that adopt modern implementations of some components. The original model is
an encoder-decoder model, since the task of translation requires a generative output.
However, for tasks such as embedding or general feature extraction from any type of
information, using only the encoder block is sufficient. Encoder-only models of this type
include Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al.,
2018).

The transformer model is composed of 6 layers, 3 of which are standard encoder blocks and
3 of which are mixture-of-experts encoder blocks. The standard encoder blocks take the
input described above and normalize it with pre-layer normalization. Pre-layer
normalization applies the layer normalization before the self-attention block, instead of
after it as in the original work, which is a common practice for improving the performance
and stability of training (Xiong et al., 2020). The layer normalization follows the
implementation of Ba, Kiros and Hinton (2016). Detailed depictions of the methods used can
be seen in Figure 3.5.

After normalization, the input is forwarded to a multi-head self-attention layer, where
each token attends to the other tokens and the amount of information flowing to each token
is computed from query, key, and value vectors. Each token first computes its attention
score with respect to every token by taking the dot product of its own query vector with
every key in the sequence, including its own. This computation yields a vector of
attention logits, which is then scaled by dividing by the square root of the dimension of
the key vector. This scaling stabilizes the gradients during training. A softmax function
then transforms the attention scores into probability values. These probabilities
represent how much each token attends to the other tokens in the time sequence, which is
important information for understanding the relationships in the data.

 
Figure 3.5 : Details of the transformer architecture, inputs and outputs. 

In this model, the two extra tokens, the demographic feature token and the CLS token, also
attend to every time token and are attended to by them. This information flow is used
within the model to update the classification prediction and to inject the demographic
feature information.

The original transformer uses fixed sinusoidal absolute positional embeddings to encode
the positions of the tokens. Later, different authors suggested and implemented variations
of positional embeddings to represent positions more stably, as absolute embeddings are
considered insufficient in long-context scenarios. Rotary Position Embedding (RoPE), which
has been adopted as the new standard, is used here to embed the positional information.
This method rotates the query and key vectors by position-dependent angles before the
attention scores are calculated. The rotation encodes the position of the tokens relative
to each other without introducing learnable parameters. Unlike absolute positional
embeddings, the RoPE method does not degrade in long contexts and is periodic by
construction, which is a potential benefit in this task with a high number of tokens and a
periodic characteristic.

After the positional information is embedded and the attention scores are calculated, a
softmax function is applied to these scores to obtain the attention probabilities. These
probabilities are then matrix-multiplied with the value vectors, which results in a
weighted sum of the value vectors for each token.

The mathematical operations of RoPE and self-attention are shown in Equation 3.5, where
Q_h' and K_h' are the query and key matrices after applying RoPE and V_h is the value
matrix, which is not altered. The head dimension d_k is the model dimension divided by the
number of heads. The RoPE implementation follows the study of Su et al. (2023).

\[ \mathrm{Head}_h = \operatorname{softmax}\left(\frac{Q_h' \, (K_h')^{T}}{\sqrt{d_k}}\right) V_h \tag{3.5} \]

This operation is performed in multiple parallel branches called heads, which allow the
model to learn information in multiple representation spaces. Each head has its own
learnable linear projection matrices for computing the query, key, and value vectors, so
the heads carry information in different ways and attend to different parts of the time
sequence.
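A single attention head with RoPE, as in Equation 3.5, can be sketched in NumPy as follows. The sketch pairs the first and second halves of the feature vector for rotation, which is one common RoPE variant and may differ from the pairing used in this study; the base of 10000 is the value commonly used with RoPE and is likewise an assumption.

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate feature pairs of x (T, d) by position-dependent angles; d must be even."""
    n_tokens, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)           # one frequency per pair
    angles = np.arange(n_tokens)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)                  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention_head(q, k, v):
    """Scaled dot-product attention with RoPE applied to queries and keys only."""
    d_k = q.shape[-1]
    scores = rope(q) @ rope(k).T / np.sqrt(d_k)
    return softmax(scores) @ v
```

Since the rotation preserves vector norms and the dot product of two rotated vectors depends only on the difference of their positions, the attention scores encode relative rather than absolute position.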

After the multi-head self-attention block, a residual connection adds the input of the
block to its output, carrying the original input information forward unchanged. This helps
the model retain access to the original input. After the residual connection, another
layer normalization is applied, followed by a feed-forward neural network (FFN), a
two-layer network that maps from the model dimension to a hidden dimension and back to the
model dimension. The hidden dimension, as is common practice, is 4 times the model
dimension. A leaky ReLU activation is applied after the first layer. A second residual
connection is then formed from the output of the first addition step. This series of
operations forms a block called an encoder. There are 3 consecutive encoder blocks.
Stacking these blocks increases the amount of computation performed and the number of
parameters of the model, and hence helps store more information in the model.
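The order of operations in a pre-layer-normalization encoder block can be sketched as below. For brevity the sketch uses a single attention head without RoPE and omits dropout, biases, and the learnable scale and shift of layer normalization; the parameter dictionary is a hypothetical container.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def self_attention(x, wq, wk, wv, wo):
    """Single-head scaled dot-product self-attention with output projection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ wo


def encoder_block(x, p):
    """Pre-LN block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"])
    hidden = leaky_relu(layer_norm(x) @ p["w1"])   # model dim -> 4 x model dim
    return x + hidden @ p["w2"]                    # back to model dim
```

If the output projections `wo` and `w2` are zero, the block returns its input unchanged, which illustrates how the residual connections carry the original information forward.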

 

3.4.3 Mixture-of-experts

For the remaining 3 encoder blocks, a special encoder block is used that has the same
architecture but replaces the FFN with a different module, called the mixture-of-experts
module. An MoE module is composed of several experts and a router that routes the incoming
tokens to one or more experts. The MoE approach has been used in transformer architectures
and found to be beneficial for performance in language models. The layer normalizations,
residual connections, and self-attention blocks are the same as in the other encoder
blocks.

 
Figure 3.6 : Internal structure of mixture-of-experts module. 

In the MoE module, the incoming tokens are first routed by a router gate. The router uses
a simple learnable linear layer that projects from the model dimension to the number of
experts to calculate the router logits. For every token, the top-k experts with the
highest router logits are then marked, and dispatch tensors are created from these
markings to route the tokens to the experts. Each dispatch tensor acts as a mask that
ensures the correct assignment of tokens to experts. The MoE module is depicted in Figure
3.6. In this study, top-1 routing is used; however, other routing options are also tested.
For top-k routing options other than top-1, the router logits normalized with the softmax
function are used to form the dispatch tensor, which then acts as the gating weights of
each expert for each token. Each expert is a simple FFN with the same dimensions as in the
standard encoder blocks. Intuitively, the experts act as independent FFNs that are
sparsely activated, which gives the model a dropout-like structure across the different
FFN modules.
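The routing logic can be sketched in NumPy as follows. The experts are passed in as plain callables, the gating weights for k > 1 are the softmax probabilities of the selected experts without renormalization, and all function names are hypothetical; these are assumptions of the sketch rather than details of the actual implementation.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def route_top_k(tokens, w_router, k=1):
    """Return router logits, the boolean dispatch mask, and the gating weights."""
    logits = tokens @ w_router                       # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :k]        # indices of the top-k experts
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    if k == 1:
        gates = mask.astype(float)                   # top-1: plain dispatch mask
    else:
        gates = np.where(mask, softmax(logits), 0.0)
    return logits, mask, gates


def moe_forward(tokens, w_router, experts, k=1):
    """Send each token to its selected experts and sum the weighted outputs."""
    _, mask, gates = route_top_k(tokens, w_router, k)
    out = np.zeros_like(tokens, dtype=float)
    for e, expert in enumerate(experts):
        idx = np.nonzero(mask[:, e])[0]              # tokens routed to expert e
        if idx.size:
            out[idx] += gates[idx, e:e + 1] * expert(tokens[idx])
    return out
```

Only the experts selected for a token are evaluated on it, which is what makes the module sparsely activated.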



To ensure that tokens are routed evenly across the experts, additional loss components,
called auxiliary losses, are included in the loss function. There are two such components:
the z-loss and the load balancing loss.

The load balancing loss is responsible for creating an equally distributed load
across all experts. During training, the distribution of tokens over the experts in
each batch is compared with a uniform distribution, and the Kullback-Leibler (KL)
divergence between the two is calculated. The greater the discrepancy between the
uniform distribution and the model's expert distribution, the greater this loss
becomes. This loss component was first introduced by Shazeer et al. (2017); however,
unlike their work, this work uses the KL divergence instead of the coefficient of
variation.
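A KL-based balancing term of this kind can be sketched as follows. Several details are assumptions made only for illustration, since the text does not fix them: the expert distribution is taken to be the mean softmax routing probability per expert over the batch, the target is an equal split across experts, and the divergence is computed in the direction KL(expert distribution || target). The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_kl(router_logits):
    """KL divergence between the batch's expert distribution and an equal split.

    router_logits: one list per token, holding one logit per expert.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    # Assumption: the expert distribution is the mean routing probability.
    mean_probs = [0.0] * num_experts
    for logits in router_logits:
        for e, p in enumerate(softmax(logits)):
            mean_probs[e] += p / num_tokens
    target = 1.0 / num_experts
    # Terms with zero probability contribute nothing to the sum.
    return sum(p * math.log(p / target) for p in mean_probs if p > 0.0)
```

The value is zero when every expert receives the same share and approaches the logarithm of the number of experts when all tokens go to a single expert.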

The other loss component is the z-loss, introduced by Zoph et al. (2022) to
regularize the routing logits. The z-loss is calculated by taking, for each token,
the logarithm of the sum of the exponentials of its router logits, squaring this
value, and averaging the result over the tokens. This component acts as a
regularization loss that penalizes large logits and keeps the router logits small.
This helps prevent the model from assigning an extreme routing probability to a
single expert and causing expert collapse.
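As a rough illustration, the router z-loss in the formulation of Zoph et al. (2022) is the squared log-sum-exp of each token's router logits, averaged over the tokens. The sketch below follows that formulation in plain Python; the function name `router_z_loss` is hypothetical and the code is not taken from this study's implementation.

```python
import math


def router_z_loss(router_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits.

    router_logits: one list per token, holding one logit per expert.
    """
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        # Subtracting the maximum keeps the exponentials from overflowing.
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp ** 2
    return total / len(router_logits)
```

Because the log-sum-exp grows with the largest logit, the penalty increases quickly when the router pushes any logit to a large value.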

These two loss components are added to the model loss, each multiplied by its own
coefficient. The coefficients are tuned so that the model does not depend
excessively on the routing losses and stays focused on the classification loss,
while the token distribution among the experts still remains close to equal.
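The weighted combination described above can be written compactly as follows. The symbols are introduced here only for illustration and are not taken from the original text: \mathcal{L}_{\text{cls}} denotes the classification loss, \mathcal{L}_{\text{lb}} the load balancing loss, \mathcal{L}_{z} the z-loss, and \lambda_{\text{lb}} and \lambda_{z} the two tuned coefficients.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{cls}}
  + \lambda_{\text{lb}} \, \mathcal{L}_{\text{lb}}
  + \lambda_{z} \, \mathcal{L}_{z}
```

Keeping both coefficients small relative to the classification term is one way to express the goal that the routing losses guide the router without dominating training.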

Dropout layers are used throughout the model to provide regularization and prevent
overfitting (Srivastava et al., 2014). Dropout is applied after each self-attention
layer, after each FFN layer and to the input of each encoder block.

Three consecutive MoE encoder blocks are stacked, and the output of the last encoder
block is taken as the output of the transformer block. The transformer block
produces an output embedding whose shape is the number of time tokens plus two by
the model dimension. The last element along the first dimension is the CLS token,
which represents the classification result. A linear layer maps this token from the
model dimension to the number of classes. The logits produced by this linear layer
are the final predictions of the model for each class. They are used to calculate
the classification loss by comparing them with the original diagnosis values.
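The classification head described above can be sketched in a few lines of plain Python. The function names and the explicit weight and bias arguments are hypothetical. The per-class sigmoid is an assumption that is common in multi-label settings and is shown only to illustrate how logits could be turned into independent class probabilities.

```python
import math


def classify_from_cls(encoder_output, weight, bias):
    """Map the CLS token (last row of the encoder output) to class logits.

    encoder_output: [num_tokens][model_dim], with the CLS token in the last row.
    weight: [num_classes][model_dim]; bias: [num_classes].
    """
    cls_token = encoder_output[-1]
    return [
        sum(w * x for w, x in zip(row, cls_token)) + b
        for row, b in zip(weight, bias)
    ]


def sigmoid(logit):
    """Assumption: each class is scored independently (multi-label)."""
    return 1.0 / (1.0 + math.exp(-logit))
```

For example, with a model dimension of 2 and 3 classes, `classify_from_cls` returns one logit per class, and applying `sigmoid` to each logit gives a probability for each diagnosis independently of the others.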



In total, the model has 1,642,046 parameters; however, for any MoE model, the number
of active parameters depends on how many experts are activated for each token. Since
only a single expert per MoE module is active in the inference phase, an active
parameter count is also reported. The common layers, such as the 1D CNN layers and
the normal transformer encoder blocks, are dense; only the MoE expert layers are
sparse. In the inference phase, the model has 973,070 active parameters, giving a
parameter efficiency (active parameters divided by total parameters) of 59.3% for
the whole model.
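The reported ratio can be checked with simple arithmetic. The snippet below only uses the two parameter counts stated above; the constant and function names are hypothetical.

```python
TOTAL_PARAMS = 1_642_046   # all parameters, including every expert
ACTIVE_PARAMS = 973_070    # parameters used per token with top-1 routing


def active_fraction(active, total):
    """Share of parameters that take part in a single forward pass."""
    return active / total


# Parameters that sit in experts not selected for a given token.
INACTIVE_PARAMS = TOTAL_PARAMS - ACTIVE_PARAMS
PERCENT_ACTIVE = round(100 * active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS), 1)
```

The difference between the two counts, 668,976 parameters, corresponds to the expert weights that are not selected by the router for a given token.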

 Training 

Training of the networks that are large in terms of number of parameters requires great 

attention to stability issues while making sure the model is learni