Time series classification via topological data analysis

Date
2022-06-22
Authors
Karan, Alperen
Publisher
Graduate School
Abstract
This dissertation aims to demonstrate the power of Topological Data Analysis (TDA) and the subwindowing method for feature engineering in time series classification tasks. As an application, we use two publicly available datasets, WESAD and DriveDB, which consist of physiological signals collected under stressful and non-stressful conditions. Furthermore, to assess the reliability of our methodology, we test our feature engineering methods on a synthetic dataset of artificial physiological signals mimicking a stress detection study. The results indicate that automatically created topological features can yield higher classification accuracies than signal-specific, hand-crafted features (such as heart rate derived from an ECG signal).
In the first chapter, we briefly summarize TDA and persistent homology. We also discuss methods for time series classification via persistent homology and review the literature on the subject.
The second chapter is devoted to time series and how they can be classified. We first define the method of sliding windows and discuss why it can be useful in machine learning tasks. Then we turn to time delay embeddings, which transform a univariate time series into a high-dimensional dataset. We illustrate how the topology of the resulting dataset is affected by the embedding parameters (the delay and the embedding dimension). At the end of this chapter, we introduce the subwindowing methodology, which addresses the main problem of this work. We show that this method allows us to reduce noise, substantially reduce computation time, and use longer windows without incurring extra computational cost.
In the third chapter, we give the theoretical background for TDA and persistent homology. The chapter starts by discussing why data should be viewed at different scales. Then we give preliminary definitions related to simplices and simplicial complexes.
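The sliding window and time delay embedding steps summarized above for the second chapter can be sketched as follows. This is a minimal illustration in plain Python, not the dissertation's implementation: the function names `sliding_windows` and `delay_embedding` and the parameters `window_size`, `step`, `dimension` and `delay` are hypothetical, and the subwindowing method itself is not reproduced because the abstract does not specify it.

```python
import math


def sliding_windows(series, window_size, step):
    """Split a series into fixed-length windows; they overlap when step < window_size."""
    last_start = len(series) - window_size
    return [series[start:start + window_size]
            for start in range(0, last_start + 1, step)]


def delay_embedding(series, dimension, delay):
    """Map a univariate series to points
    (x[t], x[t + delay], ..., x[t + (dimension - 1) * delay])."""
    span = (dimension - 1) * delay  # samples consumed beyond the first coordinate
    return [tuple(series[t + k * delay] for k in range(dimension))
            for t in range(len(series) - span)]


# Illustrative input (an assumption): one period of a sine wave, 40 samples.
signal = [math.sin(2 * math.pi * t / 40) for t in range(40)]

windows = sliding_windows(signal, window_size=20, step=10)  # 3 windows of 20 samples
points = delay_embedding(signal, dimension=2, delay=10)     # 30 points in the plane
```

With a delay of a quarter period, the embedded points of a sine wave lie on a circle, which is the kind of topological structure that persistent homology is designed to detect.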
We state the Nerve theorem and discuss how a topological space and a simplicial complex can be homotopy equivalent under certain assumptions. This theorem tells us that the Čech (and therefore Rips) complexes are topologically similar to the underlying object from which the dataset was sampled. Later in this chapter, we define simplicial homology and show how to compute the homology of a simplicial complex. Building a simplicial complex on top of a dataset requires a fixed distance parameter (epsilon). Persistent homology, by contrast, allows us to investigate how homology groups persist as epsilon varies. After presenting how persistent homology works, we define persistence diagrams and two widely used metrics between them. We also show that persistence diagrams are stable under small perturbations of the data. Lastly, we present some ways of engineering features from persistence diagrams.
The fourth chapter describes the datasets used in this dissertation and our methodologies. First, we introduce the three datasets (synthetic, WESAD and DriveDB). The synthetic dataset has two classes of physiological signals: stress and non-stress. The classes for WESAD are baseline, amusement and stress. For DriveDB, the classes are relax, driving on the highway (low stress) and driving in the city (high stress). We then describe the physiological signals included in the datasets, their sampling frequencies, and the preprocessing we applied. Our experiments have several parameters, such as the window size, the subwindow size, and the embedding dimension in time delay embeddings. Later in this chapter, we discuss how these parameters were chosen and how we performed feature engineering for our experiments. Then we present the machine learning algorithms used in our experiments and their hyperparameters. Lastly, we introduce the two cross-validation methodologies used in our experiments.
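The idea of tracking homology as a scale parameter varies, summarized above for the third chapter, can be illustrated in its simplest setting: 0-dimensional persistence of the sublevel sets of a univariate series. The sketch below is an assumption-laden illustration in plain Python, not the dissertation's pipeline. The function names `sublevel_persistence` and `lifetime_features` are hypothetical, and the summary statistics are generic examples rather than the features used in the thesis.

```python
import math


def sublevel_persistence(series):
    """Return 0-dimensional (birth, death) pairs of the sublevel-set filtration.

    Samples are added in increasing order of value. A new component is born at
    each local minimum; when two components meet, the younger one (higher
    birth value) dies. Zero-length pairs are dropped.
    """
    n = len(series)
    parent = [-1] * n   # -1 marks samples not yet added to the filtration
    birth = [0.0] * n   # birth value stored at each component root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: series[k]):
        parent[i] = i
        birth[i] = series[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                older, younger = find(i), find(j)
                if older == younger:
                    continue
                if birth[older] > birth[younger]:
                    older, younger = younger, older
                if birth[younger] < series[i]:
                    pairs.append((birth[younger], series[i]))
                parent[younger] = older
    if n:
        pairs.append((min(series), math.inf))  # the global minimum never dies
    return sorted(pairs)


def lifetime_features(pairs):
    """Simple summary statistics of the finite lifetimes (death - birth)."""
    lifetimes = [d - b for b, d in pairs if d != math.inf]
    if not lifetimes:
        return {"count": 0, "total": 0.0, "longest": 0.0}
    return {"count": len(lifetimes), "total": sum(lifetimes), "longest": max(lifetimes)}


diagram = sublevel_persistence([0, 2, 1, 3, 0.5, 4])
# diagram == [(0, inf), (0.5, 3), (1, 2)]
features = lifetime_features(diagram)
# features == {"count": 2, "total": 3.5, "longest": 2.5}
```

Each finite pair records a local minimum and the level at which its basin merges into a deeper one, so small lifetimes correspond to small oscillations in the signal.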
In leave-one-subject-out cross-validation (LOSOCV), the model is trained on all subjects but one and tested on the held-out subject. Once each participant has appeared in the test set exactly once, the results are averaged. This technique measures the model's performance on a previously unseen subject. In intra-subject cross-validation, we split each subject's data into two halves, train on one half and test on the other, then swap the roles of the halves and average the two results. The final accuracy is the average of the accuracies obtained for each participant. This method shows whether the model benefits from having the same subject's data in both the training and test sets.
The results of the experiments are covered in the fifth chapter. We present the results for the synthetic, WESAD and DriveDB datasets, in that order. The results for the synthetic dataset indicate that as the magnitude of the physiological change that mimics stress increases, stress detection accuracy also improves. For example, when heart rate variability, an important stress indicator, is raised, the topological features detect the change almost perfectly. The results imply that stress detection errors in real-world datasets can be attributed to the noisy nature of the data itself rather than to the topological features. For instance, such errors can arise when some participants do not react to the stress condition. When we examined the results from the real datasets, we usually observed the highest affect recognition accuracies when features from all persistence diagrams (level sets and delay embeddings) were used. Nevertheless, using only one persistence diagram (and therefore far fewer features), we achieved similar recognition performance. This tells us that high accuracies are attainable using a small number of automatically engineered topological features rather than hand-crafted, signal-specific features.
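The two cross-validation schemes described above can be sketched as index splits over a list of per-sample subject labels. This is a minimal illustration in plain Python under stated assumptions: the function names `loso_splits` and `intra_subject_splits` are hypothetical, and splitting each subject's samples into a first and second half in their given order is an assumption, since the abstract does not say how the halves are formed.

```python
def loso_splits(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject at a time."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def intra_subject_splits(subject_ids):
    """Yield (subject, train_idx, test_idx) twice per subject, once per half.

    Assumption: halves are taken in the order the samples appear.
    """
    for subject in sorted(set(subject_ids)):
        idx = [i for i, s in enumerate(subject_ids) if s == subject]
        half = len(idx) // 2
        first, second = idx[:half], idx[half:]
        yield subject, first, second   # train on first half, test on second
        yield subject, second, first   # then swap the roles


# Illustrative labels (an assumption): six samples from three subjects.
subjects = ["a", "a", "b", "b", "b", "c"]
loso = list(loso_splits(subjects))
# loso[0] == ("a", [2, 3, 4, 5], [0, 1])
```

In either scheme, a classifier would be fitted on the training indices and scored on the test indices of each split, and the per-split accuracies averaged as the text describes.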
For the three-class tasks, we observed that stress conditions are well separated from the other conditions. This result supports the hypothesis that topological features work well in distinguishing chaotic time series from non-chaotic ones. When we recast the problem as a binary classification task (stress vs. non-stress), topological features again performed better than those used in the original studies for most of the physiological signals. As stated above, an important advantage of the subwindowing method is that the window size can be changed efficiently. When we tested different window sizes, we observed that longer windows led to better stress detection performance. Furthermore, model performance with intra-subject cross-validation was significantly higher than with LOSOCV. This was an expected finding, since a model can perform better on the test set when data from the same subject appears in the training set.
Finally, in the sixth chapter, we summarize our methodologies and their limitations. We also discuss directions for future work. For example, future studies can assess the performance of a model trained on one dataset and tested on another. Later research can also use semi-supervised (rather than supervised) learning to further improve accuracy. Lastly, one can use other vector representations of persistence diagrams for feature engineering.
Description
Thesis (Ph.D.) -- Istanbul Technical University, Graduate School, 2022
Keywords
topological data analysis, time series, mathematical analysis