Topological data analysis and clustering algorithms in machine learning

dc.contributor.advisor Kaygun, Atabey
dc.contributor.author Güzel, İsmail
dc.contributor.authorID 509182203
dc.contributor.department Mathematical Engineering
dc.date.accessioned 2024-01-26T11:13:23Z
dc.date.available 2024-01-26T11:13:23Z
dc.date.issued 2023-03-13
dc.description Thesis(Ph.D.) -- Istanbul Technical University, Graduate School, 2023
dc.description.abstract In this dissertation, we define a new non-Archimedean metric (a.k.a. an ultra-metric) called cophenetic metric on persistent homology classes of all degrees using only homological information. Then, based on numerical experiments on different datasets, we statistically verify that the topological information coming from the zeroth persistent homology with our cophenetic metric is consistent with the information provided by different hierarchical clustering algorithms using different metrics. We also observe that the clusters we obtained via the cophenetic metric do yield competitive silhouette scores and the Rand indices in comparison with clusters obtained from other metrics. The homological information about a filtered simplicial complex over the poset of positive real numbers is often presented by a barcode which depicts the evolution of the associated Betti numbers. However, there is wonderfully complex combinatorics associated with the homology classes of a filtered complex, and one can do more than just count them over the index poset. In this thesis, we show that this combinatorial information can be encoded by a filtered matroid, or even better, by rooted forests. We also show that these rooted forests can be realized as cobordisms. The subject of this thesis is presented in Chapter 1, in which we also provide a survey of the literature and a brief introduction to topological data analysis (TDA). In Chapter 2, we summarize the fundamental concepts from topological and metric spaces that we will need in subsequent chapters. In Chapter 3, we will go more deeply into the mathematical foundations of clustering techniques. Clustering algorithms, especially hierarchical clustering algorithms, play a key role in this dissertation. Comparisons of various clustering strategies also play important roles in this thesis. For this reason, we also survey numerical metrics to assess such clustering strategies in the same chapter. We also work out a complete example on a collection of Turkish cities and the distances between them, and compare the resulting clusters. The TDA calculations we make in this thesis heavily use simplicial and chain complexes, and their homologies. In all examples, we first calculate the homology of a filtered simplicial complexes in order to calculate its persistent homology. The primary computation procedure begins by distilling a point cloud into a filtered simplicial complex, followed by calculating the homology of the resulting filtered differential graded complex. Chapters 4 and 5 provide all the necessary background information on simplicial complexes, differential graded complexes, and homologies. We also present complete worked-out examples of calculating homology classes for a collection of simple simplicial complexes. To capture the exact topological features of the point cloud, the main challenge is to find an optimal proximity parameter for the simplicial technology. Persistent homology is developed precisely to deal with this issue. It keeps track of how long each topological feature of the supplied data endures as the proximity parameter varies. It produces multisets of intervals represented as barcodes. We cover these concepts in Chapter 6. Chapter 7 is devoted to developing a rigorous theoretical foundation for filtered simplicial and chain complexes. Along with developing such a foundation for filtered complexes, we also discovered an exciting combinatorial representation for homologies of filtered complexes in the form of filtered matroids, especially by employing dendrograms labeled by circuits in a matroid. In Chapter 8, we defined a new non-archimedean metric called homological cophenetic distance on homology classes which is the main contribution of this thesis to literature. In Chapter 9 we are concerned with developing another presentation of the homological information coming from a filtered complex using cobordisms of punctured spheres, which are themselves punctured higher dimensional spheres. In Chapter 10, we give a detailed account of how we used the cophenetic distance in our numerical experiments. In the first experiment, we compared the standard zeroth persistent homology representations (barcodes), the dendrogram we derived from the cophenetic distance, and the dendrogram coming from the hierarchical clustering algorithm with the Euclidean metric on a small synthetic dataset embedded in $\mathbb{R}^2$. In the second experiment, we studied the geographic coordinates of a small sample of Turkish cities. By following the same statistical comparison methodology as in the first experiment, we compare the dendrograms produced by cophenetic metrics on the zeroth homology and the dendrograms produced by hierarchical clustering algorithms using a variety of distance measurements in order to assess the validity of our study. In the third and last experiment, we applied the hierarchical clustering algorithms to various datasets (the dataset of Turkish cities, the Iris dataset, the Cancer Coimbra dataset, and two synthetic datasets) with different metrics including our cophenetic metric, and we statistically analyzed the clustering results. Finally, in Chapter11, we summarize and analyze the results we obtain, and the constraints of our approach such as need for high computational power and high memory requirements. We also discuss potential future research directions one can follow to extend this thesis such as designing suitable applications with real-world datasets, visualizing cobordisms, or developing filtered combinatorial simplicial complexes on a categorical dataset.
dc.description.degree Ph. D.
dc.identifier.uri http://hdl.handle.net/11527/24462
dc.language.iso en_US
dc.publisher Graduate School
dc.sdg.type Goal 4: Quality Education
dc.subject algebraic topology
dc.subject cebirsel topoloji
dc.subject hierarchical clustering
dc.subject hiyerarşik kümeleme
dc.subject machine learning
dc.subject makine öğrenmesi
dc.title Topological data analysis and clustering algorithms in machine learning
dc.title.alternative Topolojik veri analizi ve makine öğreniminde kümeleme algoritmaları
dc.type Doctoral Thesis
Dosyalar
Lisanslı seri
Şimdi gösteriliyor 1 - 1 / 1
thumbnail.default.placeholder
Ad:
license.txt
Boyut:
1.58 KB
Format:
Item-specific license agreed upon to submission
Açıklama