Topological data analysis and clustering algorithms in machine learning

thumbnail.default.placeholder
Tarih
2023-03-13
Yazarlar
Güzel, İsmail
Süreli Yayın başlığı
Süreli Yayın ISSN
Cilt Başlığı
Yayınevi
Graduate School
Özet
In this dissertation, we define a new non-Archimedean metric (a.k.a. an ultra-metric) called cophenetic metric on persistent homology classes of all degrees using only homological information. Then, based on numerical experiments on different datasets, we statistically verify that the topological information coming from the zeroth persistent homology with our cophenetic metric is consistent with the information provided by different hierarchical clustering algorithms using different metrics. We also observe that the clusters we obtained via the cophenetic metric do yield competitive silhouette scores and the Rand indices in comparison with clusters obtained from other metrics. The homological information about a filtered simplicial complex over the poset of positive real numbers is often presented by a barcode which depicts the evolution of the associated Betti numbers. However, there is wonderfully complex combinatorics associated with the homology classes of a filtered complex, and one can do more than just count them over the index poset. In this thesis, we show that this combinatorial information can be encoded by a filtered matroid, or even better, by rooted forests. We also show that these rooted forests can be realized as cobordisms. The subject of this thesis is presented in Chapter 1, in which we also provide a survey of the literature and a brief introduction to topological data analysis (TDA). In Chapter 2, we summarize the fundamental concepts from topological and metric spaces that we will need in subsequent chapters. In Chapter 3, we will go more deeply into the mathematical foundations of clustering techniques. Clustering algorithms, especially hierarchical clustering algorithms, play a key role in this dissertation. Comparisons of various clustering strategies also play important roles in this thesis. For this reason, we also survey numerical metrics to assess such clustering strategies in the same chapter. We also work out a complete example on a collection of Turkish cities and the distances between them, and compare the resulting clusters. The TDA calculations we make in this thesis heavily use simplicial and chain complexes, and their homologies. In all examples, we first calculate the homology of a filtered simplicial complexes in order to calculate its persistent homology. The primary computation procedure begins by distilling a point cloud into a filtered simplicial complex, followed by calculating the homology of the resulting filtered differential graded complex. Chapters 4 and 5 provide all the necessary background information on simplicial complexes, differential graded complexes, and homologies. We also present complete worked-out examples of calculating homology classes for a collection of simple simplicial complexes. To capture the exact topological features of the point cloud, the main challenge is to find an optimal proximity parameter for the simplicial technology. Persistent homology is developed precisely to deal with this issue. It keeps track of how long each topological feature of the supplied data endures as the proximity parameter varies. It produces multisets of intervals represented as barcodes. We cover these concepts in Chapter 6. Chapter 7 is devoted to developing a rigorous theoretical foundation for filtered simplicial and chain complexes. Along with developing such a foundation for filtered complexes, we also discovered an exciting combinatorial representation for homologies of filtered complexes in the form of filtered matroids, especially by employing dendrograms labeled by circuits in a matroid. In Chapter 8, we defined a new non-archimedean metric called homological cophenetic distance on homology classes which is the main contribution of this thesis to literature. In Chapter 9 we are concerned with developing another presentation of the homological information coming from a filtered complex using cobordisms of punctured spheres, which are themselves punctured higher dimensional spheres. In Chapter 10, we give a detailed account of how we used the cophenetic distance in our numerical experiments. In the first experiment, we compared the standard zeroth persistent homology representations (barcodes), the dendrogram we derived from the cophenetic distance, and the dendrogram coming from the hierarchical clustering algorithm with the Euclidean metric on a small synthetic dataset embedded in $\mathbb{R}^2$. In the second experiment, we studied the geographic coordinates of a small sample of Turkish cities. By following the same statistical comparison methodology as in the first experiment, we compare the dendrograms produced by cophenetic metrics on the zeroth homology and the dendrograms produced by hierarchical clustering algorithms using a variety of distance measurements in order to assess the validity of our study. In the third and last experiment, we applied the hierarchical clustering algorithms to various datasets (the dataset of Turkish cities, the Iris dataset, the Cancer Coimbra dataset, and two synthetic datasets) with different metrics including our cophenetic metric, and we statistically analyzed the clustering results. Finally, in Chapter11, we summarize and analyze the results we obtain, and the constraints of our approach such as need for high computational power and high memory requirements. We also discuss potential future research directions one can follow to extend this thesis such as designing suitable applications with real-world datasets, visualizing cobordisms, or developing filtered combinatorial simplicial complexes on a categorical dataset.
Açıklama
Thesis(Ph.D.) -- Istanbul Technical University, Graduate School, 2023
Anahtar kelimeler
algebraic topology, cebirsel topoloji, hierarchical clustering, hiyerarşik kümeleme, machine learning, makine öğrenmesi
Alıntı