Exploiting clustering patterns in training sets to improve classification performance of fully connected layers

Date
2023-04-06
Authors
Kalaycı, Tolga Ahmet
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
Fully connected layers are used in almost all neural network architectures, from multilayer perceptrons to deep neural networks. These layers allow any kind of interaction between features without making any assumption about the structure of the data. Thanks to this property, with sufficient complexity, fully connected layers are expected to learn any kind of pattern. Practical experience has revealed, however, that this theoretical potential is often not realized. The success of convolutional and recurrent layers, together with the findings of many studies, has shown that the intrinsic structure of a dataset holds great potential to improve classification performance. These layers exploit the inductive bias arising from the spatial or sequential structure of specific data types such as text, images, and video. Leveraging clustering to explore and exploit this intrinsic structure in classification problems has also been the subject of various studies. This potential led this study to search for a way to incorporate the clustering information of a training set, as a kind of inductive bias, into the working mechanism of fully connected layers. In this thesis, two methods are proposed. Both aim to improve the classification performance of fully connected layers by feeding them prior information about the clustering structure embedded in the training dataset. The first is a regularization method that focuses on improving classification results in the case of high variance; the second concentrates on making better predictions in the case of high bias. Throughout the study, it was ensured that the proposed methods are applicable regardless of the type of problem being studied and the number of fully connected layers in the architecture. The first method incorporates the clustering information of a training set into a fully connected layer's nodes without incurring much additional computational cost.
It rests on clustering the observations before the training phase and then allocating specific nodes of the fully connected layer to one of these clusters during training. The inspiration for this method was dropout, a widely adopted stochastic regularization technique for neural networks. Dropout uses a completely randomized binary matrix to shut down some of the hidden nodes at random during training iterations. The idea of using a similar matrix to feed in information about the different clusters in the training set is the starting point of the proposed solution. This matrix, of course, has a structured rather than a randomized form and is obtained by an unsupervised clustering algorithm applied before the training phase. For this unsupervised phase, the K-Means and Fuzzy C-Means clustering algorithms were tried separately, and their results were compared to the dropout technique as well as to each other. The output matrix of the unsupervised phase is called the "Cluster Info Matrix" throughout the thesis. It is essential to note that the fuzzy cluster info matrix always rescales the activations in line with the magnitude of the related degree of membership, whereas the K-Means cluster info matrix leaves some of the activations unchanged and sets the rest to zero. The difference between the ways the fuzzy and K-Means cluster info matrices manipulate activations resembles the difference between the L1 and L2 regularization techniques. It is reasonable to propose that L1 regularization, which forces the weights of less important variables to zero, behaves like the K-Means cluster info matrix, whereas L2 regularization, which tends to diminish the magnitudes of the weights, behaves like the fuzzy cluster info matrix. In the experimental part for the first proposed method, due to the imbalanced structure of the dataset, a threshold-free performance metric, the Area Under the Curve (AUC), was defined as the target metric.
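The masking behavior described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the round-robin assignment of hidden nodes to clusters is a hypothetical allocation scheme, since the abstract does not specify how nodes are allocated.

```python
import numpy as np

def kmeans_cluster_info_matrix(cluster_labels, n_hidden, n_clusters):
    # Hypothetical allocation: hidden nodes are assigned to clusters
    # round-robin (node i -> cluster i mod n_clusters).
    node_cluster = np.arange(n_hidden) % n_clusters
    # Binary mask (batch x n_hidden): a node stays active only for
    # observations belonging to its allocated cluster, acting like a
    # structured dropout mask.
    return (cluster_labels[:, None] == node_cluster[None, :]).astype(float)

def fuzzy_cluster_info_matrix(memberships, n_hidden):
    # Soft mask: each node's activation is rescaled by the observation's
    # degree of membership in the node's allocated cluster.
    n_clusters = memberships.shape[1]
    node_cluster = np.arange(n_hidden) % n_clusters
    return memberships[:, node_cluster]

# Toy batch: 4 observations, 3 clusters, 6 hidden nodes.
labels = np.array([0, 1, 2, 0])
hard_mask = kmeans_cluster_info_matrix(labels, n_hidden=6, n_clusters=3)

memberships = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.1, 0.2, 0.7],
                        [0.6, 0.3, 0.1]])
soft_mask = fuzzy_cluster_info_matrix(memberships, n_hidden=6)

activations = np.ones((4, 6))
masked_hard = activations * hard_mask  # zeros out off-cluster nodes
masked_soft = activations * soft_mask  # rescales every activation
```

The hard mask mirrors the L1-like behavior (some activations untouched, the rest zeroed), while the soft mask mirrors the L2-like behavior (all activations shrunk in proportion to the membership degree).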
The experiments on the K-Means version of the first proposed method show that, even at very low significance levels, the proposed method gives statistically significantly higher AUC values on the test set compared to dropout for all architectures covered in the experiments. The question of whether these improvements really result from the node-to-cluster allocation logic, or whether the same improvement could be achieved with an arbitrary binary matrix as the cluster info matrix, was also tested as part of the experiments. To this end, the experiments were repeated with the cluster info matrix replaced by a random binary matrix. The results showed that the cluster-to-node allocation logic of the proposed method plays a significant role in the improvements achieved. During the experiments, it was also observed that the dissimilarity of the clusters plays an essential role in the results: as expected, the difference made by the proposed method decreases as the distinguishability of the clusters weakens. In the experiments for the Fuzzy C-Means version, the same experimental procedure as in the K-Means version was followed. The Fuzzy C-Means version of the proposed method yielded even better results than the K-Means version and, consequently, than dropout, with statistically significantly higher test AUC values. The key contributions of the first proposed method can be summarized under four headings: (i) it proposes a fully connected layer that embeds the information on intrinsic clusters in the dataset into its hidden nodes; (ii) it develops a fuzzy cluster-aware regularization technique for fully connected layers; (iii) it provides experimental results indicating better performance of the proposed method in classification problems compared to dropout, the widely adopted regularization technique for fully connected layers; and (iv) it is compatible with any classification architecture that uses fully connected layers.
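The ablation baseline mentioned above can be sketched as a density-matched random mask. The keep probability of 1/3 below is an illustrative choice (matching a 3-cluster, evenly allocated cluster info matrix), not a value from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def random_binary_matrix(batch, n_hidden, keep_prob):
    # Ablation control: an arbitrary binary mask with the same expected
    # density as the cluster info matrix, but with no cluster-to-node
    # allocation logic behind it.
    return (rng.random((batch, n_hidden)) < keep_prob).astype(float)

# With 3 clusters and an even node allocation, a cluster info matrix keeps
# roughly 1/3 of the nodes active per observation, so keep_prob = 1/3
# gives a density-matched control.
control_mask = random_binary_matrix(batch=4, n_hidden=6, keep_prob=1 / 3)
```

Comparing training against this control mask isolates the effect of the allocation logic from the effect of merely masking activations.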
The second proposed method introduces a new training pipeline for fully connected layers in which the extracted features are expected to be able to cluster the dataset in the same way as in the original feature space. The method consists of two main stages: pre-training and training. In the pre-training stage, the dataset is clustered using the Fuzzy C-Means algorithm, and a matrix containing the fuzzy membership degrees of each observation to each cluster is created. The resulting fuzzy membership degrees matrix becomes an input to the second main stage of the proposed method. In the training stage, the fully connected layer is trained to minimize a combined cost function that aggregates classification and clustering costs in a weighted manner. In the experimental part for the second proposed method, the performance of a single fully connected layer trained with the proposed method is compared to that of a regular single fully connected layer. In line with similar studies, test-set accuracy is defined as the target metric. The results showed that, even at very low significance levels, the proposed method is superior to a regular fully connected layer in ten of eleven experiments. The experiments also showed that the results of the proposed method fall within a range yielding smaller, or at most equal, standard deviations compared to the results of the regular fully connected layer. Moreover, the variation of the clustering costs obtained during the training of multiple fully connected layers was investigated. The observations provided evidence that it becomes more difficult for the extracted features to learn the clustering patterns of the original feature space as we move towards the last layers.
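The combined cost of the training stage can be sketched as a weighted sum of a classification loss and the standard Fuzzy C-Means objective evaluated on the extracted features. The trade-off weight `alpha` and the fuzzifier `m` below are illustrative assumptions; the abstract does not state the weighting used in the thesis:

```python
import numpy as np

def classification_cost(probs, y):
    # Mean negative log-likelihood of the true classes.
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def clustering_cost(features, centroids, memberships, m=2.0):
    # Fuzzy C-Means objective on the extracted features, using the
    # membership degrees computed in the pre-training stage; m is the
    # usual fuzzifier (illustrative default).
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(np.sum((memberships ** m) * d2))

def combined_cost(probs, y, features, centroids, memberships, alpha=0.5):
    # Weighted aggregation of the two costs; alpha is a hypothetical
    # trade-off weight.
    return (alpha * classification_cost(probs, y)
            + (1.0 - alpha) * clustering_cost(features, centroids, memberships))

# Toy example: features that coincide with the centroids of their clusters
# give zero clustering cost, so the combined cost reduces to the weighted
# classification term.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([0, 1])
features = np.array([[0.0, 0.0], [1.0, 1.0]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
memberships = np.array([[1.0, 0.0], [0.0, 1.0]])

total = combined_cost(probs, y, features, centroids, memberships, alpha=0.5)
```

Minimizing such a cost by backpropagation, with the centroids treated as learnable parameters, is one way the clustering structure could be folded into training, in the spirit of contribution (iii) of the second method.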
Behaviors in architectures with more than one fully connected layer are not elaborated further and are left as future work. The key contributions of the second proposed method can be listed under five items: (i) it proposes a new training process that enables fully connected layers to benefit from the clustering structure of the training dataset; (ii) it puts forward an enhanced fully connected layer that can classify and cluster a dataset simultaneously; (iii) it incorporates the learning of cluster centroids into backpropagation; (iv) it presents experiments indicating superior prediction performance of the proposed method on various benchmark datasets compared to regular fully connected layers; and (v) it can be employed, without any revision, in any classification architecture that uses fully connected layers. In today's world, machine learning and, particularly, artificial intelligence applications play a significant role in decision-making and automation systems in management, healthcare, the public sector, marketing, agriculture, manufacturing, finance, and technology. In these industries, the impacts of decisions made by machine learning algorithms are huge and often have very important consequences. Impacts in healthcare and the public sector relate directly to human well-being, which is hard to quantify monetarily. On the other hand, industry reports show that the estimated financial benefits of machine learning solutions are measured in millions or even billions of dollars in the private and public sectors. The magnitudes of the financial and social impacts of machine learning use cases underline the importance of the marginal benefits that can be derived from even small performance improvements in machine learning algorithms.
Considering this potential, the two methods proposed in this thesis create a significant opportunity for performance improvements in many artificial intelligence applications and, consequently, for their financial and social impacts.
Description
Thesis(Ph.D.) -- Istanbul Technical University, Graduate School, 2023
Keywords
multi-class classification
Citation