LEE - Computer Engineering - Master's
Recent Submissions
Items 1 - 5 of 77
-
Towards robustness in 3D point cloud analysis: Novel approaches to adversarial attacks and defences (Graduate School, 2025-01-21)

This thesis explores adversarial robustness in 3D point cloud data, addressing both the offensive and the defensive aspects of adversarial interactions. The work focuses on designing adversarial attack methods and defence mechanisms, particularly for applications in safety-critical domains such as autonomous driving, robotics, and facial recognition.

The first part of the study introduces a novel adversarial attack method, the ε-Mesh Attack. This method confines perturbations to the surface of 3D meshes, preserving the structural integrity of facial data. Unlike traditional approaches that operate within a 3D ε-ball, the ε-Mesh Attack reduces the optimization domain to 2D triangular planes by employing two projection methods: central projection and perpendicular projection. These projections ensure that adversarial manipulations remain realistic while still misleading classification models. Evaluations were conducted using PointNet and DGCNN models trained on well-known 3D datasets. The results demonstrate that the ε-Mesh Attack effectively compromises model performance while maintaining the original surface integrity.

In the second part, the thesis proposes a novel defence mechanism called Point Cloud Layerwise Diffusion (PCLD). PCLD enhances robustness through a diffusion-based purification process that operates layer by layer within the neural network: a diffusion probabilistic model is trained for each layer of the classifier, enabling hierarchical purification of adversarial perturbations. The proposed method was tested against state-of-the-art defence techniques and showed superior or comparable performance, particularly in defending against deeper-layer attacks.

The conclusions derived from this research emphasize the importance of preserving structural integrity during adversarial attacks and the effectiveness of layerwise purification in defending against such attacks. The findings contribute to advancing secure and resilient 3D point cloud processing methods, paving the way for their safe deployment in critical applications. Future work aims to extend these methods into the temporal domain and adapt them to handle emerging adversarial strategies.
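To make the projection idea concrete, the following minimal sketch (NumPy, not the authors' code; the function name and example triangle are illustrative assumptions) shows how a perpendicular projection can keep an adversarially perturbed point on the plane of its mesh triangle instead of letting it move freely inside a 3D ε-ball.

```python
# A minimal sketch (not the thesis implementation) of the perpendicular-projection
# idea described above: a gradient-based perturbation of a surface point is
# projected back onto the plane of its mesh triangle, so the adversarial point
# stays on the 2D face rather than anywhere inside a 3D epsilon-ball.
import numpy as np

def perpendicular_project(point, perturbation, triangle):
    """Apply `perturbation` to `point`, then remove the component normal to
    the triangle's plane so the result remains on that plane.

    point:        (3,) original point lying on the triangle
    perturbation: (3,) adversarial step (e.g. a signed-gradient step)
    triangle:     (3, 3) vertices of the mesh face containing the point
    """
    v0, v1, v2 = triangle
    normal = np.cross(v1 - v0, v2 - v0)
    normal = normal / np.linalg.norm(normal)   # unit plane normal
    moved = point + perturbation               # unconstrained adversarial step
    offset = np.dot(moved - v0, normal)        # signed distance off the plane
    return moved - offset * normal             # projected back onto the plane

# Example: a small random step is flattened onto the face's plane.
tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
p = np.array([0.2, 0.3, 0.0])
adv = perpendicular_project(p, np.array([0.05, -0.02, 0.1]), tri)
print(adv)  # z-component is ~0: the point stays on the triangle's plane
```

The central projection variant mentioned above likewise restricts the perturbed point to the triangular face, only along a different projection direction; in both cases the effective search space of the attack is two-dimensional.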
-
Facial expression analysis for an online usability evaluation platform (Graduate School, 2025-01-07)

Facial Expression Analysis (FEA) is crucial in human-computer interaction (HCI), because recognizing and responding to user behavior can greatly improve engagement and satisfaction. It helps evaluate user experiences in customer service and interactive online usability platforms. However, applying FEA in online environments faces many difficulties, such as different lighting setups, diverse facial structures, and spontaneous expressions. These factors can reduce the accuracy and reliability of current expression recognition models.

To address these challenges, researchers have developed POSTERv1, an advanced deep-learning model. POSTERv1 uses a feature extraction block, a transformer block, and a multilayer perceptron (MLP) classifier to classify facial expressions. Its cross-fusion transformer encoder (CFTE) layer supports the interaction of appearance-based and structural features, helping the model capture emotional cues. By including cross-fusion multi-head self-attention (CFMSA), POSTERv1 focuses on the most important areas of the face, reducing the influence of irrelevant features. A simplified version of POSTERv1 was also introduced to handle practical and computational constraints. This version removes the pyramid structure of the original model and uses only one CFTE block, making it faster while remaining highly accurate in most tests.

In this study, we built a custom dataset of videos showing customers performing tasks in different online shopping environments. We aim to observe user behavior on shopping platforms to evaluate the usability of the designed platform, and we designed two scenarios: one with the help of a moderator and one without it. In the moderator-assisted setting, participants received guidance and support when faced with difficulties. These conditions allowed us to compare emotional responses in moderator-assisted versus independent scenarios.

We also evaluated and compared the latest FEA models on popular public datasets such as AffectNet, RAF-DB, FER2013, and CK+. We specifically tested how well models trained on AffectNet perform when cross-evaluated on the FER2013 and CK+ datasets. Notably, the simplified POSTERv1 model delivered faster performance than the original POSTERv1 while preserving accuracy on most public datasets. Besides offering faster performance, the simplified model showed slightly greater robustness, making it a possible candidate for real-time emotion analysis. Based on these promising results, we used the simplified POSTERv1 to predict emotions in our custom dataset. The model showed reliable speed and efficiency in expression recognition, although factors such as lighting and individual facial features affected performance. In addition, a subset of our custom dataset included text-based sentiment annotations derived from participants' speech, labeled as positive, negative, or neutral. We compared predicted facial expressions with these sentiment labels to assess the performance of the simplified POSTERv1 model. Outputs from the custom dataset show that having a moderator encouraged more positive expressions and fewer negative reactions, demonstrating the model's ability to capture changes in emotional responses. However, the absence of full annotations limits the ability to thoroughly assess the model's performance.

Factors such as head pose (for example, looking downward) and variations in environmental conditions, including lighting, can affect the reliability of the model. These considerations underscore the need to address aspects such as illumination, facial diversity, and contextual user interactions when developing and deploying FEA tools.

In conclusion, this thesis demonstrates how advanced deep-learning models like POSTERv1 and its simplified variant can handle real-world challenges in FEA. These models achieve reliable real-time FEA by focusing on the most important facial features and adapting to different environmental and social conditions. Tests on both public datasets and a custom online shopping dataset show that the simplified POSTERv1 is faster and more efficient for real-time prediction, making it suitable for practical HCI applications. Remaining issues such as lighting conditions and certain facial poses point to areas for further study, but the approach presented in this thesis can help designers develop online platforms that better understand and respond to user behavior, ultimately improving engagement and satisfaction across interactive digital services and online usability evaluation platforms.
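The cross-fusion idea described above can be illustrated with a schematic sketch (PyTorch; not the POSTERv1 implementation, and all module names, feature sizes, and the pooling/classification head are illustrative assumptions): appearance features attend to structural features and vice versa within a single encoder block, as in the simplified single-CFTE variant.

```python
# A schematic sketch of a single cross-fusion encoder block: queries come from
# one feature stream and keys/values from the other, so appearance-based and
# structural (landmark) features interact before expression classification.
# Sizes and layer choices are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class CrossFusionBlock(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=7):
        super().__init__()
        self.app_to_struct = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_to_app = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, appearance, structure):
        # appearance, structure: (batch, tokens, dim) feature sequences
        a, _ = self.app_to_struct(appearance, structure, structure)
        s, _ = self.struct_to_app(structure, appearance, appearance)
        fused = self.norm_a(appearance + a) + self.norm_s(structure + s)
        return self.mlp_head(fused.mean(dim=1))   # pooled 7-way expression logits

logits = CrossFusionBlock()(torch.randn(2, 49, 512), torch.randn(2, 49, 512))
print(logits.shape)  # torch.Size([2, 7])
```

Using one such block instead of a multi-scale pyramid is what makes the simplified variant lighter, at the cost of losing the multi-resolution feature interaction of the original model.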
-
Learning weights of losses on multiscale in crowd counting (Graduate School, 2023)

In our daily lives, most of us spend time in crowded environments, and we can often judge how crowded those environments are. To protect public safety, crowd analysis needs to be efficient and deployable in highly crowded areas such as shopping malls, stadiums, cinemas, concerts, tourist sites, and popular streets, where it serves goals such as security, entrance-exit monitoring, annual visit rates, and instantaneous visitor counts. Having an idea of the density of public spaces can also be very important in unexpected situations. For example, during COVID-19, knowing the number of people in a place was as important as measuring interpersonal distance, and only a limited number of people were admitted to public transport, public buildings, and similar spaces. Although this is just one example, it shows the importance of crowd analysis in our social lives and of the studies that build on it.

Several studies perform crowd analysis on visual data, generally focusing on images taken from street cameras or on data collected from the internet. Although some studies target real-time operation, most studies in the literature aim to bring the predicted count as close as possible to the ground truth on standard image datasets. Like those studies, this study aims to reduce the error rate and proposes methods toward that goal. A common dataset, including images taken from street cameras and images collected from the internet, was used to enable comparison with studies in the literature. Focusing on studies that use this dataset, changes were made to the model architectures, with the goal of achieving better results by combining ideas from different studies.

Since Convolutional Neural Networks (CNNs) have been widely used in this field in recent years with very successful results, a CNN-based architecture was used in this study. Improvements were made in the optimization part by building on the Multiscale Crowd Counting and Localization method, a recent approach, and it was shown that the weight parameters used in the architecture can be learned. Whereas many studies in the literature first predict a crowd density map, this study follows a point-based approach that uses the coordinates of individual people; since coordinates are produced as output, the locations of people are determined directly. Additionally, experiments were carried out on feature dimensions and on combining different channels in the model.

When the proposed improvements were compared with studies in the literature, the crowd counting errors on the images were found to be reduced. In addition to the ShanghaiTech dataset used in the reference study (Multiscale Crowd Counting and Localization), the UCF_CC_50 dataset was also used, and results on both were compared with other studies in the literature. The error rate decreased by 12% compared to the reference study.
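The core optimization idea above, learning the weights that combine per-scale losses rather than fixing them by hand, can be sketched as follows (PyTorch). This uses a common uncertainty-style weighting scheme as an illustration of learnable loss weights; it is an assumption about the general technique, not the thesis's exact formulation.

```python
# A minimal sketch of learnable per-scale loss weights: each scale's loss gets
# a log-weight parameter that is optimized jointly with the network, instead of
# a hand-tuned constant. Illustrative only, not the thesis implementation.
import torch
import torch.nn as nn

class LearnableMultiscaleLoss(nn.Module):
    def __init__(self, num_scales=3):
        super().__init__()
        # one learnable log-weight per scale
        self.log_vars = nn.Parameter(torch.zeros(num_scales))

    def forward(self, per_scale_losses):
        total = 0.0
        for i, loss in enumerate(per_scale_losses):
            # exp(-log_var) scales the loss; the +log_var term discourages
            # the trivial solution of pushing every weight toward zero
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage: pass the per-scale counting/localization losses each iteration and
# include criterion.parameters() in the optimizer alongside the model's.
criterion = LearnableMultiscaleLoss(num_scales=3)
losses = [torch.tensor(1.2, requires_grad=True),
          torch.tensor(0.7, requires_grad=True),
          torch.tensor(0.3, requires_grad=True)]
print(criterion(losses))
```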
-
3D face animation generation from audio using convolutional neural networks (Graduate School, 2022)

The problem of generating facial animations is an important phase of creating an artificial character in video games, animated movies, or virtual reality applications. This is mostly done manually by 3D artists, who match face model movements to each speech segment of the character. Recent advancements in deep learning have made automated facial animation possible, and this research field has gained attention.

There are two main variants of the automated facial animation problem: generating animation in 2D or in 3D space. Systems addressing the former work on images, either generating them from scratch or modifying an existing image to make it compatible with the given audio input. Systems addressing the latter work on 3D face models, which can be represented directly by a set of points in 3D space or by parameterized versions of these points. In this study, 3D facial animation is targeted. One of the main goals is to develop a method that can generate 3D facial animation from speech only, without requiring manual intervention from a 3D artist.

In the developed method, a 3D face model is represented by Facial Action Coding System (FACS) parameters, called action units. Action units are movements of one or more muscles on the face, and most facial expressions can be represented by a single action unit or a combination of action units. For this study, a dataset of 37 minutes of recordings was created, consisting of speech recordings and the corresponding FACS parameters for each timestep. An artificial neural network (ANN) architecture comprising convolutional layers and transformer layers is used to predict FACS parameters from the input speech signal.

The outputs of the proposed solution were evaluated in a user study by showing the results of different recordings. The system is able to generate animations usable in video games and virtual reality applications, even for novel speakers it was not trained on. Furthermore, generating facial animations is very fast once the system is trained. An important drawback, however, is that the generated animations may lack accuracy in the mouth and lip movements required for the input speech.
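The pipeline described above (convolutional feature extraction over the audio signal, transformer layers for context, and per-frame action-unit prediction) can be sketched as follows in PyTorch. This is an illustrative sketch under assumptions; the layer counts, feature dimensions, audio representation, and number of action units are not taken from the thesis.

```python
# A simplified sketch of an audio-to-FACS model: Conv1d layers capture local
# acoustic patterns, a transformer encoder models longer-range context, and a
# linear head predicts per-frame action-unit activations. Illustrative only.
import torch
import torch.nn as nn

class AudioToActionUnits(nn.Module):
    def __init__(self, audio_dim=80, hidden=256, num_action_units=46):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(audio_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(hidden, num_action_units)

    def forward(self, audio_features):
        # audio_features: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        x = self.conv(audio_features.transpose(1, 2)).transpose(1, 2)
        x = self.transformer(x)
        return torch.sigmoid(self.head(x))  # per-frame AU activations in [0, 1]

model = AudioToActionUnits()
aus = model(torch.randn(1, 100, 80))  # 100 audio frames
print(aus.shape)                      # torch.Size([1, 100, 46])
```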
-
Gesture recognition and customization on textile-based pressure sensor array (Graduate School, 2024-08-01)

The tactile sensation plays an essential part in perceiving and interacting with our surroundings, making touch-based technologies increasingly significant in everyday life. These technologies cover a wide spectrum, from cellphones and tablets to sensors made entirely of textiles. When creating tactile sensing systems, multiple parameters need to be taken into account. Although touchscreens are the ideal choice for systems that need visual feedback, wearable technology requires devices that are soft, flexible, adjustable to the shape of the human body, and free from safety concerns. A textile-based capacitive pressure sensor array was selected for pressure sensing in the gesture recognition system because of its accurate pressure detection and lightweight design.

A pressure sensor array consisting of 11x2 sensors was manufactured; it determines the location of applied pressure through variations in capacitance and generates a 22-dimensional capacitance data vector. To detect regions of higher capacitance when pressure is applied with the fingertips, a sequence of data processing steps, including calibration, scaling, and flattening, is executed. The processed data reveal a series of consecutively pressed cells on the textile sensor. To analyze the patterns of the pressed cells, a deep learning model, Long Short-Term Memory (LSTM), and a machine learning model, the Hidden Markov Model (HMM), were used. The results of the two models were compared, and a high level of accuracy was achieved with both.

Considering the difficulties caused by memorizing gestures or the inability of some users to execute pre-defined gestures, enabling the creation and customization of new gestures was considered crucial. To address this, a class-incremental approach was implemented. The proposed approach deals with two primary problems: missing previous data and the inability to identify a new class of data. Changes were made to the LSTM layer and the output layer of the existing model. The number of new data samples affects the balance between usability and accuracy: as the number of samples grows, training time increases and usability decreases, while with fewer samples the accuracy of the model drops. To tackle this, a compromise was made by gathering a small number of samples from the user and then enlarging the dataset with data augmentation techniques. To reduce the risk of forgetting previous classes, the model was also trained with past data as inputs.

An experiment was carried out in two phases, involving a total of 20 people. During the first phase, the participants were instructed to execute four predetermined gestures (up, down, pinch, and zoom). During the second phase, participants were instructed to repeatedly execute a new gesture in order to collect data for building the model of the new gesture. Afterwards, they were instructed to execute both the predetermined and the newly defined gestures to confirm that the new data were recognized and the previous data were not forgotten. This phase was repeated to demonstrate the capacity to define several new gestures.

Recognition of the predetermined gestures in the first phase achieved accuracies of 95.3% for the deep learning model and 91.5% for the machine learning model. After including one new gesture, the overall system achieved an accuracy of 96.3% for the deep learning model and 91.7% for the machine learning model. After introducing a second new gesture, the accuracy dropped slightly to 94.7% for the deep learning model and 89.4% for the machine learning model. Further studies will investigate the use of this technology as a controller.
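The two mechanisms described above, an LSTM classifier over sequences of 22-dimensional capacitance vectors and a class-incremental expansion of the output layer for a newly recorded gesture, can be sketched as follows (PyTorch). This is a minimal sketch under assumptions; the hidden size, sequence length, and the weight-copying strategy are illustrative, not the thesis code.

```python
# A minimal sketch of (1) an LSTM gesture classifier over 22-dimensional
# capacitance frames and (2) a class-incremental step that widens the output
# layer for a new user-defined gesture while keeping the weights already
# learned for the predefined gestures. Illustrative assumptions throughout.
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    def __init__(self, input_dim=22, hidden=64, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                 # x: (batch, time, 22)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])             # logits over known gestures

def add_gesture_class(model):
    """Grow the output layer by one class, copying the existing weights."""
    old = model.fc
    new = nn.Linear(old.in_features, old.out_features + 1)
    with torch.no_grad():
        new.weight[:old.out_features] = old.weight
        new.bias[:old.out_features] = old.bias
    model.fc = new
    return model

model = GestureLSTM()                        # up, down, pinch, zoom
model = add_gesture_class(model)             # user-defined fifth gesture
print(model(torch.randn(8, 30, 22)).shape)   # torch.Size([8, 5])
```

After the expansion, the model would be fine-tuned on the few augmented samples of the new gesture together with samples of the old classes, which is what mitigates forgetting in the class-incremental setting described above.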