Advanced techniques and comprehensive analysis in speech emotion recognition using deep neural networks
| dc.contributor.advisor | Köse, Hatice | |
| dc.contributor.author | Yetkin, Ahmet Kemal | |
| dc.contributor.authorID | 504201506 | |
| dc.contributor.department | Computer Engineering | |
| dc.date.accessioned | 2025-01-09T12:39:18Z | |
| dc.date.available | 2025-01-09T12:39:18Z | |
| dc.date.issued | 2024-07-01 | |
| dc.description | Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024 | |
| dc.description.abstract | The rapid advancement in artificial intelligence technologies has resulted in significant progress in human-computer interaction (HCI) and related fields. In HCI, the ability of machines to perceive and understand users' emotional states in real-time is crucial for enhancing the user experience. Accurate recognition of emotions enables machines to provide more personalized and effective services. Over the past fifty years, research on speech recognition and speech emotion recognition (SER) has made considerable strides, continuously expanding the knowledge base in this area. Speech is one of the fundamental elements of human communication and offers rich information about the speaker's emotional state. Changes in tone, speed, emphasis, and pitch play significant roles in reflecting the speaker's emotions. Therefore, analyzing speech can provide deeper insights into the speaker's feelings, thoughts, and intentions. It is widely accepted that the human voice is the primary instrument for emotional expression and that tone of voice is the oldest and most universal form of communication. In this context, the ability of machines to interpret these tones can greatly enhance the performance of HCI systems. Recognizing emotion from speech is a significant research area in affective computing. This task is challenging due to the highly personal nature of emotions, which even humans can find difficult to understand accurately. Speech emotion recognition has numerous practical applications, including emotion-aware HCI systems, traffic problem-solving, robotics, and mental health diagnosis and therapy. For instance, in customer service systems or mobile communication, a customer's emotional state can be inferred from their tone of voice, and this information can be used to provide better service. In educational support systems, it can help improve children's socio-emotional skills and academic abilities. Recognizing emotions from speech can also provide early warnings for drivers who are excessively nervous or angry, thereby reducing the likelihood of traffic accidents. Moreover, such systems hold great potential for individuals who struggle to express their emotions, such as children with autism spectrum disorder (ASD). This study aims to develop a method for detecting emotions from speech and to use this method to improve the performance of existing SER systems. In this context, various feature extraction methods have been evaluated to identify the most distinctive voice characteristics for recognizing emotions. These methods include Mel Frequency Cepstral Coefficients (MFCC), Mel spectrogram, Zero-Crossing Rate (ZCR), and Root Mean Square Energy (RMSE). The extracted features have been used in conjunction with deep learning models. Initially, these features were transformed into two-dimensional images and fine-tuned on pre-trained networks; they were then used to train a one-dimensional convolutional neural network (CNN) architecture. Finally, a combined CNN and Long Short-Term Memory (LSTM) model was used. Throughout this research, critical questions were addressed, such as whether speech features can accurately detect human emotional states and which feature extraction method performs best in the literature. The study specifically examined the impact of various feature extraction methods, including MFCC, Mel spectrogram, Chroma, RMSE, and ZCR. The effects of different image formats of the MFCC and Mel spectrogram audio features on accuracy rates, and how these formats influence model performance, were also explored. Additionally, the study aimed to determine which pre-trained model, among VGG16, VGG11_bn, ResNet-18, ResNet-101, AlexNet, and DenseNet, performs best when fine-tuned. The impact of audio data augmentation methods on test results was evaluated, analyzing how increasing and diversifying the dataset affects the overall accuracy and robustness of the models. This research aims to address these questions to contribute to the development of more accurate and robust systems for speech emotion recognition. (Illustrative code sketches of these feature extraction and modeling steps follow this record.) | |
| dc.description.degree | M.Sc. | |
| dc.identifier.uri | http://hdl.handle.net/11527/26159 | |
| dc.language.iso | en_US | |
| dc.publisher | Graduate School | |
| dc.sdg.type | Goal 9: Industry, Innovation and Infrastructure | |
| dc.subject | emotion recognition | |
| dc.subject | duygu tanıma | |
| dc.subject | speech analysis | |
| dc.subject | konuşma analizi | |
| dc.subject | speech signals | |
| dc.subject | konuşma işaretleri | |
| dc.subject | artificial intelligence | |
| dc.subject | yapay zeka | |
| dc.title | Advanced techniques and comprehensive analysis in speech emotion recognition using deep neural networks | |
| dc.title.alternative | Derin sinir ağları kullanarak konuşma duygu tanıma üzerine gelişmiş teknikler ve kapsamlı analiz | |
| dc.type | Master Thesis | |

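The abstract names MFCC, Mel spectrogram, Chroma, ZCR, and RMSE as candidate acoustic features. The record does not include the thesis code, so the following is only a minimal sketch of how such features are commonly computed with librosa; the sampling rate, number of coefficients, and time-averaging pooling are assumptions, not values taken from the thesis.

```python
# Minimal feature-extraction sketch (librosa); all parameters are illustrative,
# not the values used in the thesis.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    """Load an audio file and return one feature vector built from the
    descriptors named in the abstract: MFCC, Mel spectrogram, Chroma,
    Zero-Crossing Rate (ZCR), and Root Mean Square Energy (RMSE)."""
    y, sr = librosa.load(path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)            # (40, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # (128, T)
    mel_db = librosa.power_to_db(mel)                             # log scale
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)              # (12, T)
    zcr = librosa.feature.zero_crossing_rate(y)                   # (1, T)
    rmse = librosa.feature.rms(y=y)                               # (1, T)

    # Average each descriptor over time to obtain one vector per utterance.
    return np.concatenate([
        mfcc.mean(axis=1),
        mel_db.mean(axis=1),
        chroma.mean(axis=1),
        zcr.mean(axis=1),
        rmse.mean(axis=1),
    ])
```

Time-averaging is only one common pooling choice; for the 1D CNN and CNN-LSTM models described in the abstract, the full time axis would typically be kept instead.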
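The abstract also states that the features were rendered as two-dimensional images and fine-tuned on pre-trained networks (VGG16, VGG11_bn, ResNet-18, ResNet-101, AlexNet, DenseNet). The sketch below shows one conventional way to fine-tune a torchvision backbone on spectrogram images; the choice of ResNet-18, the number of classes, and the optimizer settings are assumptions rather than the thesis configuration.

```python
# Sketch of fine-tuning a pre-trained 2D CNN on spectrogram images
# (PyTorch / torchvision). ResNet-18, num_classes, and the learning rate
# are assumptions, not values reported in the thesis.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 8  # assumed number of emotion labels; not taken from the thesis

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimisation step on a batch of 3-channel spectrogram images
    resized to 224x224, as expected by ImageNet-pre-trained backbones."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```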
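For the combined CNN and LSTM model mentioned in the abstract, a typical arrangement is a stack of 1D convolutions over per-frame features followed by an LSTM and a linear classifier. The sketch below follows that pattern; all layer sizes, kernel widths, and the assumption of 40 input coefficients per frame are illustrative only.

```python
# Sketch of a combined 1D-CNN + LSTM classifier over per-frame features
# (e.g. 40 MFCC coefficients per frame). All layer sizes are assumptions.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_features: int = 40, n_classes: int = 8):
        super().__init__()
        # 1D convolutions extract local patterns along the time axis.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # The LSTM models longer-range temporal dependencies.
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, time)
        x = self.cnn(x)            # (batch, 128, time/4)
        x = x.transpose(1, 2)      # (batch, time/4, 128)
        _, (h, _) = self.lstm(x)   # h: (1, batch, 128)
        return self.fc(h[-1])      # (batch, n_classes)
```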
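Finally, the abstract evaluates audio data augmentation but does not list the specific methods here. The three waveform-level augmentations below (additive noise, pitch shift, time stretch) are common choices in the SER literature and are shown purely as assumptions; the parameters are likewise illustrative.

```python
# Common waveform-level augmentations for speech emotion data. The record
# does not name the exact methods used in the thesis; these three are
# illustrative choices, and all parameters are assumptions.
import numpy as np
import librosa

def add_noise(y: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Inject Gaussian noise scaled by noise_factor."""
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y: np.ndarray, sr: int, n_steps: float = 2.0) -> np.ndarray:
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def time_stretch(y: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Speed up (rate > 1) or slow down (rate < 1) without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)
```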