Lifelong learning for auditory scene analysis

Date
2022-07-04
Authors
Bayram, Barış
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
With the evolution of artificial intelligence and machine learning and recent advances in sensor technology, scene analysis is receiving increasing attention for the automatic sensing and understanding of dynamic environments containing various targets, non-target objects, and noise. Sensory information from such environments can be analyzed efficiently to infer events, activities, and the objects involved. However, many real-world issues, such as background noise, overlapping targets, and high processing complexity, hinder the robust sensing and information processing required for critical real-time tasks in dynamic environments. Automatic scene analysis is centrally concerned with collecting and extracting useful knowledge about objects and events in order to characterize scenes in terms of places, situations, events, and activities. Most scene analysis studies have focused on visual processing approaches for analyzing objects in various environments. Auditory Scene Analysis (ASA), which relies on the perception and analysis of stationary and non-stationary sounds from environmental events and activities, background noise, human voices, and other sound sources, has been applied to various real-world tasks. In realistic environments, the dynamic spatio-temporal nature and complexity of environmental sounds, together with the occurrence of novel events, can degrade ASA performance; ASA in real-world environments is therefore a difficult task that has not been extensively investigated. Lifelong learning, an increasingly crucial task in artificial intelligence, is a continuous process of acquiring and adapting knowledge from dynamic environments. In this thesis, the task of lifelong ASA for Acoustic Event Recognition (AER) with Acoustic Novelty Detection (AND) is addressed: detecting novel acoustic events, recognizing known events, and learning in a self-learning manner. The problem is investigated by identifying and tackling the issues that affect ASA in a real-world learning environment. The main issues of lifelong learning in realistic environments are (i) the existence of novel acoustic classes, (ii) the existence of unlabeled data, (iii) the cost of annotation, (iv) the lack of adequate data for novel classes, (v) imbalanced data across classes, (vi) forgetting of previously learned data, (vii) the lack of memory for storing all the data, and (viii) limited computational power for lifelong learning. Real-time lifelong ASA for intelligent systems, agents, or robots in dynamic acoustic environments is still an open issue, and recent deep learning methods for ASA have not yet been investigated in a way that addresses these real-world issues. In this thesis, two approaches targeting the main issues are investigated: (1) a real-time ASA approach and (2) a deep learning-based ASA approach, both able to recognize acoustic events, detect novel events, and then learn through their AER and AND models. Each approach has its own limitation, namely the lack of incremental learning capability in the real-time approach for ASA in a realistic environment, and the computational time of the deep learning-based approach. One of the main differences between the approaches is that the real-time ASA is applied to a streaming signal.
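As a concrete illustration of operating on a streaming signal, the sketch below segments variable-length acoustic activity patterns from a continuous waveform using short-time energy. It is a minimal stand-in for the SSL-driven segmentation described below, not the thesis's method; the sampling rate, frame sizes, energy threshold, and minimum duration are illustrative assumptions.

```python
import numpy as np

def segment_stream(signal, sr=16000, frame_ms=25, hop_ms=10,
                   energy_thresh=1e-3, min_frames=10):
    """Cut variable-length acoustic activity patterns out of a streaming
    signal via short-time energy (hypothetical parameters throughout)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    segments, start = [], None
    for i in range(0, len(signal) - frame + 1, hop):
        energy = np.mean(signal[i:i + frame] ** 2)
        if energy >= energy_thresh and start is None:
            start = i                                 # activity onset
        elif energy < energy_thresh and start is not None:
            if (i - start) // hop >= min_frames:      # drop spurious blips
                segments.append(signal[start:i + frame])
            start = None                              # activity offset
    # an activity still open at the buffer's end stays pending until
    # the next chunk of the stream arrives
    return segments
```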
Thus, each salient sound source in an acoustic scene is identified and localized by a Sound Source Localization (SSL) method so that a source-specific analysis of its signal can be performed robustly. In addition to the SSL method, a segmentation technique is employed to extract variable-length time-series audio patterns of acoustic activities from the streaming signal, enabling efficient analysis of events and scenes. Moreover, the two approaches differ in the audio features used for AER and AND, reflecting the requirements of their algorithms and of real-time processing. The first approach to lifelong learning in ASA, based on a multilayered Hidden Markov Model (HMM), comprises six main steps: (1) SSL for detecting and tracking the location of the most salient sound source in a scene to enable source-specific analysis, (2) segmentation of time-series audio patterns from the streaming signal, (3) feature extraction from the segmented patterns and construction of a feature set for each pattern, (4) semi-supervised AER performed by class-specific HMMs associated with known events, (5) AND carried out on the outputs of the AER module using a single HMM covering all known events, and (6) lifelong self-learning (Chapter 4). In the lifelong learning step, the models are updated: after an event is recognized, its HMM is retrained on the most likely knowledge selected from the event's previous and new knowledge, while for a newly detected acoustic event, a class-specific model is generated and the AND model is retrained. Chapter 4 details the offline and real-time experiments, which are conducted on streaming signals from a real domestic environment; they demonstrate that, for real-time ASA, the HMM, modeling time-series audio patterns from the streaming signal, is the most efficient algorithm for AER and AND. Chapter 5 explains the steps of the other proposed lifelong learning approach, a deep learning-based approach for ASA in offline mode: (1) raw acoustic signal pre-processing, (2) extraction of a low-level, time-varying spectral representation (spectrogram) and of deep audio features, (3) AND, (4) acoustic signal augmentation, (5) AER, and (6) Incremental Class-Learning (ICL) on the audio features of novel events. Self-learning on the different types of audio features extracted from the acoustic signals of various events occurs without human supervision. For the extraction of deep audio representations, Factorized Time-Delay Neural Network (F-TDNN) and TDNN-based Long Short-Term Memory (TDNN-LSTM) networks, in addition to Visual Geometry Group (VGG) and Residual Neural Network (ResNet) models, are pre-trained on the large-scale Google AudioSet dataset. As network inputs, Mel-Frequency Cepstral Coefficients (MFCCs) and raw signals are used by the F-TDNN and TDNN-LSTM, respectively, while Mel-spectrograms are taken by the VGG and ResNet. The performance of ICL with AND, using Mel-spectrograms and deep features from the TDNNs, VGG, and ResNet, is validated on the benchmark audio datasets ESC-10, ESC-50, and UrbanSound8K (US8K), as well as on an audio dataset collected in a real domestic environment that is also used with the proposed real-time ASA approach.
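A minimal sketch of steps (4) and (5), assuming the hmmlearn and librosa libraries (library choices of this sketch, not named in the thesis): one Gaussian HMM per known class scores a segmented pattern for AER, and a single HMM trained over all known events flags a pattern as novel when its per-frame log-likelihood falls below a hand-tuned threshold. The MFCC setup, state count, and threshold are placeholders, and the thesis's semi-supervised training and self-learning updates are omitted.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_seq(y, sr=16000, n_mfcc=13):
    # (T, n_mfcc) time-series features for the HMMs
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_models(train_data, n_states=5):
    """train_data: dict mapping class name -> list of waveforms.
    Returns class-specific HMMs (AER) plus one HMM over all known
    events (AND), mirroring steps (4) and (5) above."""
    class_models, all_feats = {}, []
    for label, clips in train_data.items():
        feats = [mfcc_seq(y) for y in clips]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(np.vstack(feats), [f.shape[0] for f in feats])
        class_models[label] = m
        all_feats.extend(feats)
    and_model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    and_model.fit(np.vstack(all_feats), [f.shape[0] for f in all_feats])
    return class_models, and_model

def recognize(y, class_models, and_model, novelty_thresh=-60.0):
    """Label a segmented pattern with the most likely known class, or
    'novel' when the all-events HMM scores it below the threshold."""
    X = mfcc_seq(y)
    if and_model.score(X) / X.shape[0] < novelty_thresh:  # per-frame log-lik.
        return "novel"
    return max(class_models, key=lambda c: class_models[c].score(X))
```

In a lifelong setting, a "novel" decision would trigger generating a new class-specific HMM and retraining the AND model, as in step (6).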
The results demonstrate that the FearNet algorithm with VGG-16 features is the most promising algorithm for incremental learning of new acoustic classes in the audio domain, and that the Gaussian Mixture Model (GMM) yields the best AND performance across various AND scenarios. Moreover, for ICL with AND, FearNet integrated with a GMM proves effective for scene analysis in a real-world acoustic environment, dealing with novel events and recognizing unlabeled data through the ICL-based AER. Efficient performance on the AND and AER tasks is also observed with F-TDNN features, and iCaRL with VGG features achieves ICL performance close to that of FearNet on the ESC-10 and Domestic datasets.
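For the GMM-based AND over deep audio embeddings, the sketch below illustrates one plausible formulation using scikit-learn (again a library choice of this sketch): a mixture is fit on embeddings of the known classes, and a rejection threshold is set from a quantile of their log-likelihoods. The embedding source (e.g., VGG-16 features), component count, and quantile are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GmmNoveltyDetector:
    """Flag acoustic events as novel when their deep-feature embedding
    is unlikely under a GMM fit on the known classes."""

    def __init__(self, n_components=8, quantile=0.01):
        # component count and rejection quantile are illustrative
        self.gmm = GaussianMixture(n_components=n_components,
                                   covariance_type="diag")
        self.quantile = quantile
        self.thresh = None

    def fit(self, known_embeddings):
        # known_embeddings: (N, D) array, e.g. VGG-16 features
        self.gmm.fit(known_embeddings)
        scores = self.gmm.score_samples(known_embeddings)
        # threshold chosen so ~1% of the known data would be rejected
        self.thresh = np.quantile(scores, self.quantile)
        return self

    def is_novel(self, embeddings):
        # True where an embedding's log-likelihood falls below threshold
        return self.gmm.score_samples(embeddings) < self.thresh
```

Embeddings flagged as novel would then feed the incremental class-learning stage (e.g., FearNet or iCaRL), while the remainder are routed to the ICL-based AER.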
Description
Thesis (Ph.D.) -- Istanbul Technical University, Graduate School, 2022
Keywords
lifelong learning, sensor technology
Citation