3D face animation generation from audio using convolutional neural networks
dc.contributor.advisor | Sarıel, Sanem | |
dc.contributor.author | Ünlü, Türker | |
dc.contributor.authorID | 504171557 | |
dc.contributor.department | Computer Engineering Programme | |
dc.date.accessioned | 2025-06-19T11:53:46Z | |
dc.date.available | 2025-06-19T11:53:46Z | |
dc.date.issued | 2022 | |
dc.description | Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2022 | |
dc.description.abstract | The problem of generating facial animations is an important phase of creating an artificial character in video games, animated movies, and virtual reality applications. It is mostly carried out manually by 3D artists, who match face model movements to each utterance of the character. Recent advances in deep learning have made automated facial animation possible, and this research field has gained attention. There are two main variants of the automated facial animation problem: generating animation in 2D or in 3D space. Systems addressing the former work on images, either generating them from scratch or modifying an existing image to match the given audio input. Systems of the second type work on 3D face models, which can be represented directly by a set of points in 3D space or by parameterized versions of these points. This study targets 3D facial animation. One of its main goals is to develop a method that can generate 3D facial animation from speech alone, without manual intervention by a 3D artist. In the developed method, a 3D face model is represented by Facial Action Coding System (FACS) parameters, called action units. Action units are movements of one or more muscles of the face; with a single action unit or a combination of different action units, most facial expressions can be represented. For this study, a dataset of 37 minutes of recordings was created, consisting of speech recordings and the corresponding FACS parameters for each timestep. An artificial neural network (ANN) architecture comprising convolutional layers and transformer layers is used to predict FACS parameters from the input speech signal. The outputs of the proposed solution are evaluated in a user study by showing the results for different recordings. The system is able to generate animations usable in video games and virtual reality applications, even for novel speakers it was not trained on, and generating facial animations requires little effort once the system is trained. An important drawback, however, is that the generated animations may lack accuracy in the mouth and lip movements required for the input speech. | |
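The abstract describes a network of convolutional and transformer layers mapping a speech signal to per-timestep FACS action-unit activations. The following is a minimal illustrative sketch of such a pipeline, not the thesis's actual implementation: the framework (PyTorch), input features (80-band mel-spectrogram frames), action-unit count (32), and all layer sizes are assumptions not given in this record.

```python
# Hypothetical sketch: audio features -> conv front end -> transformer -> AU activations.
# All dimensions below are assumed for illustration, not taken from the thesis.
import torch
import torch.nn as nn

AUDIO_FEAT_DIM = 80     # assumed: 80-band mel-spectrogram frames
NUM_ACTION_UNITS = 32   # assumed: depends on the face rig's FACS set

class AudioToFACS(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # Convolutional front end: extracts local acoustic features and
        # downsamples the frame sequence by a factor of 4 (stride 2 twice).
        self.conv = nn.Sequential(
            nn.Conv1d(AUDIO_FEAT_DIM, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Transformer encoder: models longer-range temporal context
        # (e.g. coarticulation effects across neighboring phonemes).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Per-frame regression head: one activation per action unit,
        # squashed to [0, 1] since AU intensities are bounded.
        self.head = nn.Sequential(nn.Linear(d_model, NUM_ACTION_UNITS), nn.Sigmoid())

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, AUDIO_FEAT_DIM)
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, d_model)
        x = self.transformer(x)
        return self.head(x)  # (batch, time/4, NUM_ACTION_UNITS)

# Usage: predict AU activation curves for 2 seconds of 100 fps mel frames.
model = AudioToFACS()
action_units = model(torch.randn(1, 200, AUDIO_FEAT_DIM))
print(action_units.shape)  # torch.Size([1, 50, 32])
```

Under these assumptions the convolutional layers handle short-range acoustic structure while the transformer layers supply temporal context, and the sigmoid head yields bounded intensities that can drive a FACS-parameterized face rig frame by frame.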
dc.description.degree | M.Sc. | |
dc.identifier.uri | http://hdl.handle.net/11527/27347 | |
dc.language.iso | en | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 9: Industry, Innovation and Infrastructure | |
dc.sdg.type | Goal 10: Reduced Inequalities | |
dc.sdg.type | Goal 12: Responsible Consumption and Production | |
dc.subject | Artificial intelligence | |
dc.subject | Deep learning | |
dc.subject | 3D face animation | |
dc.subject | Convolutional neural networks | |
dc.title | 3D face animation generation from audio using convolutional neural networks | |
dc.title.alternative | Evrişimsel ağlar ile sesten 3B yüz animasyonu üretilmesi | |
dc.type | Master Thesis |