3D face animation generation from audio using convolutional neural networks

dc.contributor.advisor Sarıel, Sanem
dc.contributor.author Ünlü, Türker
dc.contributor.authorID 504171557
dc.contributor.department Computer Engineering Programme
dc.date.accessioned 2025-06-19T11:53:46Z
dc.date.available 2025-06-19T11:53:46Z
dc.date.issued 2022
dc.description Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2022
dc.description.abstract Generating facial animations is an important phase of creating an artificial character in video games, animated movies, or virtual reality applications. This is mostly done manually by 3D artists, who match face-model movements to each utterance of the character. Recent advancements in deep learning methods have made automated facial animation possible, and this research field has gained attention. There are two main variants of the automated facial animation problem: generating animation in 2D or in 3D space. Systems addressing the former operate on images, either generating them from scratch or modifying an existing image to make it compatible with the given audio input. Systems of the second type operate on 3D face models, which can be represented directly by a set of points in 3D space or by parameterized versions of these points. In this study, 3D facial animation is targeted. One of the main goals of this study is to develop a method that can generate 3D facial animation from speech alone, without requiring manual intervention from a 3D artist. In the developed method, a 3D face model is represented by Facial Action Coding System (FACS) parameters, called action units. Action units correspond to movements of one or more muscles of the face, and most facial expressions can be represented by a single action unit or a combination of action units. For this study, a dataset of 37 minutes of recordings is created, consisting of speech recordings and the corresponding FACS parameters for each timestep. An artificial neural network (ANN) architecture, comprising convolutional layers and transformer layers, is used to predict FACS parameters from the input speech signal. The outputs of the proposed solution are evaluated in a user study by showing participants the results of different recordings.
It has been seen that the system is able to generate animations usable in video games and virtual reality applications, even for novel speakers it was not trained on. Furthermore, generating facial animations is very easy once the system is trained. An important drawback, however, is that the generated facial animations may lack accuracy in the mouth and lip movements required by the input speech.
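The abstract describes a network that maps a sequence of audio frames through convolutional layers and transformer layers to per-frame FACS action-unit activations. The sketch below illustrates that pipeline shape in minimal NumPy; the layer sizes, the 16-dimensional action-unit output, the single attention block, and the random weights are all illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (not from the thesis): 50 audio frames, 26 features per
# frame (e.g. MFCC-like), 32 hidden channels, 16 predicted action units.
N_FRAMES, N_FEATS, HIDDEN, N_AUS = 50, 26, 32, 16

def conv1d(x, w, b):
    """Temporal 1D convolution with 'same' padding. x: (T, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[t:t + k].reshape(-1) @ w.reshape(k * w.shape[1], -1) + b
                     for t in range(x.shape[0])])

def self_attention(x):
    """Single-head self-attention with identity projections (minimal sketch)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised weights stand in for trained parameters.
w_conv = rng.normal(scale=0.1, size=(3, N_FEATS, HIDDEN))
b_conv = np.zeros(HIDDEN)
w_out = rng.normal(scale=0.1, size=(HIDDEN, N_AUS))

audio_features = rng.normal(size=(N_FRAMES, N_FEATS))        # stand-in input
h = np.maximum(conv1d(audio_features, w_conv, b_conv), 0.0)  # conv + ReLU
h = h + self_attention(h)                                    # residual attention block
action_units = sigmoid(h @ w_out)                            # AU activations in (0, 1)

print(action_units.shape)  # one action-unit vector per audio frame
```

At inference time, one such action-unit vector per frame would drive a FACS-rigged 3D face model, which is what removes the need for manual keyframing by a 3D artist.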
dc.description.degree M.Sc.
dc.identifier.uri http://hdl.handle.net/11527/27347
dc.language.iso en
dc.publisher Graduate School
dc.sdg.type Goal 9: Industry, Innovation and Infrastructure
dc.sdg.type Goal 10: Reduced Inequality
dc.sdg.type Goal 12: Responsible Consumption and Production
dc.subject Artificial intelligence
dc.subject Deep learning
dc.subject 3D face animation
dc.subject convolutional neural networks
dc.title 3D face animation generation from audio using convolutional neural networks
dc.title.alternative Evrişimsel ağlar ile sesten 3B yüz animasyonu üretilmesi
dc.type Master Thesis
Files
Original bundle
Name: 843567.pdf
Size: 3.61 MB
Format: Adobe Portable Document Format