3D face animation generation from audio using convolutional neural networks
dc.contributor.advisor | Sarıel, Sanem | |
dc.contributor.author | Ünlü, Türker | |
dc.contributor.authorID | 504171557 | |
dc.contributor.department | Computer Engineering Programme | |
dc.date.accessioned | 2025-06-19T11:53:46Z | |
dc.date.available | 2025-06-19T11:53:46Z | |
dc.date.issued | 2022 | |
dc.description | Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2022 | |
dc.description.abstract | The problem of generating facial animations is an important phase of creating an artificial character in video games, animated movies, and virtual reality applications. It is mostly carried out manually by 3D artists, who match face model movements to each utterance of the character. Recent advances in deep learning have made automated facial animation possible, and this research field has gained attention. There are two main variants of the automated facial animation problem: generating animation in 2D or in 3D space. Systems addressing the former work on images, either generating them from scratch or modifying an existing image to match the given audio input. Systems of the second type work on 3D face models, which can be represented directly by a set of points in 3D space or by parameterized versions of these points. This study targets 3D facial animation. One of its main goals is to develop a method that can generate 3D facial animation from speech alone, without manual intervention by a 3D artist. In the developed method, a 3D face model is represented by Facial Action Coding System (FACS) parameters, called action units. Action units are movements of one or more muscles of the face; with a single action unit or a combination of different action units, most facial expressions can be represented. For this study, a dataset of 37 minutes of recordings was created, consisting of speech recordings and the corresponding FACS parameters for each timestep. An artificial neural network (ANN) architecture comprising convolutional layers and transformer layers is used to predict FACS parameters from the input speech signal. The outputs of the proposed solution are evaluated in a user study by showing the results for different recordings. The system is able to generate animations usable in video games and virtual reality applications, even for novel speakers it was not trained on, and generating facial animations requires little effort once the system is trained. An important drawback, however, is that the generated animations may lack accuracy in the mouth and lip movements required for the input speech. | |
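The abstract describes a network of convolutional and transformer layers mapping a speech signal to per-timestep FACS action-unit activations. The following is a minimal illustrative sketch of such a pipeline, not the thesis's actual implementation: the framework (PyTorch), input features (80-band mel-spectrogram frames), action-unit count (32), and all layer sizes are assumptions not given in this record.

```python
# Hypothetical sketch: audio features -> conv front end -> transformer -> AU activations.
# All dimensions below are assumed for illustration, not taken from the thesis.
import torch
import torch.nn as nn

AUDIO_FEAT_DIM = 80     # assumed: 80-band mel-spectrogram frames
NUM_ACTION_UNITS = 32   # assumed: depends on the face rig's FACS set

class AudioToFACS(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        # Convolutional front end: extracts local acoustic features and
        # downsamples the frame sequence by a factor of 4 (stride 2 twice).
        self.conv = nn.Sequential(
            nn.Conv1d(AUDIO_FEAT_DIM, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Transformer encoder: models longer-range temporal context
        # (e.g. coarticulation effects across neighboring phonemes).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Per-frame regression head: one activation per action unit,
        # squashed to [0, 1] since AU intensities are bounded.
        self.head = nn.Sequential(nn.Linear(d_model, NUM_ACTION_UNITS), nn.Sigmoid())

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, AUDIO_FEAT_DIM)
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, d_model)
        x = self.transformer(x)
        return self.head(x)  # (batch, time/4, NUM_ACTION_UNITS)

# Usage: predict AU activation curves for 2 seconds of 100 fps mel frames.
model = AudioToFACS()
action_units = model(torch.randn(1, 200, AUDIO_FEAT_DIM))
print(action_units.shape)  # torch.Size([1, 50, 32])
```

Under these assumptions the convolutional layers handle short-range acoustic structure while the transformer layers supply temporal context, and the sigmoid head yields bounded intensities that can drive a FACS-parameterized face rig frame by frame.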
dc.description.degree | M.Sc. | |
dc.identifier.uri | http://hdl.handle.net/11527/27347 | |
dc.language.iso | en | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 9: Industry, Innovation and Infrastructure | |
dc.sdg.type | Goal 10: Reduced Inequalities | |
dc.sdg.type | Goal 12: Responsible Consumption and Production | |
dc.subject | Artificial intelligence | |
dc.subject | Deep learning | |
dc.subject | 3D face animation | |
dc.subject | Convolutional neural networks | |
dc.title | 3D face animation generation from audio using convolutional neural networks | |
dc.title.alternative | Evrişimsel ağlar ile sesten 3B yüz animasyonu üretilmesi | |
dc.type | Master Thesis |