Image reconstruction and restoration with diffusion networks
Files
Date
2025-06-22
Authors
Parapan, Onur
Journal Title
Journal ISSN
Volume Title
Publisher
İTÜ Lisansüstü Eğitim Enstitüsü
Abstract
This thesis investigates the effectiveness of diffusion-based deep learning approaches, which have come to the fore in image processing in recent years, in image reconstruction and restoration tasks. Reconstruction and restoration aim to recover the original structure from incomplete, degraded, or low-resolution images, and are of great importance in many critical areas such as medical imaging, security systems, satellite imaging, and the digitization of cultural heritage. The study first details classical methods (frequency-domain techniques, matrix completion approaches, variational modeling, etc.) and then discusses the contributions that deep learning based models, in particular convolutional neural networks (CNN), autoencoders, and generative adversarial networks (GAN), have brought to these areas. However, it is shown that these traditional and early deep learning approaches cannot simultaneously maintain structural integrity and visual quality in challenging scenarios involving missing data, noise, and low resolution. Diffusion models are a probabilistically grounded image generation and restoration method developed to overcome these difficulties. These models operate by iteratively noising an image and then reconstructing it, and they offer significant advantages over conventional methods in preserving structural consistency, handling fine detail precisely, and producing high-quality results. The thesis examines in detail methods such as the denoising diffusion probabilistic model (DDPM), plug-and-play diffusion, and residual shifting diffusion, along with denoising diffusion restoration models (DDRM), the efficient diffusion model for image restoration (DiffIR), and deep plug-and-play image restoration (DPIR). Experiments show that, in image completion and super-resolution tasks, diffusion models reach better structural similarity (structural similarity index measure, SSIM), pixel-level fidelity (peak signal-to-noise ratio, PSNR), and perceptual similarity (learned perceptual image patch similarity, LPIPS) scores than direct learning approaches such as U-Net. In particular, the residual shifting method contributes substantially to accelerating the diffusion process while preserving sampling quality. These results are consistent with recent developments in the academic literature and demonstrate the models' potential for practical applications. The quantitative and qualitative results obtained in this study are evaluated systematically. In comparative analyses between U-Net and diffusion-based models, the diffusion models deliver superior PSNR and SSIM values in many scenarios. In terms of visual quality, diffusion-based networks are likewise observed to produce more realistic, less artifact-prone, and structurally consistent outputs. However, the high computational cost of diffusion models, in particular the time and memory burden of the sampling process, remains a limitation that has yet to be resolved. In this context, the model's efficiency has been improved through alternative structures such as residual shifting and latent diffusion.
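For reference, the iterative noising-and-reconstruction scheme summarized above corresponds to the standard DDPM formulation; the equations below use the conventional notation (variance schedule beta_t, noise-prediction network epsilon_theta) rather than the thesis's own symbols:

    % Forward (noising) process: a fixed Markov chain with variance schedule \beta_t
    q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)
    % Closed form for an arbitrary step t, with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
    q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, \mathbf{I}\big)
    % Training objective: the network \epsilon_\theta is trained to predict the injected noise
    \mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\big]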
In conclusion, this thesis shows that diffusion-based models offer a strong solution not only in theory but also in terms of applicability to real-world problems. Especially in tasks where structural integrity is critical, they outperform classical methods and conventional deep learning architectures. Accordingly, diffusion models are expected to be used more widely in future image processing applications. The thesis closes by discussing its limitations and offering suggestions for improving computational efficiency.
This thesis explores the effectiveness of diffusion-based deep learning models in the tasks of image reconstruction and restoration—two of the most critical problems in the field of computer vision and image processing. These tasks are essential in real-world applications such as medical imaging, satellite observation, security systems, and digital preservation of cultural heritage, where the goal is to retrieve or restore lost or degraded information with high fidelity and structural integrity. At the outset of this study, traditional image reconstruction and restoration approaches were investigated, including frequency domain techniques, matrix completion strategies, and variational methods. Although these classical algorithms offer interpretable mathematical formulations and moderate performance in ideal conditions, they tend to fall short when dealing with complex, high-dimensional, or noisy data. Especially in cases involving missing pixels, occlusions, noise corruption, or low-resolution imaging, traditional techniques cannot adequately preserve structural information while delivering visually plausible outputs. With the rise of deep learning, convolutional neural networks (CNN), autoencoders, generative adversarial networks (GAN), and U-Net architectures have significantly transformed image enhancement pipelines. These models have shown considerable improvements in tasks like denoising, inpainting, super-resolution, and deblurring, often surpassing classical methods in quantitative and qualitative metrics. However, it has been observed that such models—despite their strength in learning complex mappings—still face challenges in highly ill-posed scenarios. Specifically, they may fail to generate outputs that are both perceptually accurate and structurally faithful, especially when the input data is extremely degraded or incomplete. Diffusion models represent a new generation of probabilistic generative models that address these shortcomings by learning the data distribution through a gradual noising and denoising process. In essence, these models corrupt input images by adding Gaussian noise over several steps and then learn to reverse this process through a series of learned denoising steps. This approach, inspired by stochastic differential equations, has been shown to produce high-quality, structurally consistent outputs that align well with human perception. In this study, several diffusion-based frameworks were explored and implemented, including the denoising diffusion probabilistic model (DDPM), plug-and-play diffusion, and residual shifting diffusion, as well as advanced methods like the efficient diffusion model for image restoration (DiffIR), denoising diffusion restoration models (DDRM), and deep plug-and-play image restoration (DPIR). These models were evaluated in terms of their performance on inpainting and super-resolution tasks using standard metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS). Experimental results revealed that diffusion models consistently outperformed U-Net-based architectures in both pixel-wise accuracy and perceptual quality. For example, in image completion scenarios, diffusion models achieved significantly higher SSIM and lower LPIPS scores, indicating better preservation of image structure and improved perceptual realism. The residual shifting technique, in particular, was effective in accelerating the sampling process without sacrificing output quality.
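As an illustration of how such an evaluation is typically scripted, the following is a minimal sketch using the scikit-image and lpips packages; it is illustrative only, not the evaluation code used in the thesis:

    # Minimal PSNR/SSIM/LPIPS evaluation sketch; assumes 8-bit RGB arrays
    # of equal shape. Illustrative only, not the thesis's own code.
    import numpy as np
    import torch
    import lpips  # pip install lpips
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate(restored: np.ndarray, reference: np.ndarray) -> dict:
        # PSNR: pixel-wise fidelity in dB (higher is better)
        psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
        # SSIM: local structural similarity in [0, 1] (higher is better)
        ssim = structural_similarity(reference, restored, data_range=255,
                                     channel_axis=-1)
        # LPIPS compares deep features (lower is better); the pretrained
        # network expects NCHW tensors scaled to [-1, 1]
        loss_fn = lpips.LPIPS(net='alex')
        to_t = lambda im: (torch.from_numpy(im).permute(2, 0, 1)
                           .unsqueeze(0).float() / 127.5 - 1.0)
        lp = loss_fn(to_t(restored), to_t(reference)).item()
        return {'psnr': psnr, 'ssim': ssim, 'lpips': lp}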
This architectural enhancement improves sampling efficiency—a known bottleneck in diffusion models—by introducing a residual mechanism that enables faster convergence and improved visual results even in early iterations. Moreover, the diffusion models demonstrated strong generalization across different degradation types. Their probabilistic nature allowed them to model complex conditional distributions and generate outputs with richer details, even when the input data was severely corrupted. Unlike deterministic models like U-Net, which may tend to produce oversmoothed or repetitive patterns in uncertain regions, diffusion models could adaptively sample multiple plausible solutions, thereby providing outputs that were more natural and diverse while retaining consistency with the available context. The thesis further provides an extensive quantitative comparison between conventional models and diffusion-based methods, showing that the latter achieve higher PSNR and SSIM values in most test cases. In terms of perceptual similarity, diffusion models exhibited lower LPIPS values, which correspond to improved perceptual closeness to ground truth images as judged by learned deep feature representations. The improvement in LPIPS is particularly important, as it suggests that diffusion models produce images that are not only numerically accurate but also visually more convincing to human observers. Despite these advantages, diffusion models are not without their limitations. Their computational cost remains significantly higher compared to feedforward CNN-based models due to the iterative sampling process. Each denoising step requires a separate forward pass through the model, which, when accumulated across dozens or hundreds of steps, results in longer inference times and higher memory consumption. This limitation restricts the practicality of diffusion models in real-time or resource-constrained environments. To mitigate this issue, this thesis explores latent diffusion and residual shifting strategies that aim to reduce the number of sampling steps while maintaining or improving reconstruction quality. The use of autoencoders as part of latent diffusion frameworks also plays a pivotal role in reducing dimensionality and accelerating inference. By operating in a compressed latent space, diffusion models can reconstruct images with fewer steps and reduced computational burden. The integration of variational autoencoders (VAEs) with diffusion processes demonstrates promising results in balancing quality and efficiency, especially in high-resolution tasks like medical image restoration or satellite data enhancement. In conclusion, this thesis illustrates that diffusion-based models offer a compelling alternative to both traditional reconstruction methods and early deep learning architectures. Their ability to produce high-fidelity images from incomplete, noisy, or low-resolution inputs positions them as powerful tools in critical applications where visual quality and structural accuracy are non-negotiable. Particularly in tasks involving medical imaging or historical document restoration—where data loss is irreversible and reconstruction must preserve fine details—diffusion models show substantial potential. The research also suggests future directions for enhancing the efficiency and scalability of these models. 
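The efficiency argument can be made concrete with a short sketch of latent-diffusion restoration; the component names (encoder, decoder, denoise_step) are hypothetical placeholders, not the modules used in the thesis:

    # Conceptual sketch: the iterative denoising loop runs entirely in the
    # compressed latent space, so each step is cheap; a single decode maps
    # the result back to pixel space. All components are placeholders.
    import torch

    @torch.no_grad()
    def restore(y, encoder, decoder, denoise_step, num_steps):
        cond = encoder(y)              # compress the degraded input to a latent code
        z = torch.randn_like(cond)     # reverse process starts from pure noise
        for t in reversed(range(num_steps)):
            # one denoising step in the low-dimensional latent space,
            # conditioned on the encoded degraded input
            z = denoise_step(z, t, cond)
        return decoder(z)              # single decode back to image resolution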
Techniques such as guided sampling, conditional diffusion, hybrid transformer-diffusion architectures, and faster sampling schedulers (e.g., denoising diffusion implicit models (DDIM) and diffusion probabilistic model solvers (DPM-Solver)) may further optimize the performance of these systems. By addressing current computational limitations, these advancements could facilitate the broader adoption of diffusion models in both academic and industrial applications. The extended evaluations and comparative analyses presented in this thesis provide strong evidence that diffusion networks, especially when equipped with recent innovations like residual shifting and latent sampling, represent a state-of-the-art solution for image reconstruction and restoration. They not only outperform traditional methods but also set a new standard in generating visually coherent and structurally accurate images under challenging conditions.
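As one concrete example of such a scheduler, the deterministic DDIM update (eta = 0) can jump across many timesteps at once; a minimal sketch, with a_bar denoting the cumulative product of the noise-schedule alphas:

    # One deterministic DDIM update (eta = 0); illustrative of how step
    # skipping works, not code from the thesis.
    import math
    import torch

    @torch.no_grad()
    def ddim_step(x_t, eps, a_bar_t: float, a_bar_prev: float):
        # Clean-image estimate implied by the current noise prediction eps
        x0_hat = (x_t - math.sqrt(1.0 - a_bar_t) * eps) / math.sqrt(a_bar_t)
        # Jump directly to the (possibly much earlier) previous timestep;
        # with eta = 0 no noise is re-injected, which permits large jumps
        return math.sqrt(a_bar_prev) * x0_hat + math.sqrt(1.0 - a_bar_prev) * eps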
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2025
Keywords
telecommunications