Neural rendering methods can achieve near-photorealistic image synthesis of scenes from posed input images. However, when the images are imperfect, e.g., captured in very low-light conditions, state-of-the-art methods fail to reconstruct high-quality 3D scenes.
Recent approaches have tried to address this limitation by modeling various degradation processes in the image formation model; however, this restricts them to the specific degradations they model. In this paper, we propose a generalizable neural rendering method that can perform high-fidelity novel view synthesis under several degradations. Our method, GAURA, is learning-based and does not require any test-time scene-specific optimization. It is trained on a synthetic dataset that includes several degradation types. GAURA outperforms state-of-the-art methods on several benchmarks for low-light enhancement, dehazing, and deraining, and performs on par with them for motion deblurring. Further, our model can be efficiently fine-tuned to any new incoming degradation using minimal data. We thus demonstrate adaptation results on two unseen degradations, desnowing and removing defocus blur.
Existing work on novel view synthesis and restoration often explicitly incorporates the physical degradation process into the rendering equation, which prevents generalization across different degradation types. We propose an alternative approach that implicitly models image formation under various corruption types through learnable parameters. Instead of creating a separate network clone for each degradation type, we encode degradation-specific information into latent codes that interact with shared network parameters. Inspired by HyperNetworks, these latent codes are mapped to the weights of an MLP that performs degradation-specific transformations. This Degradation-aware Latent Module (DLM) efficiently captures different imperfections with minimal extra parameters.
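To make the hypernetwork idea concrete, below is a minimal PyTorch sketch of such a degradation-aware latent module: a learnable latent code per degradation type is mapped to the weights of a small MLP that transforms shared features. The class name, layer sizes, and two-layer layout are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a Degradation-aware Latent Module (DLM).
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationLatentModule(nn.Module):
    def __init__(self, num_degradations, latent_dim, feat_dim, hidden_dim):
        super().__init__()
        # One learnable latent code per degradation type, shared across scenes.
        self.latents = nn.Embedding(num_degradations, latent_dim)
        # Hypernetwork: maps a latent code to the weights and biases of a
        # degradation-specific two-layer MLP applied to the features.
        self.hyper_w1 = nn.Linear(latent_dim, feat_dim * hidden_dim)
        self.hyper_b1 = nn.Linear(latent_dim, hidden_dim)
        self.hyper_w2 = nn.Linear(latent_dim, hidden_dim * feat_dim)
        self.hyper_b2 = nn.Linear(latent_dim, feat_dim)
        self.feat_dim, self.hidden_dim = feat_dim, hidden_dim

    def forward(self, feats, degradation_id):
        # feats: (..., feat_dim) features from the shared backbone.
        z = self.latents(degradation_id)                     # (latent_dim,)
        w1 = self.hyper_w1(z).view(self.hidden_dim, self.feat_dim)
        b1 = self.hyper_b1(z)
        w2 = self.hyper_w2(z).view(self.feat_dim, self.hidden_dim)
        b2 = self.hyper_b2(z)
        h = F.relu(F.linear(feats, w1, b1))                  # degradation-specific MLP
        return feats + F.linear(h, w2, b2)                   # residual transformation


# Usage: transform backbone features conditioned on a "low-light" latent (id 0).
dlm = DegradationLatentModule(num_degradations=4, latent_dim=64,
                              feat_dim=128, hidden_dim=128)
feats = torch.randn(1024, 128)
out = dlm(feats, torch.tensor(0))
```

Because only the latent codes and the small hypernetwork are degradation-specific, adding a new degradation type requires only a new latent code rather than a new network.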
While we use a set of degradation-aware latents to encode imperfection-specific information, these latents are independent of the actual input captures. We argue that considerable variation exists within each corruption type, making a purely static latent potentially suboptimal. Therefore, we propose adding a residual feature S obtained from the input view closest to the target view to be rendered. Concretely, S = pool(F_residue(I_nearest)), where pool(⋅) denotes global average pooling, F_residue is a small convolutional network, and I_nearest is the input view closest to the target viewing angle.
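The sketch below illustrates this input-dependent residual term under the same assumptions as above: a small convolutional encoder (standing in for F_residue, whose exact layout is not specified here) processes the nearest input view, and its globally average-pooled output is added to the degradation latent.

```python
# Sketch of the input-dependent residual feature S = pool(F_residue(I_nearest)).
# The encoder layout is an illustrative assumption.
import torch
import torch.nn as nn

class ResidualLatent(nn.Module):
    def __init__(self, latent_dim):
        super().__init__()
        # F_residue: a small convolutional encoder over the nearest input view.
        self.f_residue = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1),
        )

    def forward(self, latent, i_nearest):
        # i_nearest: (1, 3, H, W) input view closest to the target viewing angle.
        fmap = self.f_residue(i_nearest)               # (1, latent_dim, h, w)
        s = fmap.mean(dim=(-2, -1)).squeeze(0)         # pool(.) -> (latent_dim,)
        return latent + s                              # input-aware latent code


# Usage: perturb a static degradation latent with view-specific information.
res = ResidualLatent(latent_dim=64)
latent = torch.randn(64)
i_nearest = torch.randn(1, 3, 256, 256)
z = res(latent, i_nearest)
```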
We build our method on the recent state-of-the-art generalizable novel view synthesis (GNVS) technique, GNT, which follows a three-stage pipeline: a UNet-based feature extractor, a view transformer, and a ray transformer. To enhance these components with degradation-specific priors, we integrate DLM blocks at each decoder level of the UNet, enriching input features with information about the input degradations. In the view transformer, we replace the vanilla MLP feature mappings with DLM modules to incorporate degradation priors into the scene representation. Finally, in the ray transformer, we use DLM modules for the value feature mapping, ensuring that the learned aggregation represents a generalized volume rendering process while keeping the underlying scene geometry independent of input imperfections.
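As a rough illustration of the second stage, the sketch below shows how a DLM could replace the feed-forward MLP inside a GNT-style view transformer block; the attention layout is simplified and the block reuses the hypothetical `DegradationLatentModule` defined earlier, so it should not be read as the actual GNT architecture.

```python
# Sketch of a view transformer block where the vanilla MLP feature mapping is
# replaced by the (assumed) DegradationLatentModule defined above.
import torch
import torch.nn as nn

class ViewTransformerBlockWithDLM(nn.Module):
    def __init__(self, dlm, feat_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(feat_dim)
        self.norm2 = nn.LayerNorm(feat_dim)
        self.dlm = dlm  # replaces the vanilla MLP feature mapping

    def forward(self, point_feats, degradation_id):
        # point_feats: (num_points, num_views, feat_dim) epipolar features.
        x = self.norm1(point_feats)
        attn_out, _ = self.attn(x, x, x)          # aggregate across source views
        x = point_feats + attn_out
        # Degradation-aware feature mapping instead of a plain MLP.
        return self.dlm(self.norm2(x), degradation_id)


# Usage: one block over 1024 sampled points seen from 8 source views.
block = ViewTransformerBlockWithDLM(dlm, feat_dim=128)
point_feats = torch.randn(1024, 8, 128)
out = block(point_feats, torch.tensor(0))
```

The ray transformer would be conditioned analogously, but only on its value projection, which keeps the attention weights, and hence the learned geometry aggregation, independent of the input imperfections.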
@article{gupta2024gaura,
  title={GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views},
  author={Gupta, Vinayak and Simhachala Venkata Girish, Rongali and Varma T, Mukund and Tewari, Ayush and Mitra, Kaushik},
  journal={arXiv preprint arXiv:2407.08221},
  year={2024}
}