
The human visual system relies on various opponent processes, present in both the retina and the visual cortex. These processes depend heavily on differences in colour, luminance, or motion to trigger salient reactions. Contrast, the difference in luminance and/or colour that makes objects distinguishable, plays a crucial role in subjective judgements of image quality. Images and videos captured in low light often suffer from poor quality and visibility because of limits on shutter angle, noise from high ISO settings, and a spectral bias toward blue. Traditional enhancement techniques tend to wash out details, flatten the appearance, and amplify noise.
This project aims to develop and validate a perceptually inspired deep learning framework for the joint restoration of noisy, low-light content (targeting natural history filmmaking), ensuring temporal consistency in colour, luminance, and motion.

DWTA-Net is a recurrent low-light video enhancement framework designed for extreme noise. Its two-stage design first restores local structure and colour through multi-frame alignment and Mamba-based enhancement, then performs recurrent refinement using a dynamic weight-based temporal aggregation guided by optical flow that adapts to motion. A texture-adaptive loss preserves fine detail in textured regions while suppressing noise in homogeneous areas, yielding stronger noise suppression and fewer artifacts than state-of-the-art methods.
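
As a rough illustration of motion-adaptive temporal aggregation, the sketch below blends the current frame with a previous frame that is assumed to have been warped by optical flow already. The exponential weighting rule, the `sigma` parameter, and the function name `aggregate` are illustrative assumptions and not the DWTA-Net formulation.

```python
import math

def aggregate(current, warped_prev, sigma=0.1):
    """Blend two frames (flat lists of floats in [0, 1]) pixel by pixel.

    Assumption: `warped_prev` is the previous output already aligned to
    `current` by optical flow. The exponential weight is a hand-picked
    stand-in for a dynamic weight, not the rule used by DWTA-Net.
    """
    fused = []
    for c, p in zip(current, warped_prev):
        # Small residual -> static content -> trust the history (w near 1).
        # Large residual -> motion or occlusion -> fall back to the current frame.
        w = math.exp(-abs(c - p) / sigma)
        fused.append((c + w * p) / (1.0 + w))
    return fused

current = [0.50, 0.52, 0.90]
warped_prev = [0.50, 0.48, 0.10]
fused = aggregate(current, warped_prev)
```

In this toy input the first two pixels agree with the history and are averaged with it, which reduces noise, while the third pixel differs strongly and is left almost unchanged, which avoids ghosting.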

We propose a Bayesian Enhancement Model (BEM) that leverages Bayesian Neural Networks (BNNs) to model data uncertainty and generate diverse outputs. For efficient inference, we adopt a BNN–DNN framework, where a BNN captures the one-to-many mapping in a low-dimensional space, followed by a deterministic network that refines fine-grained details.
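
The one-to-many behaviour of a Bayesian network can be illustrated with a single unit whose weight is a distribution rather than a fixed value, so repeated forward passes on the same input give different outputs. The class `BayesianLinear`, its Gaussian weight, and the chosen mean and standard deviation are hypothetical and only sketch the general idea, not the BEM architecture.

```python
import random

class BayesianLinear:
    """A 1-D linear unit whose weight is a Gaussian distribution, not a point.

    Illustrative assumption: one scalar weight with a fixed mean and standard
    deviation. A real BNN learns such distributions for many weights.
    """

    def __init__(self, w_mean, w_std, rng):
        self.w_mean = w_mean
        self.w_std = w_std
        self.rng = rng

    def forward(self, x):
        # Draw a fresh weight on every call, so the output is stochastic.
        w = self.rng.gauss(self.w_mean, self.w_std)
        return w * x

rng = random.Random(0)
layer = BayesianLinear(w_mean=2.0, w_std=0.3, rng=rng)
samples = [layer.forward(0.5) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
```

Each element of `samples` is a different plausible output for the same input, and their average stays close to the output of the mean weight.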

TempRetinex is an unsupervised Retinex-based video enhancement framework that exploits inter-frame correlations. It introduces Brightness Consistency Preprocessing to align intensity across exposures, improving robustness to varied lighting. A multiscale temporal consistency loss with occlusion-aware masking enforces frame coherence, while Reverse Inference and Self-Ensemble further enhance temporal stability and denoising.
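
Aligning intensity across exposures can be pictured as gain matching: each frame is scaled so that its mean intensity equals that of a reference frame. The single global gain, the `eps` guard, and the function name `match_brightness` are assumptions made for illustration; the actual Brightness Consistency Preprocessing may work differently.

```python
def mean(values):
    return sum(values) / len(values)

def match_brightness(frame, reference, eps=1e-6):
    """Scale `frame` so its mean intensity matches that of `reference`.

    Both inputs are flat lists of linear intensities in [0, 1]. A single
    global gain is an illustrative assumption, not the TempRetinex procedure.
    """
    gain = mean(reference) / (mean(frame) + eps)
    # Clip so that bright pixels stay inside the valid range after scaling.
    return [min(1.0, v * gain) for v in frame]

reference = [0.2, 0.4, 0.6]
darker = [0.1, 0.2, 0.3]  # same scene at half the exposure
aligned = match_brightness(darker, reference)
```

After the gain is applied, `aligned` is numerically very close to `reference`, so a later consistency loss compares content rather than exposure.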

This work introduces a conditional diffusion model for low-light video enhancement equipped with wavelet interscale attentions. By operating in the wavelet domain and conditioning the diffusion process on the degraded input, the method jointly restores illumination and fine detail while suppressing noise and preserving temporal coherence. The approach is released as BVI-CDM.

BVI-Mamba is an enhancement framework that leverages the Visual State Space (VSS) model to reduce memory usage and computational time. It comprises a feature alignment module, which registers spatio-temporal displacement between input frames in the feature space, and a UNet-like enhancement module for noise removal and brightness adjustment in which all convolutional layers are replaced by VSS blocks. It outperforms Transformer- and convolution-based models on both low-light and underwater video enhancement.
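
The efficiency of state-space models comes from a linear recurrence: each element updates a compact state once, so the cost grows linearly with sequence length instead of quadratically as in self-attention. The sketch below uses a scalar state with fixed parameters `a`, `b`, and `c`, which are assumptions chosen for illustration; real VSS blocks are considerably more elaborate.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a scalar discrete state-space recurrence over a 1-D sequence.

        h[t] = a * h[t-1] + b * x[t]
        y[t] = c * h[t]

    The fixed scalars a, b, c are illustrative assumptions. The loop does one
    constant-cost update per element, so time is O(n) and the state is O(1).
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0])
```

Feeding a single impulse shows the state decaying geometrically, which is how earlier elements keep influencing later outputs without pairwise comparisons.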

PocketDVDNet is a lightweight, real-time video denoiser built with a model-compression framework that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation. Starting from a reference model, sparsity is induced and channels are pruned, then a teacher is retrained on realistic multi-component sensor noise so the student learns implicit noise handling without explicit noise-map inputs. It reduces model size by 74% while improving denoising quality and processing 5-frame patches in real time.
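
Structured pruning removes whole channels rather than individual weights. A minimal sketch, assuming each channel has an importance score such as the magnitude of a per-channel scaling factor after sparsity training, is shown below. The scoring rule, the `keep_ratio` value, and the helper names are hypothetical and do not reproduce PocketDVDNet's criterion.

```python
def select_channels(scales, keep_ratio=0.5):
    """Return the sorted indices of the channels to keep.

    `scales` holds one importance score per channel. Both the score and
    `keep_ratio` are illustrative assumptions.
    """
    n_keep = max(1, round(len(scales) * keep_ratio))
    ranked = sorted(range(len(scales)), key=lambda i: abs(scales[i]), reverse=True)
    return sorted(ranked[:n_keep])

def prune(weights, keep):
    """Drop whole output channels (rows) that are not listed in `keep`."""
    return [weights[i] for i in keep]

scales = [0.90, 0.01, -0.75, 0.02]
weights = [[1, 2], [3, 4], [5, 6], [7, 8]]  # one row per output channel
keep = select_channels(scales)
pruned = prune(weights, keep)
```

Because entire rows are removed, the pruned layer is a smaller dense layer and needs no sparse kernels to run faster.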

DaBiT presents a novel map-guided transformer that, together with image propagation, effectively leverages the continuous spatial variation of focal blur to restore degraded footage. We also introduce a flow re-focusing module designed to efficiently align relevant features between blurry and sharp domains. In addition, we propose a novel synthetic focal blur data generation technique, which broadens the model’s learning capabilities and improves its robustness across a wider range of content.

ELVIS enables domain adaptation of state-of-the-art video instance segmentation (VIS) models to low-light scenarios. It comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile estimation network (VDP-Net), and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic dataset and beats two-stage baselines by at least +2.8 AP on real low-light videos.

This method performs instance segmentation directly on low-light imagery by embedding weighted non-local blocks (wNLB) in the feature extractor, enabling an inherent denoising process at the feature level. Because denoising happens in feature space, the approach removes the need for aligned ground-truth images and can be trained on real-world low-light data. Additional learnable per-layer weights adapt the network to real noise characteristics across feature scales, improving AP by at least +7.6 over pretrained detectors.

This paper presents a comprehensive study of the impact of low-light distortions, such as noise, on automatic object trackers. Additionally, we propose a solution that improves tracking performance by integrating denoising and low-light enhancement methods into a transformer-based object tracking system.

This work proposes a Degradation Estimation Network (DEN) that synthetically generates realistic standard RGB (sRGB) noise without requiring camera metadata, by estimating the parameters of physics-informed noise distributions in a self-supervised manner. This zero-shot pipeline produces synthetic noisy content with a diverse range of realistic noise characteristics, rather than merely replicating a single training distribution. Evaluated across synthetic noise replication, video enhancement, and object detection, it delivers improvements of up to 24% in KLD, 21% in LPIPS, and 62% in AP.
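
Physics-informed noise models typically combine signal-dependent shot noise with signal-independent read noise. The sketch below uses the common Gaussian approximation of a Poisson-Gaussian model, with variance `shot * x + read`. The two-parameter form, the values of `shot` and `read`, and the function names are assumptions for illustration; they are not the distributions or parameters estimated by DEN.

```python
import random

def add_sensor_noise(pixels, shot=0.01, read=0.0001, rng=None):
    """Add signal-dependent Gaussian noise with variance shot * x + read.

    This is the common Gaussian approximation of a Poisson-Gaussian model.
    `shot` and `read` are hypothetical values, not parameters estimated by DEN.
    """
    rng = rng or random.Random()
    noisy = []
    for x in pixels:
        sigma = (shot * x + read) ** 0.5
        noisy.append(x + rng.gauss(0.0, sigma))
    return noisy

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
dark = add_sensor_noise([0.1] * 20000, rng=rng)
bright = add_sensor_noise([0.8] * 20000, rng=rng)
```

Comparing `variance(bright)` with `variance(dark)` shows the noise growing with signal level, which is the property that a single fixed-variance Gaussian cannot reproduce.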