# Volumetric Bundle Adjustment for Online Photorealistic Scene Capture

Imperial College London

## We propose a system that efficiently reconstructs photorealistic volumetric representations of complex scenes. Our system processes images online, i.e., it obtains a good-quality estimate of both the scene geometry and appearance at roughly the same rate as the video is captured, making it well suited to mobile phone capture.

Three example scenes reconstructed using VBA from an iPhone 13 video, with initial poses and depth maps from ARKit.

## Abstract

Efficient photorealistic scene capture is a challenging task. Current dense SLAM systems can operate very efficiently, but images generated from the models captured by these systems are often not photorealistic. Recent approaches based on neural volume rendering can render novel views at high fidelity, but they often require a long time to train, making them impractical for applications that require real-time scene capture. We propose a method that can reconstruct photorealistic volumes in near real time.

## VBA Overview

Our contributions are the following:
1. VDB feature volume: we propose a hierarchical feature volume using VDB grids. This representation is memory efficient and allows for fast querying of the scene information.
2. Approximate second-order optimizer: we introduce a novel optimization approach that improves the efficiency of the bundle adjustment, allowing our system to converge to the target camera poses and scene geometry much faster.
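
To illustrate why a sparse hierarchical grid can save memory, here is a minimal Python sketch of a two-level sparse grid. It is not OpenVDB and not the paper's implementation: the class name `SparseFeatureGrid`, the `leaf_size` and `feature_dim` parameters, and the zero background value are all assumptions made for illustration. Only leaf blocks that have been written to are allocated, and a lookup costs one hash-map access plus one array index.

```python
class SparseFeatureGrid:
    """Toy two-level sparse grid (hypothetical stand-in for a VDB grid).

    The top level is a hash map keyed by block coordinates; each entry is a
    dense leaf block of leaf_size**3 feature vectors.
    """

    def __init__(self, leaf_size=8, feature_dim=4):
        self.leaf_size = leaf_size
        self.feature_dim = feature_dim
        self.leaves = {}  # (bx, by, bz) -> flat list of feature vectors

    def _split(self, i, j, k):
        # Floor division and modulo also behave correctly for negative indices.
        n = self.leaf_size
        block = (i // n, j // n, k // n)
        offset = ((i % n) * n + (j % n)) * n + (k % n)
        return block, offset

    def set(self, i, j, k, feature):
        block, offset = self._split(i, j, k)
        leaf = self.leaves.get(block)
        if leaf is None:
            # Allocate a leaf block only when a voxel inside it is first written.
            leaf = [[0.0] * self.feature_dim for _ in range(self.leaf_size ** 3)]
            self.leaves[block] = leaf
        leaf[offset] = list(feature)

    def get(self, i, j, k):
        block, offset = self._split(i, j, k)
        leaf = self.leaves.get(block)
        if leaf is None:
            # Unallocated space returns the (assumed) zero background feature.
            return [0.0] * self.feature_dim
        return list(leaf[offset])


grid = SparseFeatureGrid(leaf_size=4, feature_dim=2)
grid.set(5, -1, 100, [1.0, 2.0])
```

After the single `set` call above, only one leaf block is allocated even though the written voxel lies far from the origin; a dense grid covering the same extent would store every voxel in between.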

### Method explanation

We represent the scene as a neural VDB that stores spatial scene features. These features encode the shape and appearance of the scene in a high-dimensional feature space. Raymarching is used to render views of the volume specified by a camera pose. During rendering the features are sampled from the volume using trilinear interpolation and a shallow MLP is used to project these features to color and occupancy values. The estimated pixel color is computed by integrating the color values along the ray weighted by the occupancy and visibility. This is repeated for each pixel in the image. Once all the rays are rendered, the residuals are computed as the difference between the estimated and ground truth images. The residuals are then minimized using volumetric bundle adjustment (VBA), which efficiently refines the volume and camera pose parameters.
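
The sampling and integration steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: the grid is a small dense nested list with scalar values instead of a VDB of feature vectors, the MLP is omitted (colors and occupancies are passed in directly), and the compositing rule, where each sample's weight is its occupancy times the probability that all nearer samples were empty, is one common way to read "weighted by the occupancy and visibility". The function names `trilinear_sample` and `composite_ray` are hypothetical.

```python
import math


def trilinear_sample(grid, x, y, z):
    """Interpolate a scalar grid[i][j][k] at continuous index coordinates."""
    # Lower corner of the enclosing cell, clamped so the upper corner is in bounds.
    i0 = min(max(int(math.floor(x)), 0), len(grid) - 2)
    j0 = min(max(int(math.floor(y)), 0), len(grid[0]) - 2)
    k0 = min(max(int(math.floor(z)), 0), len(grid[0][0]) - 2)
    tx, ty, tz = x - i0, y - j0, z - k0
    value = 0.0
    # Blend the eight corners; each weight is a product of per-axis weights.
    for di in (0, 1):
        wx = tx if di else 1.0 - tx
        for dj in (0, 1):
            wy = ty if dj else 1.0 - ty
            for dk in (0, 1):
                wz = tz if dk else 1.0 - tz
                value += wx * wy * wz * grid[i0 + di][j0 + dj][k0 + dk]
    return value


def composite_ray(colors, occupancies):
    """Blend per-sample colors along one ray, ordered from near to far.

    colors: list of (r, g, b) tuples; occupancies: list of floats in [0, 1].
    """
    visibility = 1.0  # probability that the ray reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for color, occ in zip(colors, occupancies):
        weight = visibility * occ
        for c in range(3):
            pixel[c] += weight * color[c]
        visibility *= 1.0 - occ  # nearer occupied samples occlude farther ones
    return tuple(pixel)


# A 2x2x2 grid holding the linear function i + 2j + 4k.
cell = [[[i + 2 * j + 4 * k for k in range(2)] for j in range(2)] for i in range(2)]
center_value = trilinear_sample(cell, 0.5, 0.5, 0.5)

# A half-occupied red sample in front of a fully occupied green sample.
pixel = composite_ray([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 1.0])
```

In this example the red sample contributes half of the pixel color and the green sample behind it contributes the remaining half, because only half of the ray's visibility survives the first sample.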

## Experimental evaluation

### Comparison to view synthesis approaches

We show the results of our method, NeRF, MVSNeRF and NSVF at three different times during optimization. The last column shows the error image. Our method and MVSNeRF perform best in terms of visual quality; however, MVSNeRF only constructs a local (view-based) representation of the scene, whereas our method constructs a global representation.

### Depth quality

We show the depth image estimated using (a) an off-the-shelf monocular depth prediction method (MiDaS), (b) the final depth estimated by our method and (c) the depth obtained from the iPhone 12 Pro LiDAR for comparison.

### Comparison to voxel-based reconstruction systems

Comparison to three real-time RGB-D reconstruction systems: (a) KinectFusion, which fuses depth readings into a dense voxel grid; (b) the colormap optimization approach, which optimizes camera poses to improve texture consistency; and (c) PolyCamAI, a commercial iPhone ARKit-based app. Panel (d) shows our method. The inset shows the processing time for each method and the time taken to capture the video. Our method achieves the best-quality scene reconstruction.

### Optimizer comparison

Visualization of the convergence of the novel approximate second-order optimizer used in our volumetric bundle adjustment (VBA) compared to Adam.
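
For readers unfamiliar with second-order methods for least-squares problems, the sketch below shows a generic damped Gauss-Newton (Levenberg-Marquardt) loop on a toy two-parameter curve fit. It is not the approximate optimizer proposed in the paper, whose details are not given on this page; the model `a * exp(b * x)`, the function names, and the damping schedule are all assumptions chosen for illustration. The point is only that each step uses curvature information from the Jacobian (through `J^T J`) rather than the gradient alone, as first-order methods such as Adam do.

```python
import math


def residuals_and_jacobian(params, xs, ys):
    """Residuals r_i = a*exp(b*x_i) - y_i and their derivatives w.r.t. (a, b)."""
    a, b = params
    r, jac = [], []
    for x, y in zip(xs, ys):
        e = math.exp(b * x)
        r.append(a * e - y)
        jac.append((e, a * x * e))
    return r, jac


def cost(params, xs, ys):
    r, _ = residuals_and_jacobian(params, xs, ys)
    return 0.5 * sum(v * v for v in r)


def lm_step(params, xs, ys, damping):
    """Solve (J^T J + damping * I) delta = -J^T r for the 2x2 case."""
    r, jac = residuals_and_jacobian(params, xs, ys)
    h00 = sum(j[0] * j[0] for j in jac) + damping
    h01 = sum(j[0] * j[1] for j in jac)
    h11 = sum(j[1] * j[1] for j in jac) + damping
    g0 = sum(j[0] * v for j, v in zip(jac, r))
    g1 = sum(j[1] * v for j, v in zip(jac, r))
    det = h00 * h11 - h01 * h01
    d0 = (-g0 * h11 + g1 * h01) / det
    d1 = (-g1 * h00 + g0 * h01) / det
    return (params[0] + d0, params[1] + d1)


def fit(xs, ys, params=(1.0, 0.0), iters=100):
    damping = 1e-3
    current = cost(params, xs, ys)
    for _ in range(iters):
        trial = lm_step(params, xs, ys, damping)
        trial_cost = cost(trial, xs, ys)
        if trial_cost < current:
            # Accept the step and trust the quadratic model more next time.
            params, current = trial, trial_cost
            damping *= 0.1
        else:
            # Reject the step and fall back toward a short gradient-like step.
            damping *= 10.0
    return params


# Synthetic, noise-free data generated from a = 2.0, b = 0.5.
xs = [0.25 * i for i in range(9)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = fit(xs, ys)
```

In bundle adjustment the same idea applies with far more parameters (camera poses and scene variables), so the linear solve dominates the cost and is usually approximated or exploited for sparsity rather than solved in closed form as in this toy case.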

## Related Works

Since the submission of this paper in November 2021, a number of excellent related papers have been released. In particular:
• Instant-NGP: introduces a hashing function to reduce the memory and computational cost of using very high-resolution voxel grids. The hashing could be used in our VDB grid to reduce memory.
• DirectVoxGO: also uses a shallow MLP, but uses Adam and stores features in a standard dense voxel grid. Our optimizer could be used to further improve its speed of convergence.
• ReLU Fields: uses only a dense voxel grid with ReLU activation for storing the volume (no MLP) and spherical harmonics for view dependence.

## Bibtex

@InProceedings{Clark_2022_CVPR,
author    = {Clark, Ronald},
title     = {Volumetric Bundle Adjustment for Online Photorealistic Scene Capture},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month     = {June},
year      = {2022},
pages     = {6124-6132}
}

## Acknowledgements

The research was supported by an Imperial College Research Fellowship (ICRF).

Webpage template from the HyperNeRF and ReLU Fields project webpages.