Efficient photorealistic scene capture is a challenging task. Current dense SLAM systems can operate very efficiently, but the images generated from their reconstructed models are often not photorealistic. Recent approaches based on neural volume rendering can render novel views at high fidelity, but they often require a long time to train, making them impractical for applications that require real-time scene capture. We propose a method that can reconstruct photorealistic volumes in near real time.
We represent the scene as a neural VDB that stores spatial scene features. These features encode the shape and appearance of the scene in a high-dimensional feature space. Raymarching is used to render views of the volume specified by a camera pose. During rendering, features are sampled from the volume using trilinear interpolation, and a shallow MLP projects them to color and occupancy values. The estimated pixel color is computed by integrating the color values along the ray, weighted by occupancy and visibility; this is repeated for each pixel in the image. Once all the rays are rendered, the residuals are computed as the difference between the estimated and ground-truth images. The residuals are then minimized using volumetric bundle adjustment (VBA), which efficiently refines the volume and camera pose parameters.
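As a rough illustration of this rendering step (a sketch, not the paper's implementation), the Python/PyTorch snippet below samples features from a feature grid with trilinear interpolation, maps them to color and occupancy with a shallow MLP, and composites front-to-back along each ray. A dense grid stands in for the neural VDB, and all names (FEATURE_DIM, ShallowMLP, render_rays) are illustrative.

# Minimal sketch of the rendering step described above. A dense feature grid
# stands in for the neural VDB; names and layer sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 16  # assumed size of the per-voxel feature vector

class ShallowMLP(nn.Module):
    """Projects interpolated volume features to RGB color and occupancy."""
    def __init__(self, feature_dim=FEATURE_DIM, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 occupancy logit
        )

    def forward(self, feats):
        out = self.net(feats)
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        occ = torch.sigmoid(out[..., 3:])   # occupancy (alpha) in [0, 1]
        return rgb, occ

def render_rays(feature_volume, mlp, sample_points):
    """
    feature_volume: (1, C, D, H, W) learnable feature grid.
    sample_points:  (N_rays, N_samples, 3) points in [-1, 1]^3 along each ray.
    Returns composited pixel colors of shape (N_rays, 3).
    """
    n_rays, n_samples, _ = sample_points.shape

    # Trilinear interpolation of the feature grid at the sample points.
    grid = sample_points.view(1, n_rays, n_samples, 1, 3)
    feats = F.grid_sample(feature_volume, grid, mode='bilinear',
                          align_corners=True)              # (1, C, N_rays, N_samples, 1)
    feats = feats.squeeze(0).squeeze(-1).permute(1, 2, 0)  # (N_rays, N_samples, C)

    # Shallow MLP maps features to color and occupancy.
    rgb, occ = mlp(feats)

    # Composite: weight each sample by occupancy times visibility (transmittance).
    trans = torch.cumprod(
        torch.cat([torch.ones_like(occ[:, :1]), 1.0 - occ[:, :-1]], dim=1), dim=1)
    weights = occ * trans                                   # (N_rays, N_samples, 1)
    return (weights * rgb).sum(dim=1)                       # (N_rays, 3)

The photometric residuals minimized by VBA would then be the per-pixel differences between these rendered colors and the corresponding ground-truth pixels.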
We show the results of our method, NeRF, MVSNeRF and NSVF at three different times during optimization. The last column shows the error image. Our method and MVSNeRF perform best in terms of visual quality; however, MVSNeRF only constructs a local (view-based) representation of the scene, whereas our method constructs a global representation.
We show the depth image estimated using (a) an off-the-shelf monocular depth prediction method (MiDaS), (b) the final depth estimated by our method, and (c) the depth obtained from the iPhone 12 Pro LiDAR for comparison.
Comparison to three real-time RGB-D reconstruction systems: (a) KinectFusion, which fuses depth readings into a dense voxel grid, (b) a colormap optimization approach, which optimizes camera poses to improve texture consistency, and (c) PolyCamAI, a commercial iPhone ARKit-based app, shown alongside (d) our method. The inset shows the processing time for each method and the time taken to capture the video. Our method achieves the best-quality scene reconstruction.
Visualization of the convergence of the novel approximate second-order optimizer used in our volumetric bundle adjustment (VBA) compared to ADAM.
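The exact formulation of the approximate second-order optimizer is not described on this page; purely as an illustration of the kind of update compared against Adam, the sketch below applies a generic damped (Levenberg-Marquardt-style) Gauss-Newton step to the photometric residuals. The names (damped_gauss_newton_step, residual_fn, damping) are hypothetical.

# Illustrative damped Gauss-Newton step on photometric residuals.
# Not the paper's optimizer; a generic second-order-style update for comparison with Adam.
import torch

def damped_gauss_newton_step(params, residual_fn, damping=1e-3):
    """
    params:      (P,) flattened parameter vector (e.g. volume features + poses).
    residual_fn: maps params -> (R,) vector of photometric residuals.
    Returns the parameters after one damped Gauss-Newton update.
    """
    r = residual_fn(params)                                       # (R,)
    J = torch.autograd.functional.jacobian(residual_fn, params)   # (R, P)
    H = J.T @ J + damping * torch.eye(params.numel(),
                                      dtype=params.dtype,
                                      device=params.device)       # damped GN Hessian
    g = J.T @ r                                                    # gradient of 0.5*||r||^2
    delta = torch.linalg.solve(H, g)
    return params - delta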
@InProceedings{Clark_2022_CVPR,
    author    = {Clark, Ronald},
    title     = {Volumetric Bundle Adjustment for Online Photorealistic Scene Capture},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {6124-6132}
}
The research was supported by an Imperial College Research Fellowship (ICRF).
Webpage template adapted from the HyperNeRF and ReLU Fields project webpages.