NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

Varun Jampani* Kevis-Kokitsi Maninis* Andreas Engelhardt Arjun Karpur Karen Truong Kyle Sargent Stefan Popov Andre Araujo Ricardo Martin-Brualla Kaushal Patel Daniel Vlasic Vittorio Ferrari Ameesh Makadia Ce Liu Yuanzhen Li Howard Zhou
(*equal contribution)

Google

The NAVI dataset consists of both in-the-wild and multi-view image collections with high-quality aligned 3D shape ground truths

[Paper] [Supplementary] [Dataset**] [License] [BibTeX]

**The dataset has been updated with video captures.

Abstract

Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where SfM techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose 'NAVI': a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation.

Dataset Highlights

The NAVI dataset consists of casually captured image collections with high-quality 3D shape and pose annotations. It comprises multi-view and in-the-wild image collections, as well as videos of 36 objects with around 29K images in total. Here are some key aspects of the dataset:

In-the-wild. In addition to typical multi-view object images, NAVI provides in-the-wild image collections where objects are captured under varying backgrounds, illuminations and cameras.
Category-agnostic. Objects in the NAVI dataset are category-agnostic with image collections of toys and decoration items that do not have any category-specific shapes.
Near-perfect 3D geometry. We use high-quality 3D scanners to get 3D shape ground-truth.
Near-perfect camera poses. We obtain high-quality 3D camera pose annotations with manual 2D-3D alignment along with rigorous verification.
Derivative annotations such as dense correspondences, depth etc. Given the near-perfect 3D shape and camera parameters, one can easily derive other high-quality annotations such as dense pixel-level correspondences, monocular depth, foreground segmentation etc.

Sample NAVI dataset images and the corresponding 3D shape alignments

Wild Image Collections

Wild image collections such as image search results or product catalogue photos are readily available on the internet and do not require any active capture effort. To advance research on 3D shape and pose estimation from such in-the-wild online image collections, we provide image collections where the objects are captured under unique backgrounds, illuminations and camera settings.

Sample in-the-wild image collections in the NAVI dataset with varying backgrounds, illuminations and cameras

Multi-View Image Collections

Most existing multi-view image collections for 3D reconstruction assume that standard structure-from-motion pipelines such as COLMAP work well enough to obtain high-quality camera poses. This may not be the case for casually captured images. In the NAVI dataset, we also captured and 3D-annotated multi-view image collections that can support further research on joint shape and camera pose estimation.

Sample multi-view image collections in the NAVI dataset which are captured with hand-held cameras in natural settings

Video Scenes

Similar to multi-view image collections, we collected and annotated video captures of scenes. Videos naturally contain blurrier frames than image collections, which can provide different challenges for 3D reconstruction.

Sample video scenes in the NAVI dataset

Dense Pixel Correspondences

Given the high-quality shape annotations, we can compute dense per-pixel correspondences across different object images. This is in contrast to most existing correspondence datasets that use sparse keypoint annotations for evaluations.
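The idea can be illustrated with a minimal sketch: lift each object pixel of one image to 3D using its depth and camera, reproject it into a second image, and keep it only if it is not occluded there. Everything named here is an assumption made for illustration, not the NAVI toolkit or its file format: the function names, a distortion-free pinhole intrinsics matrix `K`, 4x4 camera-to-world and world-to-camera matrices, depth stored as distance along the optical axis with 0 marking background, and the `tol` visibility threshold.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Lift every pixel with positive depth to a 3D point in world space."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0  # assumption: 0 marks background pixels
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)  # 3 x N
    pts_cam = (np.linalg.inv(K) @ pix) * depth[valid]  # scale rays by z-depth
    pts_world = cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]
    return pts_world.T, np.stack([u[valid], v[valid]], axis=1)


def project(points_world, K, world_to_cam):
    """Project world points; points behind the camera get NaN pixel coords."""
    pts_cam = points_world @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = pts_cam[:, 2]
    front = z > 1e-9
    uv = np.full((len(z), 2), np.nan)
    proj = pts_cam[front] @ K.T
    uv[front] = proj[:, :2] / proj[:, 2:3]
    return uv, z


def dense_correspondences(depth_a, depth_b, K_a, K_b,
                          a_to_world, world_to_b, tol=1e-2):
    """Return matching pixel coordinates (uv_a, uv_b) visible in both images."""
    pts, uv_a = unproject(depth_a, K_a, a_to_world)
    uv_b, z_b = project(pts, K_b, world_to_b)

    # Drop points that fall behind camera B.
    keep = np.isfinite(uv_b).all(axis=1)
    uv_a, uv_b, z_b = uv_a[keep], uv_b[keep], z_b[keep]

    # Drop points that land outside image B.
    cols = np.rint(uv_b[:, 0]).astype(int)
    rows = np.rint(uv_b[:, 1]).astype(int)
    h, w = depth_b.shape
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    uv_a, uv_b, z_b = uv_a[inside], uv_b[inside], z_b[inside]
    cols, rows = cols[inside], rows[inside]

    # Occlusion test: the reprojected depth must agree with B's depth map.
    visible = np.abs(depth_b[rows, cols] - z_b) < tol
    return uv_a[visible], uv_b[visible]
```

Because every foreground pixel with a valid depth yields a candidate match, the result is dense, unlike a fixed set of sparse keypoints. A suitable `tol` depends on the depth units and on the noise in the depth maps.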

Depths and Segmentations

We can also get high-quality derivative annotations such as object masks and metric depths given the ground-truth 2D-3D alignments in the NAVI dataset.
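As a rough illustration of how an aligned shape and camera yield these annotations, the sketch below splats 3D surface points into a z-buffer: the nearest point per pixel gives the depth, and the set of covered pixels gives the mask. This is a simplification under stated assumptions, not the procedure used to produce the NAVI annotations. A proper renderer would rasterize the mesh triangles, whereas this sketch assumes points already sampled densely from the surface, a distortion-free pinhole intrinsics matrix `K`, and a 4x4 world-to-camera matrix. The function name is hypothetical.

```python
import numpy as np


def render_depth_and_mask(points_world, K, world_to_cam, height, width):
    """Splat 3D surface points into a z-buffer and return (depth, mask)."""
    # Transform points into the camera frame and drop those behind the camera.
    pts_cam = points_world @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-9]

    # Pinhole projection to the nearest integer pixel.
    proj = pts_cam @ K.T
    cols = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
    rows = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    z = pts_cam[:, 2]
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)

    # Z-buffer: np.minimum.at keeps the smallest depth that hits each pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (rows[inside], cols[inside]), z[inside])

    mask = np.isfinite(depth)  # pixels covered by at least one point
    depth[~mask] = 0.0         # assumption: 0 marks background
    return depth, mask
```

The depth values come out in the units of the 3D shape and camera translation, so they are metric only if those are. With sparse point samples the mask will contain holes, which is why triangle rasterization is the better choice in practice.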

Sample NAVI images and the corresponding object segmentation masks and depth maps

BibTeX

If you find this dataset useful, please consider citing our work:

@inproceedings{jampani2023navi,
    title={NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations},
    author={Jampani, Varun and Maninis, Kevis-Kokitsi and Engelhardt, Andreas and Karpur, Arjun and Truong, Karen and Sargent, Kyle and Popov, Stefan and Araujo, Andre and Martin-Brualla, Ricardo and Patel, Kaushal and Vlasic, Daniel and Ferrari, Vittorio and Makadia, Ameesh and Liu, Ce and Li, Yuanzhen and Zhou, Howard},
    booktitle={NeurIPS},
    url={https://navidataset.github.io/},
    year={2023}
}