ConceptFusion: Open-set Multimodal 3D Mapping

Robotics: Science and Systems (RSS) 2023

1MIT, 2Université de Montréal, 3University of Toronto, 4IIIT Hyderabad, 5CMU, 6Amazon, 7Matician, 8DEVCOM Army Research Laboratory
*Co-second authors. Work done prior to current affiliation.

ConceptFusion builds open-set 3D maps that can be queried via text, click, image, or audio.

Video

Abstract

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts.

We address both these issues with ConceptFusion, a scene representation that is: (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multi-modal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today’s foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% in 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping.

Approach

Construct pixel-aligned features

ConceptFusion constructs pixel-aligned features from off-the-shelf foundation models that can only produce a global (image-level) embedding vector. This is achieved by processing input images to generate generic (class-agnostic) object masks and extracting a local feature for each region, computing a global feature for the input image as a whole, and fusing the region-specific features with the global feature using our proposed zero-shot pixel alignment technique.
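The sketch below illustrates this step, assuming OpenCLIP (ViT-H/14) as the foundation model and region bounding boxes from any class-agnostic mask generator (e.g., SAM); the helpers embed and region_features are illustrative names, not the released implementation.

import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

# Assumption: OpenCLIP ViT-H/14; any model exposing a global encode_image() works the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
model.eval()

@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    """Global (image-level) embedding, L2-normalized."""
    feat = model.encode_image(preprocess(image).unsqueeze(0))
    return F.normalize(feat, dim=-1).squeeze(0)

@torch.no_grad()
def region_features(image: Image.Image, boxes: list):
    """Global feature fG for the whole image, plus one local feature fL per
    class-agnostic region proposal (boxes are (left, top, right, bottom) pixel coordinates)."""
    f_global = embed(image)                                        # fG
    f_local = torch.stack([embed(image.crop(b)) for b in boxes])   # fL, one per region
    return f_global, f_local

The fused per-region features are then assigned to the pixels covered by each region mask, as sketched in the next section.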


Zero-shot pixel alignment

For each image, the global (fG) and local (fL) features are fused to obtain our pixel-aligned features (fP). Top-left: We first compute the cosine similarity between each local feature (fL) and the global feature (fG). Top-right: We then build an inter-feature similarity matrix and compute the average similarity of each local feature to every other local feature, denoted φ̄i. Bottom-left: We combine these similarities to produce weights for fusing fG and fL into the pixel-aligned features fP.
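A minimal PyTorch sketch of one plausible instantiation of this fusion; the exact weighting used here (a two-way softmax with temperature between the local-to-global similarity and φ̄i) is our assumption, so treat it as illustrative rather than the precise formula from the paper.

import torch
import torch.nn.functional as F

def pixel_aligned_features(f_global: torch.Tensor,   # (D,)   global feature fG
                           f_local: torch.Tensor,    # (R, D) one local feature fL per region
                           temperature: float = 1.0) -> torch.Tensor:
    """Fuse fG and fL into per-region pixel-aligned features fP (illustrative weighting)."""
    f_g = F.normalize(f_global, dim=-1)
    f_l = F.normalize(f_local, dim=-1)

    sim_to_global = f_l @ f_g                 # cosine(fL_i, fG), shape (R,)
    pairwise = f_l @ f_l.T                    # inter-feature similarity matrix
    R = f_l.shape[0]
    phi_bar = (pairwise.sum(dim=1) - 1.0) / max(R - 1, 1)   # mean similarity to the *other* regions

    # w is the weight on the global feature: it grows when a region resembles the
    # whole image (high sim_to_global) relative to how much it resembles other regions (phi_bar).
    w = torch.softmax(torch.stack([sim_to_global, phi_bar], dim=0) / temperature, dim=0)[0]
    f_p = w.unsqueeze(1) * f_g + (1.0 - w).unsqueeze(1) * f_l
    return F.normalize(f_p, dim=-1)           # (R, D); broadcast to pixels via the region masks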

Retaining fine-grained concepts

Our approach to computing pixel-aligned features is adept at capturing long-tailed and fine-grained concepts. The plots to the right show the similarity scores between the embeddings of the cropped image regions corresponding to diet coke, lysol, and yogurt and the embeddings of the corresponding text, as predicted by the base CLIP models used by LSeg and OpenSeg respectively. This implies that the base CLIP models know these concepts; yet, as can be seen in the tiled plots (center), LSeg and OpenSeg are unable to retrieve them, having forgotten these concepts during finetuning. Our zero-shot pixel-alignment approach, on the other hand, does not suffer this drawback and clearly delineates the corresponding pixels.

UnCoCo dataset

Since no existing dataset evaluates long-tailed and multimodal reasoning abilities, we capture a set of 20 RGB-D sequences comprising 78 commonly found objects and annotate them with text, audio, click, and image queries. For each query, we also provide the corresponding ground-truth 2D and 3D retrieval results. This image showcases sample tabletop scenes from UnCoCo (left) and the resulting 3D reconstructions and labels (right).

3D spatial reasoning

What is the distance between the refrigerator and the television?

A key benefit of lifting foundation features to 3D is the ability to reason about spatial attributes. We implement a set of generic spatial relationship comparators that can be leveraged for querying arbitrary objects, and employ a large language model to parse queries into function calls that can be executed directly. For example, the query above parses to howFar(refrigerator, television).
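A minimal sketch of one such comparator, assuming each object name has already been scored against the map points (as in the text queries below); retrieve_points and scores_for are hypothetical helpers, not the released API.

import numpy as np

def retrieve_points(point_xyz: np.ndarray,     # (N, 3) map point coordinates
                    point_scores: np.ndarray,  # (N,) per-point similarity to the object's text query
                    top_frac: float = 0.01) -> np.ndarray:
    """Hypothetical helper: keep the map points responding most strongly to the object query."""
    k = max(1, int(top_frac * len(point_scores)))
    return point_xyz[np.argsort(point_scores)[-k:]]

def howFar(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Generic spatial comparator: distance (in map units) between the centroids
    of the two retrieved point sets."""
    return float(np.linalg.norm(points_a.mean(axis=0) - points_b.mean(axis=0)))

# The LLM parses the natural-language query into an executable call, e.g.:
#   howFar(retrieve_points(xyz, scores_for("refrigerator")),
#          retrieve_points(xyz, scores_for("television")))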

Long-form text queries

ConceptFusion is able to handle long-form text queries and accurately localize the objects they reference. In the first two scenarios, OpenSeg is distracted by the presence of several confounding attributes. The third scenario shows a single-word query (television) that is part of the COCO Captions dataset used to train OpenSeg, giving it an unfair advantage; ConceptFusion nonetheless accurately assigns the highest response to the map points representing the television. In each query, the referenced object is boldfaced.
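Under the hood, a text query scores every fused map point by cosine similarity against the query's text embedding. A minimal sketch, assuming the map features live in the OpenCLIP (ViT-H/14) embedding space; text_query_scores is an illustrative helper.

import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

@torch.no_grad()
def text_query_scores(map_features: torch.Tensor,  # (N, D) fused per-point features
                      query: str) -> torch.Tensor:
    """Score every 3D map point against a (possibly long-form) text query."""
    q = F.normalize(model.encode_text(tokenizer([query])), dim=-1)  # (1, D) query embedding
    return (F.normalize(map_features, dim=-1) @ q.T).squeeze(1)     # (N,) cosine similarities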


Click-queries

Click-queries over a sequence from the ICL dataset. For each clicked point, we compute the cosine similarity of the embedding at that point with that of every other map point and visualize the result using a 'jet' colormap: points in red indicate greatest similarity, while points in blue indicate least similarity. Notice the consistency in semantic concepts. For instance, when we click on a point on the corner lamp (at about 0:45), the other corner lamp, as well as the lights on the ceiling, are assigned high similarity.
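The sketch below mirrors this click-query visualization (cosine similarity to the clicked point, rendered with the 'jet' colormap); click_query_colors is an illustrative helper, and the per-point features are assumed to be L2-normalized.

import numpy as np
import matplotlib.pyplot as plt

def click_query_colors(map_features: np.ndarray,  # (N, D) L2-normalized per-point features
                       clicked_index: int):
    """Cosine similarity of the clicked point's embedding to every map point,
    mapped through 'jet' (red = most similar, blue = least similar)."""
    sims = map_features @ map_features[clicked_index]                # (N,) cosine similarities
    norm = (sims - sims.min()) / max(sims.max() - sims.min(), 1e-8)  # rescale to [0, 1]
    colors = plt.get_cmap("jet")(norm)[:, :3]                        # (N, 3) RGB for the point cloud
    return sims, colors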


Experiments on real robotic systems

Tabletop rearrangement

The robot is provided with rearrangement goals involving novel objects. (Top row) push goldfish to the right of the yellow line, where goldfish refers to the brand name of a pack of cheddar snacks. (Bottom row) push baymax to the right of the yellow line, where baymax refers to the plush toy depicting the famous Disney character.

Autonomous driving

(Left to right; top to bottom) The autonomous drive-by-wire platform deployed; point cloud map of the environment with the response to the open-set text query "football field" (shown in red); path found to the football field (shown in red); the car successfully navigates to the destination autonomously.

Integrating ConceptFusion with Large Language Models

Concurrent work

Given the pace of AI research these days, it is extremely challenging to keep up with all of the work around foundation models and open-set perception. We list below a few key approaches that we came across after beginning work on ConceptFusion. If we have inadvertently missed key concurrent work, please reach out to us over email (or better, open a pull request on our GitHub page).

CLIP-Fields encodes features from language and vision-language models into a compact, scene-specific neural network trained to predict feature embeddings from 3D point coordinates, enabling open-set visual understanding tasks.

OpenScene demonstrates that features from pixel-aligned 2D vision-language models can be distilled to 3D, generalize to new scenes, and perform better than their 2D counterparts.

Deng et al. demonstrate interesting ways of learning hierarchical scene abstractions by distilling features from 2D vision-language foundation models, and smart ways of interpreting captions from 2D captioning approaches.

Feature-realistic neural fusion demonstrates the integration of DINO features into a real-time neural mapping and positioning system.

Semantic Abstraction uses CLIP features to generate 3D features for reasoning over long-tailed categories, for scene completion and detecting occluded objects from language.

Say-Can demonstrates the applicability of large language models as task planners, and leverages a set of low-level skills to execute these plans in the real world. Also related to this line of work are VL-Maps, NLMap-SayCan, and CoWs, which demonstrate the benefits of having a map queryable via language.

Language embedded radiance fields (LERF) trains a NeRF that additionally encodes CLIP and DINO features for language-based concept retrieval.

3D concept learning from multi-view images (3D-CLR) introduces a dataset for 3D multi-view visual question answering, and proposes a concept learning framework that leverages pixel-aligned language embeddings from LSeg. They additionally train a set of neurosymbolic reasoning modules that loosely inspire our spatial query modules.

BibTeX

@article{conceptfusion,
  author    = {Jatavallabhula, {Krishna Murthy} and Kuwajerwala, Alihusein and Gu, Qiao and Omama, Mohd and Chen, Tao and Li, Shuang and Iyer, Ganesh and Saryazdi, Soroush and Keetha, Nikhil and Tewari, Ayush and Tenenbaum, {Joshua B.} and {de Melo}, {Celso Miguel} and Krishna, Madhava and Paull, Liam and Shkurti, Florian and Torralba, Antonio},
  title     = {ConceptFusion: Open-set Multimodal 3D Mapping},
  journal   = {Robotics: Science and Systems (RSS)},
  year      = {2023},
}