Selective Visual Representations Improve Convergence and Generalization for Embodied-AI

1University of Washington, 2Allen Institute for Artificial Intelligence

*Equal Contribution


ICLR 2024 [Spotlight]

ArXiv Code Slides



Inspired by selective attention in humans—the process through which people filter their perception based on their experiences, knowledge, and the task at hand—we introduce a parameter-efficient approach to filter visual stimuli for Embodied-AI.

Abstract


Embodied-AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans—the process through which people filter their perception based on their experiences, knowledge, and the task at hand—we introduce a parameter-efficient approach to filter visual stimuli for embodied-AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments showcase state-of-the-art performance for Object Goal Navigation and Object Displacement across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook also generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and that their representations retain task-relevant information like target object recognition while ignoring superfluous information about other objects.


Introductory video (5 min). English captions are available in the video settings.

The Codebook Module

A Filtering Mechanism of Visual Representations for Embodied-AI

Conventional embodied-AI frameworks usually employ general-purpose visual backbones like CLIP to extract visual representations from the input. Such representations capture an abundance of detail and a significant amount of task-irrelevant information. For example, to find a specific object in a house, the agent doesn't need to know about other distractor objects in its view, their colors, materials, attributes, etc. These distractions introduce unnecessary noise into the learning process, pulling the agent's focus away from more pertinent visual cues. We draw from the substantial body of research in cognitive psychology to induce selective task-specific representations that filter irrelevant sensory input and retain only the necessary stimuli.
We introduce a compact learnable module that decouples the two objectives of embodied-AI tasks across different parameters in the network:

  1. The input encoders and the codebook focus on extracting the task-salient information from the visual input.
  2. The policy (RNN and actor-critic heads) focuses on decision-making based on this filtered information.

The codebook acts as a task-conditioned bottleneck that filters out unnecessary information, allowing the agent to focus on more task-relevant visual cues, as sketched below.
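To make the mechanism concrete, below is a minimal PyTorch-style sketch of a task-conditioned codebook bottleneck. The module name, layer sizes, and the single-linear scoring function are illustrative assumptions rather than the exact configuration used in our released code.

import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    """Illustrative task-conditioned codebook bottleneck (sizes are assumptions)."""

    def __init__(self, embed_dim=1024, num_codes=256, code_dim=16):
        super().__init__()
        # A small set of learnable codes shared across all observations.
        self.codebook = nn.Parameter(torch.randn(num_codes, code_dim))
        # Scoring function: maps the fused (visual + goal) embedding to code scores.
        self.score = nn.Linear(embed_dim, num_codes)
        # Projects the bottlenecked code back to the dimension the policy expects.
        self.decode = nn.Linear(code_dim, embed_dim)

    def forward(self, task_conditioned_embedding):
        # Softmax over code scores acts as a selective filter over the codebook.
        probs = torch.softmax(self.score(task_conditioned_embedding), dim=-1)
        # Convex combination of codes: a low-dimensional, task-conditioned summary.
        bottleneck = probs @ self.codebook
        # The policy (RNN + actor-critic heads) consumes only this filtered signal.
        return self.decode(bottleneck)

Because the policy only ever sees the decoded combination of a few learnable codes, gradients from the task reward push the encoders and the codebook to retain exactly the information the task needs and to discard the rest.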


[Figure: Overview of the task-conditioned codebook module.]


Results

Codebook-Based Representations Improve Performance in Embodied-AI

Bottlenecking the task-conditioned embeddings using our codebook module results in significant improvements over the non-bottlenecked representations across a variety of embodied-AI benchmarks. We consider Object Goal Navigation (navigating to find a specific object category in a scene) and Object Displacement (bringing a source object to a destination object using a robotic arm) across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR.


Introducing New Metrics for Object Navigation. We present Curvature (\(\kappa\)), defined for an agent trajectory \((x(t), y(t))\) as

\[ \kappa = \frac{\dot{x}\,\ddot{y} - \dot{y}\,\ddot{x}}{\left(\dot{x}^2 + \dot{y}^2\right)^{3/2}}, \]

as a key evaluation metric comparing the smoothness of trajectories. Smoother trajectories are generally safer and more energy-efficient.
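As a rough illustration, the curvature of a discrete trajectory of agent positions can be estimated with finite differences; the snippet below is a sketch under that assumption and may differ from the exact discretization used for the reported numbers.

import numpy as np

def mean_abs_curvature(xy: np.ndarray) -> float:
    """Mean absolute curvature of an (N, 2) array of agent (x, y) positions."""
    x, y = xy[:, 0], xy[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)      # first derivatives (finite differences)
    ddx, ddy = np.gradient(dx), np.gradient(dy)  # second derivatives
    denom = (dx ** 2 + dy ** 2) ** 1.5
    k = np.zeros_like(denom)
    moving = denom > 1e-8                        # skip points where the agent is stationary
    k[moving] = (dx[moving] * ddy[moving] - dy[moving] * ddx[moving]) / denom[moving]
    return float(np.abs(k).mean())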

We further introduce Success Weighted by Episode Length (SEL),
\[ \text{SEL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{w_i}{\max(w_i, e_i)}, \]
as a substitute for Success Weighted by Path Length (SPL), where \(S_i\) is the binary success indicator for episode \(i\), \(e_i\) is the number of steps the agent took, and \(w_i\) is the number of steps in the shortest possible episode. SPL is computed from the distance traveled rather than the actual number of steps taken, whereas the efficiency of a path should also account for time and energy consumption, and therefore for actions such as rotations and looking up/down.
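The SEL computation itself is straightforward; here is a minimal sketch, assuming per-episode success flags and step counts are already logged:

import numpy as np

def success_weighted_by_episode_length(successes, optimal_steps, taken_steps) -> float:
    """SEL = (1/N) * sum_i S_i * w_i / max(w_i, e_i)."""
    s = np.asarray(successes, dtype=float)      # S_i: 1.0 if episode i succeeded, else 0.0
    w = np.asarray(optimal_steps, dtype=float)  # w_i: steps in the shortest possible episode
    e = np.asarray(taken_steps, dtype=float)    # e_i: steps the agent actually took
    return float(np.mean(s * w / np.maximum(w, e)))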

| Benchmark (Object Goal Navigation) | Model            | SR (%) ▲ | EL ▼   | Curvature ▼ | SEL ▲ |
|------------------------------------|------------------|----------|--------|-------------|-------|
| ProcTHOR-10k (test)                | EmbCLIP          | 67.70    | 182.00 | 0.58        | 36.00 |
| ProcTHOR-10k (test)                | +Codebook (Ours) | 73.72    | 136.00 | 0.23        | 43.69 |
| ArchitecTHOR (0-shot)              | EmbCLIP          | 55.80    | 222.00 | 0.49        | 20.57 |
| ArchitecTHOR (0-shot)              | +Codebook (Ours) | 58.33    | 174.00 | 0.20        | 28.31 |
| AI2-iTHOR (0-shot)                 | EmbCLIP          | 70.00    | 121.00 | 0.29        | 21.45 |
| AI2-iTHOR (0-shot)                 | +Codebook (Ours) | 78.40    | 86.00  | 0.16        | 26.76 |
| RoboTHOR (0-shot)                  | EmbCLIP          | 51.32    | -      | -           | -     |
| RoboTHOR (0-shot)                  | +Codebook (Ours) | 55.00    | -      | -           | -     |

SR: Success Rate; EL: Episode Length (number of steps); SEL: Success Weighted by Episode Length; ▲/▼: higher/lower is better.

| Benchmark (Object Displacement) | Model            | PU (%) ▲ | SR (%) ▲ |
|---------------------------------|------------------|----------|----------|
| ManipulaTHOR                    | m-VOLE           | 81.20    | 59.60    |
| ManipulaTHOR                    | +Codebook (Ours) | 86.00    | 65.10    |

PU: Pick-Up success rate; SR: Success Rate.


Codebook-Bottlenecked Embedding is Easier to Transfer to New Visual Domains

The codebook-based embedding transfers across new visual domains without exhaustive fine-tuning. Our codebook bottleneck effectively decouples the process of learning salient visual information for the task from the process of decision-making based on this filtered information. Consequently, when faced with a similar task in a new visual domain, the need for adaptation is significantly reduced: only the modules responsible for extracting essential visual cues in the new domain require fine-tuning, while the decision-making modules can remain fixed. We show that our ObjectNav agent trained in the AI2-THOR simulator can effectively adapt to the Habitat simulator (which differs substantially in visual characteristics, lighting, textures, and other environmental factors) by merely fine-tuning a lightweight Adaptation Module.



Lightweight Fine-tuning of the Adaptation Module. We only fine-tune a few CNN layers, the action and goal embedders, and the codebook scoring function when moving to a new visual domain.
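The sketch below shows one way this selective fine-tuning could be wired up in PyTorch. The attribute names (visual_adapter, goal_embedder, action_embedder, codebook.score) are hypothetical stand-ins for the corresponding modules, not identifiers from the released codebase.

def configure_for_new_domain(agent):
    """Freeze the agent and unfreeze only the lightweight adaptation modules (sketch)."""
    # Freeze everything by default: visual backbone, codebook entries, policy RNN, actor-critic heads.
    for p in agent.parameters():
        p.requires_grad = False

    # Unfreeze only the parts that adapt the visual input to the new domain.
    trainable_modules = [
        agent.visual_adapter,   # a few CNN layers on top of the frozen visual backbone
        agent.goal_embedder,    # goal embedder
        agent.action_embedder,  # previous-action embedder
        agent.codebook.score,   # the codebook scoring function
    ]
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True

    # Hand only the trainable parameters to the optimizer.
    return [p for p in agent.parameters() if p.requires_grad]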



The Codebook Encodes Only the Information Most Important to the Task

We conduct an analysis (through linear probing, GradCAM attention visualization, and nearest-neighbor retrieval) to explore the information encapsulated within our bottlenecked representations after training for the Object Goal Navigation task. The results show that our codebook-bottlenecked representations effectively exclude information about distracting visual cues and object categories other than the specified goal, concentrating on the target object and encoding better information about the goal's visibility and proximity to the agent.
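As an example, the linear-probing part of this analysis can be approximated as follows. The cached embeddings, the per-frame labels, and the scikit-learn probe are illustrative assumptions rather than our exact probing protocol.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe trained on frozen agent embeddings.

    embeddings: e.g., codebook-bottlenecked features cached during rollouts.
    labels: per-frame annotations such as whether the goal object is visible.
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return float(probe.score(x_te, y_te))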



GradCAM Attention Visualization. While the EmbCLIP ObjectNav agent is distracted by other objects and visual cues even when the target object is visible in the frame, the codebook module helps the agent effectively ignore such distractions and focus only on the object goal.




Nearest-Neighbor Retrieval in the Goal-Conditioned Embedding Space. The four examples show that EmbCLIP-Codebook prioritizes task semantics while EmbCLIP focuses on scene semantics. In the top row, our nearest neighbors are based on object goal visibility and goal proximity to the agent, whereas the EmbCLIP nearest neighbors are based on the overall semantics of the scene (tables on the left or toilets far away). In the bottom row, our nearest neighbors favor the overall scene layout, whereas EmbCLIP mostly focuses on colors and appearances.



Our Agent Explores More Effectively and Travels in Smoother Trajectories

We conduct a quantitative and qualitative analysis to compare the agents' behavior. The Curvature and Success Weighted by Episode Length (SEL) metrics show that our agent explores more effectively and travels along much smoother paths. Excessive rotations and sudden changes in direction can lead to increased energy consumption and a higher chance of collisions with other objects. The higher SEL achieved by our agent shows that it finds the target object in far fewer steps. The qualitative examples below show that the baseline agent performs many redundant rotations.


[Qualitative examples: three side-by-side trajectory visualizations (PNG top-down maps and GIF rollouts) comparing EmbCLIP-Codebook with the EmbCLIP baseline.]


Our agent explores the environment much more effectively and travels in much smoother trajectories, whereas the EmbCLIP baseline agent makes many redundant rotations.


BibTeX

@article{eftekhar2023selective,
  title={Selective Visual Representations Improve Convergence and Generalization for Embodied AI},
  author={Eftekhar, Ainaz and Zeng, Kuo-Hao and Duan, Jiafei and Farhadi, Ali and Kembhavi, Ani and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2311.04193},
  year={2023}
}