Embodied-AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is irrelevant to the specific task at hand. This introduces noise into the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans—the process through which people filter their perception based on their experiences, knowledge, and the task at hand—we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments show state-of-the-art performance for Object Goal Navigation and Object Displacement across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook also generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and that their representations retain task-relevant information like target object recognition while ignoring superfluous information about other objects.
Conventional embodied-AI frameworks typically employ general-purpose visual backbones like CLIP to extract visual representations from the input. Such representations capture an abundance of detail, including a significant amount of task-irrelevant information. For example, to find a specific object in a house, the agent does not need to know the colors, materials, or other attributes of the distractor objects in its view. This extraneous information injects unnecessary noise into the learning process and diverts the agent's focus from more pertinent visual cues.
We draw on the substantial body of research on selective attention in cognitive psychology to induce task-specific representations that filter irrelevant sensory input and retain only the necessary stimuli.
We introduce a compact learnable module that decouples the two objectives in embodied-AI tasks across different parameters in the network: learning which visual information is salient for the task, and learning how to act on that filtered information.
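To make the idea concrete, here is a minimal sketch of such a task-conditioned codebook bottleneck. This is an illustration rather than the exact architecture: the layer sizes, the way the goal is injected, and names like `CodebookBottleneck` are our own placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookBottleneck(nn.Module):
    """Task-conditioned bottleneck: the observation is summarized as a convex
    combination of a small set of learnable codes (illustrative sketch)."""

    def __init__(self, visual_dim=1024, goal_dim=32, num_codes=256, code_dim=10, out_dim=512):
        super().__init__()
        # The learnable codebook: num_codes vectors of size code_dim.
        self.codebook = nn.Parameter(torch.randn(num_codes, code_dim))
        # Scores each code given the goal-conditioned observation embedding.
        self.to_logits = nn.Linear(visual_dim + goal_dim, num_codes)
        # Projects the bottlenecked code back up for the downstream policy.
        self.to_policy = nn.Linear(code_dim, out_dim)

    def forward(self, visual_emb, goal_emb):
        # Condition the frozen-backbone features on the task (e.g., goal object category).
        x = torch.cat([visual_emb, goal_emb], dim=-1)
        attn = F.softmax(self.to_logits(x), dim=-1)   # (B, num_codes)
        filtered = attn @ self.codebook               # (B, code_dim): the bottleneck
        return self.to_policy(filtered)               # fed to the policy's state encoder
```

Because the module is trained end-to-end with the task reward, the codes are free to discard whatever the policy never needs, which is exactly the selective-filtering behavior we are after.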
Bottlenecking the task-conditioned embeddings with our codebook module yields significant improvements over the non-bottlenecked representations across a variety of embodied-AI benchmarks. We consider Object Goal Navigation (navigating to find a specified object category in a scene) and Object Displacement (bringing a source object to a destination object using a robotic arm) across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR.
Introducing New Metrics for Object Navigation. Beyond Success Rate (SR) and Episode Length (EL), we report Curvature (κ), which measures how smooth the agent's trajectory is (lower is better, indicating fewer abrupt turns and redundant rotations), and Success weighted by Episode Length (SEL), which rewards successful episodes completed in fewer steps (higher is better).
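For reference, the formulas below are one way to make these metrics precise; the notation is ours, and the paper's exact formulation may differ. We assume Curvature is the standard curvature of the agent's planar trajectory averaged over the episode, and that SEL follows the SPL template with episode length in place of path length.

```latex
% Curvature of the planar trajectory (x(t), y(t)); a discrete per-step estimate
% is averaged over the episode. Lower values mean smoother, straighter paths.
\kappa(t) = \frac{\left| \dot{x}(t)\,\ddot{y}(t) - \dot{y}(t)\,\ddot{x}(t) \right|}
                 {\left( \dot{x}(t)^2 + \dot{y}(t)^2 \right)^{3/2}}

% SEL over N evaluation episodes: S_i is binary success, L_i the number of
% steps taken, and L_i^* the optimal (shortest possible) number of steps.
\mathrm{SEL} = \frac{1}{N} \sum_{i=1}^{N} S_i \,
               \frac{L_i^{*}}{\max\left( L_i,\; L_i^{*} \right)}
```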
| Benchmark (Object Goal Navigation) | Model | SR (%) ▲ | EL ▼ | Curvature ▼ | SEL ▲ |
|---|---|---|---|---|---|
| ProcTHOR-10k (test) | EmbCLIP | 67.70 | 182.00 | 0.58 | 36.00 |
| | +Codebook (Ours) | 73.72 | 136.00 | 0.23 | 43.69 |
| ArchitecTHOR (0-shot) | EmbCLIP | 55.80 | 222.00 | 0.49 | 20.57 |
| | +Codebook (Ours) | 58.33 | 174.00 | 0.20 | 28.31 |
| AI2-iTHOR (0-shot) | EmbCLIP | 70.00 | 121.00 | 0.29 | 21.45 |
| | +Codebook (Ours) | 78.40 | 86.00 | 0.16 | 26.76 |
| RoboTHOR (0-shot) | EmbCLIP | 51.32 | - | - | - |
| | +Codebook (Ours) | 55.00 | - | - | - |
| Benchmark (Object Displacement) | Model | PU (%) ▲ | SR (%) ▲ |
|---|---|---|---|
| ManipulaTHOR | m-VOLE | 81.20 | 59.60 |
| | +Codebook (Ours) | 86.00 | 65.10 |
The codebook-based embedding transfers to new visual domains without exhaustive fine-tuning. Our codebook bottleneck effectively decouples learning the salient visual information for the task from making decisions based on this filtered information. Consequently, when faced with a similar task in a new visual domain, the need for adaptation is significantly reduced: only the modules responsible for extracting essential visual cues in the new domain require fine-tuning, while the decision-making modules can remain fixed. We show that our ObjectNav agent trained in the AI2-THOR simulator can effectively adapt to the Habitat simulator (which differs substantially in visual characteristics, lighting, textures, and other environmental factors) by fine-tuning only a lightweight Adaptation Module.
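A minimal sketch of this transfer recipe is shown below. It assumes the agent exposes an `adaptation_module` submodule sitting between the frozen CLIP backbone and the codebook; the attribute name and optimizer settings are placeholders, not the actual codebase API.

```python
import torch
import torch.nn as nn

def prepare_for_habitat_transfer(agent: nn.Module, lr: float = 3e-4) -> torch.optim.Optimizer:
    """Freeze the codebook and policy; fine-tune only the adaptation module.

    `agent` is assumed to expose an `adaptation_module` submodule inserted
    between the (frozen) CLIP backbone and the codebook bottleneck.
    """
    # Everything learned in the source simulator stays fixed: CLIP backbone,
    # codebook bottleneck, recurrent state encoder, and actor-critic heads.
    for p in agent.parameters():
        p.requires_grad = False

    # Only the lightweight adapter is updated, so it learns to map Habitat
    # observations into the feature distribution the codebook already expects.
    for p in agent.adaptation_module.parameters():
        p.requires_grad = True

    trainable = [p for p in agent.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```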
We conduct an analysis (through linear probing, GradCAM attention visualization, and nearest-neighbor retrieval) to explore the information encapsulated in our bottlenecked representations after training on the Object Goal Navigation task. The results show that our codebook-bottlenecked representations exclude information about distracting visual cues and object categories other than the specified goal, concentrating on the target object and encoding richer information about its visibility and proximity to the agent.
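As an illustration of the probing setup, the sketch below trains a generic linear probe on frozen bottleneck features to predict a property such as goal-object visibility. The probe target and class count are placeholders; this is not the paper's exact protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbe(nn.Module):
    """Logistic-regression probe on frozen bottleneck features."""

    def __init__(self, feat_dim: int, num_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Features come from the frozen agent, so gradients stop here.
        return self.head(feats.detach())

def probe_step(probe, optimizer, feats, labels):
    """One supervised update; high held-out accuracy indicates the frozen
    representation linearly encodes the probed property (e.g., goal visibility)."""
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```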
We conduct a quantitative and qualitative analysis to compare agent behavior. The Curvature and Success weighted by Episode Length (SEL) metrics show that our agent explores more effectively and travels along much smoother paths; excessive rotations and sudden changes in direction increase energy consumption and the chance of collisions with other objects. The higher SEL achieved by our agent shows that it finds the target object in far fewer steps. The qualitative examples below show that the baseline agent performs many redundant rotations.
[Qualitative examples: side-by-side episode rollouts of EmbCLIP-Codebook (Ours) and the EmbCLIP baseline.]
@article{eftekhar2023selective,
title={Selective Visual Representations Improve Convergence and Generalization for Embodied AI},
author={Eftekhar, Ainaz and Zeng, Kuo-Hao and Duan, Jiafei and Farhadi, Ali and Kembhavi, Ani and Krishna, Ranjay},
journal={arXiv preprint arXiv:2311.04193},
year={2023}
}