Improving Zero-Shot ObjectNav via Generative Communication

University of Maryland, College Park
*Equal contribution

Abstract

We propose a new method for improving Zero-Shot ObjectNav that utilizes potentially available environmental percepts. Our approach accounts for the ground agent's limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent, which has a global view containing the target object, and the ground agent, which has an obfuscated view; both are equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents exchange environmental information before the ground agent executes actions towards the target. Despite the overhead agent having a global view containing the target, we observe a 13% drop in performance for a fully cooperative assistance scheme relative to an unassisted baseline. In contrast, a selective assistance scheme in which the ground agent retains its independent exploratory behaviour shows a 10% improvement.



Overview: We tackle zero-shot ObjectNav in an assisted setup, where the ground agent aims to improve performance by seeking assistance from other available environmental percepts. We consider an overhead agent (as shown) with a clear view of the target and a ground agent with an obstructed view of the target that convey environmental information to each other via freeform, unconstrained Generative Communication (GC). We use GC to develop two novel assisted navigation schemes and present results in both simulated and real-world environments, inferring that GC is useful only in a selective setup where the ground agent retains its independent exploration capability.


Approach

We study the zero-shot ObjectNav task in an assisted setup where a dynamic ground agent communicates with a static overhead agent in order to improve its ObjectNav performance. Using its expansive view of the environment, the overhead agent is expected to guide the ground agent towards regions with a high likelihood of containing the target object, while the ground agent reports details of its own location and surroundings to give the overhead agent better context. Effective communication is essential for meaningful collaboration here: complex perceptual cues must be translated into a shared, sufficiently descriptive language from which precise navigational actions can be derived. We therefore equip both the ground and overhead agents with VLMs and leverage their strong translational capabilities to perform vision-to-language translation.
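As a concrete illustration of this vision-to-language step, the sketch below queries a GPT-4V-style model with an agent's current frame and asks for a navigation-relevant description. It is a minimal sketch rather than our exact implementation: the model name, prompt wording, and the describe_view helper are illustrative, and it assumes the openai Python client with an OPENAI_API_KEY set in the environment.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_view(image_path: str, role: str, target: str) -> str:
    """Translate an agent's RGB view into a navigation-relevant description."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable chat model (e.g., GPT-4V) can be used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"You are the {role} agent searching for a {target}. "
                         "Describe the visible objects, obstacles, and free space "
                         "relevant for reaching the target."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=200,
    )
    return response.choices[0].message.content
```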



We consider 3 different setups for assisted ObjectNav on a ground agent (GA) using an overhead agent (OA). The No Comm. case (brown arrows), illustrated on the left, is a baseline ObjectNav setup where the GA's VLM is prompted directly for navigation actions. For the remaining two cases, both agents first go through a Comm. phase (𝒸) for a fixed number of interactions CLen, after which we summarize the dialogue for decision-making. In the Cooperative Action case (blue arrows), we pass the Generative Communication (GC) summary to an LLM that predicts an action for the GA. In the Selective Execution case (green arrows), the GA's VLM is prompted with the suggested action and asked whether it wants to cooperate with the LLM prediction; if not, it performs independent exploration as in the No Comm. case. We later analyze the generated dialogue to measure generative communication traits.
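The control flow of the two assisted schemes can be summarized as below. This is a hypothetical sketch of the structure rather than our exact implementation: summarize, propose_action, ga_accepts, and independent_explore are stand-ins for the LLM/VLM calls described above (stubbed here so the snippet runs), and the dialogue generation is simplified to one description per agent per round.

```python
# Stand-ins for the LLM/VLM queries described above (stubbed so this runs).
def summarize(dialogue):                     # in practice: a text-only LLM call
    return " ".join(msg for _, msg in dialogue)

def propose_action(summary, target):         # in practice: an LLM call on the GC summary
    return "MoveAhead"

def ga_accepts(ga_desc, suggested, target):  # in practice: a yes/no query to the GA's VLM
    return target.lower() in ga_desc.lower()

def independent_explore(ga_desc, target):    # in practice: the No Comm. exploration policy
    return "RotateLeft"

def assisted_step(ga_desc, oa_desc, target, c_len, scheme="selective"):
    """One decision step for the Cooperative Action and Selective Execution schemes."""
    # Communication phase: CLen rounds of free-form GC between OA and GA.
    # (In the full system each turn re-queries a VLM conditioned on the dialogue so far.)
    dialogue = []
    for _ in range(c_len):
        dialogue.append(("OA", oa_desc))   # OA describes its global view of the scene
        dialogue.append(("GA", ga_desc))   # GA reports its local, possibly obstructed view
    summary = summarize(dialogue)
    suggested = propose_action(summary, target)
    if scheme == "cooperative":
        return suggested                   # GA always executes the LLM-predicted action
    if ga_accepts(ga_desc, suggested, target):
        return suggested                   # GA chooses to cooperate with the suggestion
    return independent_explore(ga_desc, target)  # otherwise keep exploring independently
```

The selective branch is what preserves the GA's independent exploratory behaviour, which the results below show is critical.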

Simulation Experiments

We run experiments in the RoboTHOR simulator using a LoCoBot as the ground agent and add a third-party camera as the overhead agent. Both agents use GPT-4V as their VLM. We run experiments on 100 house environments from ProcTHOR. The following image shows an example:

Example views: OA View (overhead) and GA View (ground).
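For reference, a minimal AI2-THOR/RoboTHOR setup along the lines of our simulation experiments is sketched below. The scene name and camera pose are placeholder values (our experiments load ProcTHOR house environments in the same controller), so treat this as an assumed configuration rather than our exact one.

```python
from ai2thor.controller import Controller

# LoCoBot ground agent in a RoboTHOR-style scene (placeholder scene name;
# our experiments use ProcTHOR house environments instead).
controller = Controller(
    agentMode="locobot",
    scene="FloorPlan_Train1_1",
    gridSize=0.25,
    width=640,
    height=480,
)

# Overhead agent: a third-party camera looking down at the scene.
# The pose below is a placeholder; in practice it is chosen per environment.
controller.step(
    action="AddThirdPartyCamera",
    position=dict(x=0.0, y=2.5, z=0.0),
    rotation=dict(x=90, y=0, z=0),
    fieldOfView=90,
)

ga_frame = controller.last_event.frame                          # GA egocentric RGB
oa_frame = controller.last_event.third_party_camera_frames[0]   # OA overhead RGB

controller.step(action="MoveAhead")   # example GA navigation action
```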


We summarize our findings in the following table:


| Comm. Length (CLen) | Execution          | OSR % ↑       | SPL % ↑      | PE % ↓ | GO % ↓ | CR %  | DS % |
|---------------------|--------------------|---------------|--------------|--------|--------|-------|------|
| 0                   | Random (No VLM)    | 15.00         | 11.69        | N/A    | N/A    | N/A   | N/A  |
| 0                   | No Comm. Baseline  | 22.00         | 21.50        | N/A    | N/A    | N/A   | N/A  |
| 1                   | Cooperative Action | 19.00 (-3.0)  | 18.38 (-3.1) | 48.20  | 34.55  | 100   | 91.5 |
| 1                   | Selective Action   | 26.00 (+4.0)  | 24.13 (+2.6) | 34.55  | 42.51  | 23.20 | 92.2 |
| 3                   | Cooperative Action | 19.00 (-3.0)  | 17.94 (-3.5) | 77.90  | 32.73  | 100   | 91.6 |
| 3                   | Selective Action   | 24.00 (+2.0)  | 22.19 (+0.6) | 78.00  | 34.55  | 22.80 | 91.6 |
| 5                   | Cooperative Action | 9.00 (-13.0)  | 8.50 (-13.0) | 81.60  | 32.73  | 100   | 92.4 |
| 5                   | Selective Action   | 32.00 (+10.0) | 29.15 (+8.0) | 80.70  | 32.73  | 18.90 | 92.1 |

OSR and SPL are navigation metrics; PE, GO, CR, and DS are generative communication traits. Differences in parentheses are relative to the No Comm. Baseline.

where,

  • OSR: Oracle Success Rate
  • SPL: Success weighted by Path Length (a minimal computation sketch follows this list)
  • PE: Preemptive Action Hallucination
  • GO: Ghost Object Hallucination
  • CR: Cooperation Rate
  • DS: Dialogue Similarity
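For completeness, the sketch below shows how the two navigation metrics could be computed from per-episode statistics. SPL follows the standard definition of Anderson et al., and the OSR helper assumes the usual oracle-stop convention; the variable names are illustrative.

```python
def oracle_success_rate(oracle_successes):
    """OSR: percentage of episodes in which the agent ever came within the
    success distance of the target (i.e., would have succeeded had it stopped)."""
    return 100.0 * sum(oracle_successes) / len(oracle_successes)

def spl(successes, shortest_path_lengths, agent_path_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), reported as a percentage."""
    total = 0.0
    for s, l, p in zip(successes, shortest_path_lengths, agent_path_lengths):
        total += s * l / max(p, l)
    return 100.0 * total / len(successes)
```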


Real-world Experiments

We run real-world experiments in our lab using a Turtlebot2 as the ground agent and add a GoPro Hero 7 camera as the overhead agent. We test our setup with different object arrangements and employ prompt finetuning to address challenges originating from environmental conditions.
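As an illustration of the kind of prompt refinement involved (the strings below are hypothetical and not our deployed prompts), real-world conditions such as poor lighting, clutter, and the GoPro's wide-angle distortion can be handled by appending environment-specific guidance to the base prompt:

```python
# Hypothetical example of environment-aware prompt refinement; the exact
# wording of the prompts used in our experiments differs.
BASE_PROMPT = (
    "You are a ground robot searching for a {target}. "
    "Describe what you see and suggest the next navigation action."
)
REAL_WORLD_SUFFIX = (
    " The image may be poorly lit, cluttered, or distorted by a wide-angle lens. "
    "Only mention objects you are confident about, and answer 'unsure' rather "
    "than guessing when the view is ambiguous."
)

def build_prompt(target: str, real_world: bool = True) -> str:
    prompt = BASE_PROMPT.format(target=target)
    return prompt + REAL_WORLD_SUFFIX if real_world else prompt
```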



Real-world Results: We carry out a real-world experiment with a Turtlebot as the Ground Agent (GA) and a roof-mounted GoPro camera as the Overhead Agent (OA) in various environment settings. Note the incorrect action taken in the cooperative execution case (red arrows) compared to the selective case (green arrows); predicted actions are shown in yellow. In our extended manuscript on arXiv, we discuss the various hallucinations we encounter in different environment settings and how we finetune VLM prompts for better results. We show our approach in action in this video.

Citation

@article{dorbala2024generative,
  title={Improving Zero-Shot ObjectNav with Generative Communication},
  author={Dorbala, Vishnu Sashank and Sharma, Vishnu Dutt and Tokekar, Pratap and Manocha, Dinesh},
  journal={arXiv preprint arXiv:2408.01877},
  year={2024}
}