Overview
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes. We propose an object-centric image representation and corresponding losses for vision-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates to stabilize training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. Our method improves text-to-image recall by 1.38 - 13.38% across different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with improvements of 5% and 16.67% in navigation success rate, respectively.
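As a rough illustration of object-centric text-to-image retrieval (not the fine-tuned model or losses used in LOC-ZSON), the sketch below crops object regions from an image, embeds each crop with an off-the-shelf CLIP checkpoint, and ranks the crops against a language query; the detection boxes and model checkpoint are assumptions made for the example.

```python
# Minimal sketch of object-centric text-to-image retrieval with an
# off-the-shelf CLIP model (not the fine-tuned VLM or losses from LOC-ZSON).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_object_crops(image: Image.Image, boxes, query: str):
    """Rank detected object crops by similarity to a free-form query.

    `boxes` is a list of (left, top, right, bottom) tuples, e.g. from any
    object detector; the detector itself is outside the scope of this sketch.
    """
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=[query], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_crops); higher means a better match.
    scores = outputs.logits_per_text[0]
    order = torch.argsort(scores, descending=True)
    return [(boxes[int(i)], float(scores[i])) for i in order]

# Example usage with hypothetical detections on a single RGB frame:
# frame = Image.open("office.jpg")
# print(rank_object_crops(frame, [(0, 0, 200, 200), (200, 0, 400, 200)],
#                         "the blue mug on the desk"))
```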
Video
Paper
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation.
Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha
CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation
Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, vision-language models like CLIP have shown strong performance on zero-shot object recognition. In this work, we ask whether these models are also capable of zero-shot language grounding. In particular, we utilize CLIP to tackle the novel problem of zero-shot VLN using natural language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes. We examine CLIP's capability in making sequential navigational decisions without any dataset-specific finetuning, and study how it influences the path that an agent takes. Our results on the coarse-grained instruction following task of REVERIE demonstrate the navigational capability of CLIP, surpassing the supervised baseline in terms of both success rate (SR) and success weighted by path length (SPL). More importantly, we quantitatively show that our CLIP-based zero-shot approach generalizes better, showing more consistent performance across environments than SOTA fully supervised learning approaches when evaluated via Relative Change in Success (RCS).
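A single zero-shot decision step of this kind can be pictured as below; this is only a sketch that assumes CLIP image and text embeddings are already computed, and it omits CLIP-Nav's instruction splitting and backtracking.

```python
# Minimal sketch of one zero-shot navigation decision: pick the navigable view
# whose (precomputed) CLIP image embedding best matches the CLIP text embedding
# of a referring expression. How the embeddings are produced is left open.
import torch
import torch.nn.functional as F

def choose_direction(view_embeddings: torch.Tensor,
                     text_embedding: torch.Tensor) -> int:
    """view_embeddings: (num_views, d) CLIP image features.
    text_embedding: (d,) CLIP text feature for the referring expression."""
    sims = F.cosine_similarity(view_embeddings,
                               text_embedding.unsqueeze(0), dim=-1)
    return int(torch.argmax(sims))
```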
Paper
CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation.
Vishnu Sashank Dorbala, Gunnar A Sigurdsson, Jesse Thomason, Robinson Piramuthu, Gaurav S Sukhatme
LGX: Can an Embodied Agent Find Your Cat-shaped Mug? LLM-Based Zero-Shot Object Navigation
We present LGX, a novel algorithm for Object Goal Navigation in a language-driven, zero-shot manner, where an embodied agent navigates to an arbitrarily described target object in a previously unexplored environment. Our approach leverages the capabilities of Large Language Models (LLMs) for making navigational decisions by mapping the LLM's implicit knowledge about the semantic context of the environment into sequential inputs for robot motion planning. We conduct experiments in both simulated and real-world environments, and showcase factors that influence the decision-making capabilities of LLMs for zero-shot navigation.
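The sketch below gives a flavor of how scene semantics can be turned into an LLM navigation decision; it is an illustration rather than the LGX algorithm itself, and `query_llm` is a hypothetical helper standing in for whichever LLM backend is available.

```python
# Rough sketch of mapping scene semantics to an LLM navigation decision.
# `query_llm` is a hypothetical LLM helper; the per-direction object lists
# would come from an object detector running on the agent's views.
from typing import Dict, List

def build_prompt(objects_by_direction: Dict[str, List[str]], target: str) -> str:
    lines = [f"You are helping a robot find: {target}.",
             "Visible objects in each direction:"]
    for direction, objects in objects_by_direction.items():
        lines.append(f"- {direction}: {', '.join(objects) or 'nothing notable'}")
    lines.append("Reply with exactly one direction name that is most likely "
                 "to lead to the target.")
    return "\n".join(lines)

def next_direction(objects_by_direction, target, query_llm) -> str:
    reply = query_llm(build_prompt(objects_by_direction, target)).strip().lower()
    # Fall back to the first direction if the reply cannot be parsed.
    for direction in objects_by_direction:
        if direction in reply:
            return direction
    return next(iter(objects_by_direction))

# Example with a stubbed LLM:
# print(next_direction({"left": ["sink", "kettle"], "right": ["sofa", "tv"]},
#                      "cat-shaped mug", lambda p: "left"))
```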
Video
Paper
Can an Embodied Agent Find Your Cat-shaped Mug? LLM-Based Zero-Shot Object Navigation.
Vishnu Sashank Dorbala, James F. Mullen Jr., Dinesh Manocha
Code
Code can be found here.
S-EQA: Tackling Subjective Queries in Embodied Question Answering
We present and tackle the problem of Embodied Question Answering (EQA) with Subjective Queries in a household environment. In contrast to simple queries like “What is the color of the sofa?” that are easily answerable by interpreting the object properties in the environment, subjective queries such as “Is the bathroom clean and dry?” are not straightforward to answer, relying on a consensus about the states of multiple household objects. To tackle EQA with such queries, we first introduce S-EQA, a dataset of 2000 subjective queries and their associated object consensuses. S-EQA is generated using Large Language Models (LLMs), employing various in-context learning cues to ensure that the queries are neither too simple nor too ambiguous. We validate S-EQA via a large-scale user survey, which supports both the existence of and the need for a consensus in answering subjective queries. The survey also allows us to gauge the LLM's generation capabilities and the usability of such a dataset in real-world scenarios. Finally, we quantitatively evaluate this dataset on VirtualHome, a simulation platform for household environments. Performing VQA at the room and object levels, we obtain accuracies of 42.31% and 58.26%, respectively, setting a benchmark for objectively evaluating S-EQA in VirtualHome. To the best of our knowledge, this is the first work to introduce EQA with subjective queries, setting a new benchmark for the usability of embodied agents in real-world household environments. We will release S-EQA for public use.
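To make the generation setup concrete, the sketch below assembles an in-context prompt that asks an LLM for a subjective query plus the objects whose states form its consensus. The few-shot examples and wording here are illustrative assumptions, not the prompts used to build S-EQA.

```python
# Illustrative in-context prompt for generating a subjective household query
# together with its consensus objects. The examples below are made up.
FEW_SHOT = [
    {"query": "Is the bathroom clean and dry?",
     "consensus_objects": ["towel", "floor", "sink", "bathtub"]},
    {"query": "Is the kitchen ready for cooking dinner?",
     "consensus_objects": ["stove", "counter", "dishes", "fridge"]},
]

def generation_prompt(room: str) -> str:
    lines = ["Generate a subjective household query that cannot be answered by "
             "inspecting a single object, plus the objects whose states a "
             "person would check to answer it. Examples:"]
    for ex in FEW_SHOT:
        lines.append(f"Query: {ex['query']} | "
                     f"Objects: {', '.join(ex['consensus_objects'])}")
    lines.append(f"Now generate one such query for the {room}.")
    return "\n".join(lines)

# print(generation_prompt("living room"))  # feed this to any LLM of choice
```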
Video
Paper
S-EQA: Tackling Subjective Queries in Embodied Question Answering.
Vishnu Sashank Dorbala, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha and Reza Ghanadan
Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis
We present a novel approach to automatically synthesize “wayfinding instructions” for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. We finally discuss the applicability of our approach in enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating “human-like” instructions in a platform-agnostic manner, without training.
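The overall pipeline shape can be sketched as follows; `vqa` and `query_llm` are hypothetical stand-ins for a VQA model and an LLM, and the prompt wording is an assumption rather than the one used in the paper.

```python
# Loose sketch of the pipeline: gather landmark details along a path via VQA,
# then condition an LLM on a few reference instructions to synthesize a new one.
from typing import Callable, List

def synthesize_instruction(path_views: List, references: List[str],
                           vqa: Callable, query_llm: Callable) -> str:
    landmarks = [vqa(view, "What is the most distinctive object or landmark "
                           "visible here?") for view in path_views]
    prompt = ("Reference wayfinding instructions:\n"
              + "\n".join(f"- {r}" for r in references)
              + "\nLandmarks seen in order along the new path: "
              + ", ".join(landmarks)
              + "\nWrite one natural, human-like instruction for this path.")
    return query_llm(prompt)
```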
Paper
Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis.
Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha
Right Place, Right Time! Towards ObjectNav for Non-Stationary Goals
We present a novel approach to tackle the ObjectNav task for non-stationary and potentially occluded targets in an indoor environment. We refer to this task as Portable ObjectNav (or P-ObjectNav), and in this work present its formulation, feasibility, and a navigation benchmark using a novel memory-enhanced LLM-based policy. In contrast to ObjectNav, where target object locations are fixed for each episode, P-ObjectNav tackles the challenging case where the target objects move during the episode. This adds a layer of time-sensitivity to navigation, and is particularly relevant in scenarios where the agent needs to find portable targets (e.g., misplaced wallets) in human-centric environments. The agent needs to estimate not just the correct location of the target, but also the time at which the target is at that location for visual grounding, raising the question of the feasibility of the task. We address this concern by inferring results on two cases of object placement: one where the placed objects follow a routine or a path, and one where they are placed at random. We dynamize Matterport3D for these experiments, and modify PPO and LLM-based navigation policies for evaluation. Using PPO, we observe that agent performance stagnates in the random case, while the agent in the routine-following environment continues to improve, allowing us to infer that P-ObjectNav is solvable in environments with routine-following object placement. Using memory enhancement on an LLM-based policy, we set a benchmark for P-ObjectNav. Our memory-enhanced agent significantly outperforms its non-memory-based counterparts across object placement scenarios, by 71.76% and 74.68% on average when measured by Success Rate (SR) and Success Rate weighted by Path Length (SRPL), showing the influence of memory on improving P-ObjectNav performance. Our code and dataset will be made publicly available.
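The memory idea can be pictured with the small sketch below: log where and when objects were observed, and surface that log to the navigation policy (for example, inside an LLM prompt). This is an illustration of the concept, not the benchmarked policy itself.

```python
# Minimal sketch of an observation memory for non-stationary targets.
from collections import deque

class ObservationMemory:
    def __init__(self, max_entries: int = 50):
        self.entries = deque(maxlen=max_entries)

    def record(self, timestep: int, room: str, objects):
        # Store one (time, place, object) triple per observed object.
        for obj in objects:
            self.entries.append((timestep, room, obj))

    def summarize(self, target: str) -> str:
        # Return a one-line summary suitable for inclusion in a policy prompt.
        hits = [e for e in self.entries if e[2] == target]
        if not hits:
            return f"{target} has not been observed yet."
        t, room, _ = hits[-1]
        return f"{target} was last seen in the {room} at timestep {t}."

# memory = ObservationMemory()
# memory.record(3, "kitchen", ["wallet", "mug"])
# print(memory.summarize("wallet"))  # -> feed this line to the policy prompt
```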
Paper
Right Place, Right Time! Towards ObjectNav for Non-Stationary Goals.
Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, Dinesh Manocha
Improving Zero-Shot ObjectNav with Generative Communication
We propose a new method for improving Zero-Shot ObjectNav that aims to utilize potentially available environmental percepts. Our approach accounts for the fact that the ground agent may have a limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent with a global view containing the target object and a ground agent with an obfuscated view, both equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents communicate environmental information before the ground agent executes actions towards the target. Despite the overhead agent having a global view that contains the target, we note a 13% drop in performance for a fully cooperative assistance scheme relative to an unassisted baseline. In contrast, a selective assistance scheme, in which the ground agent retains its independent exploratory behaviour, shows a 10% improvement.
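One communication round can be sketched as below; `caption` and `ground_policy` are hypothetical stand-ins for the VLMs and the navigation policy, and the selective-assistance rule shown is a simplified illustration rather than the exact scheme evaluated in the paper.

```python
# Simplified sketch of a generative-communication step: the overhead agent
# captions its global view, and under selective assistance the ground agent
# only consults that message when its own view does not mention the target.
def navigation_step(ground_view, overhead_view, target, caption, ground_policy):
    ground_desc = caption(ground_view)
    overhead_msg = caption(overhead_view)          # generative communication
    if target.lower() in ground_desc.lower():
        context = ground_desc                      # keep independent exploration
    else:
        context = ground_desc + "\nOverhead agent reports: " + overhead_msg
    return ground_policy(context, target)          # e.g. returns an action string
```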
Paper
Improving Zero-Shot ObjectNav with Generative Communication.
Vishnu Sashank Dorbala, Vishnu Dutt Sharma, Pratap Tokekar, Dinesh Manocha