Abstract
We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene compared to other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the Semantic-KITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the Semantic-KITTI test set, with a notably low training memory consumption of 10.9 GB, achieving at least a 25% reduction compared to previous methods.
Paper
ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera
Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha