Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction alone. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. The STL module addresses label noise and inconsistency in the 2D predictions by refining the pseudo labels across multi-view inputs, producing a coherent 3D representation that improves segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy and, for the first time, achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video.
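As a rough, illustrative sketch of the lifting idea (not the paper's actual STL formulation), the snippet below fuses noisy per-frame 2D panoptic labels into per-Gaussian labels by cross-view majority voting; the function name `lift_panoptic_labels` and the assumed pixel-to-Gaussian assignment produced by the rasterizer are hypothetical stand-ins.

```python
import numpy as np

def lift_panoptic_labels(views, num_gaussians, num_classes):
    """Fuse per-view 2D panoptic predictions into per-Gaussian labels by
    cross-view majority voting (a simplified stand-in for spatial-temporal
    lifting, not the method described in the paper).

    views: list of (gaussian_ids, labels) pairs, one per frame, where
        gaussian_ids[h, w] is the Gaussian assumed responsible for pixel
        (h, w) (hypothetical rasterizer output) and labels[h, w] is the
        2D panoptic class predicted by the vision model for that pixel.
    """
    votes = np.zeros((num_gaussians, num_classes), dtype=np.int64)
    for gaussian_ids, labels in views:
        # Accumulate one vote per pixel for (its Gaussian, its 2D label).
        np.add.at(votes, (gaussian_ids.ravel(), labels.ravel()), 1)
    # Each Gaussian takes the class most consistent across all views,
    # suppressing frame-level noise in the 2D predictions.
    return votes.argmax(axis=1)

# Toy usage: three views whose 2D labels disagree on one pixel.
ids0 = np.array([[0, 1], [2, 3]]); lab0 = np.array([[5, 5], [2, 2]])
ids1 = np.array([[0, 1], [2, 3]]); lab1 = np.array([[5, 7], [2, 2]])
ids2 = np.array([[0, 1], [2, 3]]); lab2 = np.array([[5, 5], [2, 2]])
print(lift_panoptic_labels([(ids0, lab0), (ids1, lab1), (ids2, lab2)],
                           num_gaussians=4, num_classes=8))
# -> [5 5 2 2]: the inconsistent label 7 on Gaussian 1 is voted out.
```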