This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of their generated images still lags behind state-of-the-art 2D generation approaches, which excel at producing high-quality, detailed images. To address this limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, coined F3D-Gaus, which produces more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-consistent constraint that enforces cross-view consistency in the learned 3D representation. This training strategy naturally allows the aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model's ability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.
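To make the cycle-consistent constraint concrete, the following is a minimal sketch of one possible form of the self-supervised cycle, not the paper's actual implementation: the `predictor` (image to pixel-aligned Gaussians), the `renderer` (splatting under a camera pose), and the L1 photometric penalty are all illustrative assumptions introduced here for exposition.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(predictor, renderer, image_a, pose_a, pose_b):
    """Hypothetical sketch of a self-supervised cycle-consistent constraint.

    `predictor` maps a single image to pixel-aligned Gaussian primitives;
    `renderer` splats those primitives under a given camera pose. Both are
    placeholder callables standing in for the pipeline's actual modules.
    """
    # Forward pass: predict Gaussians from the real view A and render a novel view B.
    gaussians_a = predictor(image_a)            # pixel-aligned Gaussians from view A
    image_b = renderer(gaussians_a, pose_b)     # synthesized novel view B

    # Cycle pass: treat the synthesized view B as input and render back to view A.
    gaussians_b = predictor(image_b)
    image_a_cycle = renderer(gaussians_b, pose_a)

    # Penalize the discrepancy between the original view and its cycle reconstruction,
    # encouraging texture and geometry to stay consistent across viewpoints.
    return F.l1_loss(image_a_cycle, image_a)
```

Because the cycle produces aligned Gaussian predictions from multiple viewpoints, such a constraint also suggests how several pixel-aligned Gaussian sets could be aggregated into a single representation, as stated in the abstract.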