Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
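To make the cross-view and cross-attribute attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: it assumes the decomposed splat representation is a token grid of shape (batch, views, attributes, tokens, dim), and the module name AxisAttention and the axis convention are illustrative choices. Cross-view attention mixes tokens across the view axis for each attribute map, and cross-attribute attention mixes across the attribute axis for each view, which is how 3D consistency could be enforced between the generated attribute images.

```python
# Minimal sketch (assumptions, not the paper's code) of attention applied
# along the view axis and along the attribute axis of a decomposed splat grid.
import torch
import torch.nn as nn


class AxisAttention(nn.Module):
    """Self-attention along one chosen axis of a (B, V, A, T, D) token grid."""

    def __init__(self, dim: int, heads: int, axis: int):
        super().__init__()
        self.axis = axis  # 1 = cross-view, 2 = cross-attribute (assumed layout)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_views, n_attrs, n_tokens, dim)
        b, v, a, t, d = x.shape
        if self.axis == 1:   # attend across views for each attribute/token slot
            seq = x.permute(0, 2, 3, 1, 4).reshape(b * a * t, v, d)
        else:                # attend across attributes for each view/token slot
            seq = x.permute(0, 1, 3, 2, 4).reshape(b * v * t, a, d)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        seq = seq + out      # residual connection
        if self.axis == 1:
            return seq.reshape(b, a, t, v, d).permute(0, 3, 1, 2, 4)
        return seq.reshape(b, v, t, a, d).permute(0, 1, 3, 2, 4)


if __name__ == "__main__":
    # 6 views, 3 attribute maps per view, 256 tokens each (all illustrative sizes)
    x = torch.randn(2, 6, 3, 256, 64)
    block = nn.Sequential(AxisAttention(64, 4, axis=1),   # cross-view
                          AxisAttention(64, 4, axis=2))   # cross-attribute
    print(block(x).shape)  # torch.Size([2, 6, 3, 256, 64])
```

Factoring the attention into two cheap axis-wise passes, rather than attending jointly over all views and attributes at once, is one plausible way such layers can be slotted into a pretrained 2D diffusion backbone without disturbing its per-image attention.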