Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A is a framework that couples a pretrained text-to-video generator with a feedforward 3D reconstruction network via model stitching and aligns the combined model with direct reward finetuning, enabling high-quality text-to-3D and text-to-pointmap generation that surpasses existing Gaussian-splat-based approaches.