MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling
MoSA is a framework that decouples human video generation into structure and appearance components. It uses a 3D structure transformer and specialized constraints to improve motion coherence and produce more realistic human-environment interactions than existing models.