Grounding Synthetic Data Generation With Vision and Language Models
This paper proposes a vision-language grounded framework for interpretable synthetic data generation and evaluation in remote sensing. It introduces the ARAS400k dataset and shows that models trained on real data augmented with synthetic images consistently outperform real-data-only baselines on semantic segmentation and image captioning tasks.
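The real-plus-synthetic augmentation setup could be sketched roughly as below. This is a minimal illustration assuming a PyTorch-style training pipeline; the dataset objects, the `synthetic_ratio` parameter, and the helper name `build_augmented_dataset` are hypothetical and not taken from the paper.

```python
# Minimal sketch of mixing real and synthetic training samples, assuming
# PyTorch map-style datasets that yield (image, mask) or (image, caption) pairs.
# All names here are illustrative, not the paper's actual pipeline.
import random
from torch.utils.data import ConcatDataset, DataLoader, Subset

def build_augmented_dataset(real_ds, synthetic_ds, synthetic_ratio=0.5, seed=0):
    """Combine all real samples with a fraction of synthetic samples.

    synthetic_ratio: number of synthetic samples added per real sample.
    """
    n_synth = min(len(synthetic_ds), int(len(real_ds) * synthetic_ratio))
    rng = random.Random(seed)
    synth_indices = rng.sample(range(len(synthetic_ds)), n_synth)
    return ConcatDataset([real_ds, Subset(synthetic_ds, synth_indices)])

# Usage sketch:
# train_ds = build_augmented_dataset(real_train_ds, synthetic_train_ds, 0.5)
# loader = DataLoader(train_ds, batch_size=16, shuffle=True)
```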