Escaping The Big Data Paradigm in Self-Supervised Representation Learning
This paper introduces SCOTT, a sparse convolutional tokenizer paired with a MIM-JEPA training framework. Together they enable Vision Transformers to learn robust self-supervised representations from scratch on small-scale, fine-grained datasets, challenging the assumption that effective vision representation learning requires big data and massive computational resources.