V-JEPA: Latent Video Prediction for Visual Representation Learning
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas
openreview.net
This paper shows that the masked-modelling principle driving the success of large foundation language models can be effectively applied to video by making predictions in latent space. We introduce V-JEPA, a method for self-supervised learning from video that predicts masked spatio-temporal regions in a learned representation space. Our latent video prediction strategy produces visual features that can be applied to various downstream image and video tasks without adapting the model's parameters (using only frozen evaluation), achieving 82.1% on Kinetics-400 and 71.2% on Something-Something-v2, surpassing the previous best video models by +4 and +10 points, respectively. We also demonstrate the benefit of video pretraining compared to image pretraining for tasks involving motion understanding, where V-JEPA outperforms the largest state-of-the-art image models, DINOv2 and OpenCLIP. Finally, V-JEPA trained only on video achieves 77.9% on ImageNet classification without any image fine-tuning, surpassing the previous best video model by +6 points in top-1 accuracy.
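
The following is a minimal, self-contained sketch of the latent masked-prediction objective described above, assuming a V-JEPA-style setup: a context encoder sees only the unmasked spatio-temporal patches, a light-weight predictor regresses the representations of the masked patches, and the regression targets come from an exponential-moving-average (EMA) copy of the encoder. All module definitions, sizes, the masking ratio, and the choice of an L1 regression loss are illustrative assumptions, not the authors' implementation.

```python
# Sketch of latent masked prediction for video patches (V-JEPA-style).
# Stand-in MLPs replace the video transformers; shapes and hyperparameters
# are assumptions for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D_PATCH, D_MODEL, N_TOKENS = 768, 384, 1024  # flattened patch dim, embed dim, tokens per clip

class TokenEncoder(nn.Module):
    """Stand-in for a video transformer: embeds a set of patch tokens."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_PATCH, D_MODEL), nn.GELU(),
                                 nn.Linear(D_MODEL, D_MODEL))

    def forward(self, tokens):          # tokens: (B, N, D_PATCH)
        return self.net(tokens)         # -> (B, N, D_MODEL)

class Predictor(nn.Module):
    """Stand-in predictor: combines pooled context features with learnable
    mask-token queries to predict representations of the masked patches."""
    def __init__(self):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, D_MODEL))
        self.net = nn.Sequential(nn.Linear(2 * D_MODEL, D_MODEL), nn.GELU(),
                                 nn.Linear(D_MODEL, D_MODEL))

    def forward(self, ctx_feats, n_masked):
        pooled = ctx_feats.mean(dim=1, keepdim=True)              # (B, 1, D_MODEL)
        queries = self.query.expand(ctx_feats.size(0), n_masked, -1)
        return self.net(torch.cat([queries, pooled.expand_as(queries)], dim=-1))

encoder = TokenEncoder()
predictor = Predictor()
target_encoder = copy.deepcopy(encoder)          # EMA target, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

def training_step(clip_tokens, mask_ratio=0.75, ema=0.998):
    """clip_tokens: (B, N_TOKENS, D_PATCH) flattened spatio-temporal patches."""
    B, N, _ = clip_tokens.shape
    n_masked = int(mask_ratio * N)
    perm = torch.randperm(N)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

    # Context encoder only sees the visible (unmasked) patches.
    ctx = encoder(clip_tokens[:, visible_idx])

    # Targets: representations of the masked patches from the EMA encoder.
    with torch.no_grad():
        targets = target_encoder(clip_tokens)[:, masked_idx]

    # Predict the masked representations and regress them in latent space.
    preds = predictor(ctx, n_masked)
    loss = F.l1_loss(preds, targets)

    # EMA update of the target encoder (in practice, after the optimizer step).
    with torch.no_grad():
        for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema).add_(p_o.detach(), alpha=1 - ema)
    return loss

loss = training_step(torch.randn(2, N_TOKENS, D_PATCH))
```

For downstream evaluation in the frozen-evaluation setting, `encoder` would be kept fixed and only a small probe trained on its output features; the probe architecture is not specified in this sketch.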