Learning Transferable Visual Models From Natural Language Supervision

Published:

Recommended citation: Radford, Alec & Kim, Jong Wook, et al. "Learning Transferable Visual Models From Natural Language Supervision." (2021). https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf

Summary: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.

[Read the paper here](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf)
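The two ideas in the abstract — a pre-training task of matching captions to images, and zero-shot transfer by comparing an image against natural-language class descriptions — can be sketched in NumPy. This is a minimal, hedged illustration, not the paper's implementation: the function names, the fixed `temperature=0.07`, and the toy embeddings are assumptions for exposition (the actual model learns the temperature and produces embeddings with trained image and text encoders).

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs.

    Illustrative sketch: temperature is fixed here, though in practice
    it is a learned parameter.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] scores image i against caption j; true pairs sit on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image classification losses.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose prompt embedding is most similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    return int(np.argmax(class_text_embs @ image_emb))
```

Zero-shot transfer then amounts to embedding one text prompt per class (e.g. "a photo of a dog") and selecting the class with the highest cosine similarity to the image embedding — no dataset-specific training needed.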
