Summary: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
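The caption-matching pre-training task can be sketched as a symmetric contrastive objective over a batch of (image, text) embedding pairs: each image's matching caption is the positive, and every other caption in the batch is a negative. Below is a minimal NumPy sketch under that interpretation; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def contrastive_caption_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched (image, text) pairs.

    image_emb, text_emb: arrays of shape (n, d), where row i of each is
    an embedding of the i-th image and its matching caption.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity between image i and caption j.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the correct caption for image i is caption i

    def cross_entropy(l):
        # Softmax cross-entropy of each row against the diagonal label,
        # with a max-shift for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Average the image->caption and caption->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Training drives matched pairs toward high similarity and mismatched pairs toward low similarity; e.g. with embeddings that already agree (`contrastive_caption_loss(e, e)`), the loss is far smaller than with randomly paired embeddings.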
Read the paper here
Recommended citation: Radford, Alec & Kim, Jong Wook, et al. “Learning Transferable Visual Models From Natural Language Supervision.” (2021).