# CLIP
This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12.
CLIP is a multimodal vision and language model motivated by overcoming the fixed set of object categories a conventional computer vision model is trained on. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining at this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and a text encoder to extract visual and text features. Both features are projected into a latent space with the same number of dimensions, and their dot product gives a similarity score.
You can find all the original CLIP checkpoints under the OpenAI organization.
The examples below demonstrate how to calculate similarity scores between an image and multiple text descriptions with the Pipeline or the AutoModel class.
```py
import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
    dtype=torch.bfloat16,
    device=0,
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
```

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "openai/clip-vit-base-patch32", dtype=torch.bfloat16, attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```

- Use CLIPImageProcessor to resize (or rescale) and normalize images for the model.
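As described above, both encoders project into a shared latent space and similarity is just a dot product. The minimal sketch below makes that explicit using CLIPModel.get_text_features and CLIPModel.get_image_features (both documented further down); the checkpoint, image, and labels mirror the examples above.

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Each encoder output is projected into the shared latent space
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize, then take the dot product to get cosine similarity scores
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.T)  # shape (1, 3): one score per label
```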
## CLIPConfig

[[autodoc]] CLIPConfig
## CLIPTextConfig

[[autodoc]] CLIPTextConfig
## CLIPVisionConfig

[[autodoc]] CLIPVisionConfig
## CLIPTokenizer

[[autodoc]] CLIPTokenizer
    - get_special_tokens_mask
    - save_vocabulary
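A brief usage sketch, assuming the openai/clip-vit-base-patch32 checkpoint used throughout this page:

```py
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# CLIP's BPE tokenizer lowercases text and wraps each sequence in <|startoftext|> / <|endoftext|>
enc = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
print(enc["input_ids"].shape)
```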
## CLIPTokenizerFast

[[autodoc]] CLIPTokenizerFast
## CLIPImageProcessor

[[autodoc]] CLIPImageProcessor
    - preprocess
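A brief sketch of the preprocessing it applies; the 224x224 resolution below is what the openai/clip-vit-base-patch32 checkpoint expects.

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resizes and center-crops to 224x224, rescales to [0, 1], and normalizes with CLIP's mean/std
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```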
## CLIPImageProcessorFast

[[autodoc]] CLIPImageProcessorFast
    - preprocess
## CLIPProcessor

[[autodoc]] CLIPProcessor
## CLIPModel

[[autodoc]] CLIPModel
    - forward
    - get_text_features
    - get_image_features
## CLIPTextModel

[[autodoc]] CLIPTextModel
    - forward
## CLIPTextModelWithProjection

[[autodoc]] CLIPTextModelWithProjection
    - forward
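A brief sketch showing how to get projected text embeddings; the 512-dimensional projection size is specific to the openai/clip-vit-base-patch32 checkpoint.

```py
from transformers import AutoTokenizer, CLIPTextModelWithProjection

model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
outputs = model(**inputs)

text_embeds = outputs.text_embeds  # projected into the shared latent space
print(text_embeds.shape)  # torch.Size([2, 512])
```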
## CLIPVisionModelWithProjection

[[autodoc]] CLIPVisionModelWithProjection
    - forward
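The image-side counterpart, again sketched with the openai/clip-vit-base-patch32 checkpoint:

```py
import requests
from PIL import Image
from transformers import AutoProcessor, CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

image_embeds = outputs.image_embeds  # projected into the shared latent space
print(image_embeds.shape)  # torch.Size([1, 512])
```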
## CLIPVisionModel

[[autodoc]] CLIPVisionModel
    - forward
## CLIPForImageClassification

[[autodoc]] CLIPForImageClassification
    - forward
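A minimal sketch of running the classification head on top of the vision encoder. Note that loading the base openai/clip-vit-base-patch32 checkpoint gives a randomly initialized head, so the prediction is only meaningful after fine-tuning or when loading a checkpoint fine-tuned for classification.

```py
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPForImageClassification

# Head is randomly initialized here; swap in a fine-tuned checkpoint for real predictions
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPForImageClassification.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = model.config.id2label[logits.argmax(-1).item()]
print(predicted_label)
```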