# MetaCLIP 2

This model was released on {release_date} and added to Hugging Face Transformers on 2025-08-20.

## Overview
MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.
This model was contributed by nielsr. The original code can be found here.
You can find all the MetaCLIP 2 checkpoints under the Meta organization.
The examples below demonstrate how to calculate similarity scores between multiple text descriptions and an image with Pipeline or the AutoModel class. Usage of the MetaCLIP 2 models is identical to the CLIP models; the only difference is that you use the MetaClip2Model class instead of CLIPModel.
```py
import torch
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="facebook/metaclip-2-worldwide-huge-quickgelu",
    dtype=torch.bfloat16,
    device=0,
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
```

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "facebook/metaclip-2-worldwide-huge-quickgelu",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
```
## MetaClip2Config

[[autodoc]] MetaClip2Config
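As a minimal sketch, assuming the configuration follows the usual composite pattern of CLIPConfig (a text and a vision sub-config nested inside the main config), a model can be initialized from scratch as follows; the default hyperparameter values are illustrative only:

```py
from transformers import MetaClip2Config, MetaClip2Model

# Build a configuration with default hyperparameters and initialize a
# randomly weighted model from it (no pretrained weights are loaded).
config = MetaClip2Config()
model = MetaClip2Model(config)

# The text and vision towers are described by nested sub-configurations.
print(config.text_config)
print(config.vision_config)
```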
## MetaClip2TextConfig

[[autodoc]] MetaClip2TextConfig
## MetaClip2VisionConfig

[[autodoc]] MetaClip2VisionConfig
## MetaClip2Model

[[autodoc]] MetaClip2Model
    - forward
    - get_text_features
    - get_image_features
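A sketch of the `get_text_features` and `get_image_features` methods listed above, assuming they mirror their CLIPModel counterparts and return embeddings projected into the shared text-image space, useful for retrieval-style similarity:

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MetaClip2Model

model = MetaClip2Model.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Embeddings projected into the shared text-image space.
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# Normalize and compute a cosine similarity score.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
print((text_features @ image_features.T).item())
```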
## MetaClip2TextModel

[[autodoc]] MetaClip2TextModel
    - forward
## MetaClip2TextModelWithProjection

[[autodoc]] MetaClip2TextModelWithProjection
    - forward
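A minimal sketch of standalone text embedding extraction, assuming the class behaves like CLIPTextModelWithProjection and returns a `text_embeds` tensor:

```py
import torch
from transformers import AutoTokenizer, MetaClip2TextModelWithProjection

model = MetaClip2TextModelWithProjection.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
tokenizer = AutoTokenizer.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One projected embedding per input sentence.
print(outputs.text_embeds.shape)
```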
## MetaClip2VisionModelWithProjection

[[autodoc]] MetaClip2VisionModelWithProjection
    - forward
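Analogously for the vision side, assuming the class returns an `image_embeds` tensor like CLIPVisionModelWithProjection:

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MetaClip2VisionModelWithProjection

model = MetaClip2VisionModelWithProjection.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# A single projected image embedding in the shared text-image space.
print(outputs.image_embeds.shape)
```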
## MetaClip2VisionModel

[[autodoc]] MetaClip2VisionModel
    - forward
## MetaClip2ForImageClassification

[[autodoc]] MetaClip2ForImageClassification
    - forward
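A hedged sketch of the classification variant, assuming it follows CLIPForImageClassification (a linear classifier on top of pooled vision features). The worldwide checkpoints do not ship a classification head, so the head below is randomly initialized and the model should be fine-tuned before its predictions are meaningful:

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MetaClip2ForImageClassification

# The classification head is randomly initialized when loading a plain
# CLIP-style checkpoint; fine-tune on a labeled dataset before trusting predictions.
model = MetaClip2ForImageClassification.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```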