SigLIP
This model was released on 2023-03-27 and added to Hugging Face Transformers on 2024-01-08.
SigLIP is a multimodal image-text model similar to CLIP. It uses separate image and text encoders to generate representations for both modalities.
Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during training. This training loss eliminates the need for a global view of all pairwise similarities between images and texts within a batch. Consequently, it enables more efficient scaling to larger batch sizes while also delivering superior performance with smaller batch sizes.
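As a rough illustration of the idea, the pairwise sigmoid loss can be sketched in a few lines of PyTorch. This is a simplified sketch, not the original training code: img_emb and txt_emb are assumed to be L2-normalized image and text embeddings of shape (batch_size, dim), and t and b stand in for the learnable temperature and bias described in the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, t, b):
    # Pairwise similarities between every image and every text in the batch.
    logits = img_emb @ txt_emb.t() * t + b  # (batch_size, batch_size)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair contributes an independent binary (sigmoid) term,
    # so no batch-wide softmax normalization is needed.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```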
You can find all the original SigLIP checkpoints under the SigLIP collection.
The example below demonstrates how to generate similarity scores between texts and image(s) with Pipeline or the AutoModel class.
```python
import torch
from transformers import pipeline

image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]

pipeline = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224", device=0, dtype=torch.bfloat16)
pipeline(image, candidate_labels=candidate_labels)
```

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-base-patch16-224", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to only quantize the weights to int4.
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("google/siglip-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```
- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use torch.distributed utilities, which may limit the scalability of the batch size.
- When using the standalone SiglipTokenizer or SiglipProcessor, make sure to pass padding="max_length" because that is how the model was trained (see the sketch after this list).
- To get the same results as the Pipeline, a prompt template of "This is a photo of {label}." should be passed to the processor.
- Toggle the attn_implementation parameter to either "sdpa" or "flash_attention_2" to use a more memory-efficient attention.

```python
# pip install -U flash-attn --no-build-isolation
import torch
from transformers import SiglipModel

device = "cuda"  # FlashAttention requires a CUDA device
model = SiglipModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    attn_implementation="flash_attention_2",
    dtype=torch.float16,
    device_map=device,
)
```
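Tying together the padding and prompt-template notes above, standalone processor usage would look roughly like the sketch below. This is a minimal sketch with example labels, not a complete inference script.

```python
from transformers import SiglipProcessor

processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")

candidate_labels = ["a Pallas cat", "a lion"]
# Apply the prompt template and pad to the fixed length used during training.
texts = [f"This is a photo of {label}." for label in candidate_labels]
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt")
```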
SiglipConfig
[[autodoc]] SiglipConfig
SiglipTextConfig
[[autodoc]] SiglipTextConfig
SiglipVisionConfig
[[autodoc]] SiglipVisionConfig
SiglipTokenizer
[[autodoc]] SiglipTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
SiglipImageProcessor
[[autodoc]] SiglipImageProcessor
    - preprocess
SiglipImageProcessorFast
[[autodoc]] SiglipImageProcessorFast
    - preprocess
SiglipProcessor
[[autodoc]] SiglipProcessor
SiglipModel
[[autodoc]] SiglipModel
    - forward
    - get_text_features
    - get_image_features
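For example, get_text_features and get_image_features return pooled embeddings for each modality. The snippet below is a minimal sketch assuming the google/siglip-base-patch16-224 checkpoint and the example image used earlier on this page.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, SiglipModel

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=["This is a photo of a Pallas cat."], padding="max_length", return_tensors="pt")

with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)  # pooled image embedding, shape (1, hidden_size)
    text_embeds = model.get_text_features(**text_inputs)     # pooled text embedding, shape (1, hidden_size)
```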
SiglipTextModel
[[autodoc]] SiglipTextModel
    - forward
SiglipVisionModel
[[autodoc]] SiglipVisionModel
    - forward
SiglipForImageClassification
[[autodoc]] SiglipForImageClassification
    - forward