SAM3
This model was released on 2025-11-19 and added to Hugging Face Transformers on 2025-11-19.
Overview
SAM3 (Segment Anything Model 3) was introduced in the paper SAM 3: Segment Anything with Concepts.
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., “yellow school bus”) and predicts instance and semantic masks for every object instance matching the concept.
The abstract from the paper is the following:
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
This model was contributed by yonigozlan and ronghanghu.
Usage examples with 🤗 Transformers
Text-Only Prompts
Section titled “Text-Only Prompts”>>> from transformers import Sam3Processor, Sam3Model>>> import torch>>> from PIL import Image>>> import requests
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)>>> processor = Sam3Processor.from_pretrained("facebook/sam3")
>>> # Load image>>> image_url = "http://images.cocodataset.org/val2017/000000077595.jpg">>> image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
>>> # Segment using text prompt>>> inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
>>> with torch.no_grad():... outputs = model(**inputs)
>>> # Post-process results>>> results = processor.post_process_instance_segmentation(... outputs,... threshold=0.5,... mask_threshold=0.5,... target_sizes=inputs.get("original_sizes").tolist()... )[0]
>>> print(f"Found {len(results['masks'])} objects")>>> # Results contain:>>> # - masks: Binary masks resized to original image size>>> # - boxes: Bounding boxes in absolute pixel coordinates (xyxy format)>>> # - scores: Confidence scoresSingle Bounding Box Prompt
Single Bounding Box Prompt
Segment objects by providing a bounding box around an example of the visual concept:
>>> # Box in xyxy format: [x1, y1, x2, y2] in pixel coordinates
>>> # Example: laptop region
>>> box_xyxy = [100, 150, 500, 450]
>>> input_boxes = [[box_xyxy]]  # [batch, num_boxes, 4]
>>> input_boxes_labels = [[1]]  # 1 = positive box

>>> inputs = processor(
...     images=image,
...     input_boxes=input_boxes,
...     input_boxes_labels=input_boxes_labels,
...     return_tensors="pt"
... ).to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     mask_threshold=0.5,
...     target_sizes=inputs.get("original_sizes").tolist()
... )[0]
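To eyeball a prediction, a single mask can be written out as a grayscale image. A rough sketch, assuming `masks` is returned as a boolean tensor at the original image size:

>>> # Save the first predicted mask (if any) as a black-and-white PNG
>>> if len(results["masks"]) > 0:
...     mask = results["masks"][0].cpu().numpy()
...     Image.fromarray((mask * 255).astype("uint8")).save("mask_0.png")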
Multiple Box Prompts (Positive and Negative)
Use multiple boxes with positive and negative labels to refine the concept:
>>> # Load kitchen image
>>> kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
>>> kitchen_image = Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")

>>> # Define two positive boxes (e.g., dial and button on oven)
>>> # Boxes are in xyxy format [x1, y1, x2, y2] in pixel coordinates
>>> box1_xyxy = [59, 144, 76, 163]  # Dial box
>>> box2_xyxy = [87, 148, 104, 159]  # Button box
>>> input_boxes = [[box1_xyxy, box2_xyxy]]
>>> input_boxes_labels = [[1, 1]]  # Both positive

>>> inputs = processor(
...     images=kitchen_image,
...     input_boxes=input_boxes,
...     input_boxes_labels=input_boxes_labels,
...     return_tensors="pt"
... ).to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     mask_threshold=0.5,
...     target_sizes=inputs.get("original_sizes").tolist()
... )[0]
Combined Prompts (Text + Negative Box)
Use text prompts with negative visual prompts to refine the concept:
>>> # Segment "handle" but exclude the oven handle using a negative box
>>> text = "handle"
>>> # Negative box covering oven handle area (xyxy): [40, 183, 318, 204]
>>> oven_handle_box = [40, 183, 318, 204]
>>> input_boxes = [[oven_handle_box]]

>>> inputs = processor(
...     images=kitchen_image,
...     text=text,
...     input_boxes=input_boxes,
...     input_boxes_labels=[[0]],  # 0 = negative (exclude this region)
...     return_tensors="pt"
... ).to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Post-process results
>>> results = processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     mask_threshold=0.5,
...     target_sizes=inputs.get("original_sizes").tolist()
... )[0]
>>> # This will segment pot handles but exclude the oven handle
Batched Inference with Text Prompts
Process multiple images with different text prompts efficiently:
>>> cat_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
>>> kitchen_url = "http://images.cocodataset.org/val2017/000000136466.jpg"
>>> images = [
...     Image.open(requests.get(cat_url, stream=True).raw).convert("RGB"),
...     Image.open(requests.get(kitchen_url, stream=True).raw).convert("RGB")
... ]

>>> # Different text prompt for each image
>>> text_prompts = ["ear", "dial"]

>>> inputs = processor(images=images, text=text_prompts, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Post-process results for both images
>>> results = processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     mask_threshold=0.5,
...     target_sizes=inputs.get("original_sizes").tolist()
... )

>>> print(f"Image 1: {len(results[0]['masks'])} objects found")
>>> print(f"Image 2: {len(results[1]['masks'])} objects found")
Batched Mixed Prompts
Use different prompt types for different images in the same batch:
>>> # Image 1: text prompt "laptop"
>>> # Image 2: visual prompt (dial box)
>>> box2_xyxy = [59, 144, 76, 163]

>>> inputs = processor(
...     images=images,
...     text=["laptop", None],  # Only first image has text
...     input_boxes=[None, [box2_xyxy]],  # Only second image has box
...     input_boxes_labels=[None, [1]],  # Positive box for second image
...     return_tensors="pt"
... ).to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Post-process results for both images
>>> results = processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     mask_threshold=0.5,
...     target_sizes=inputs.get("original_sizes").tolist()
... )
>>> # Both images are processed in a single forward pass
Semantic Segmentation Output
SAM3 also provides semantic segmentation alongside instance masks:
>>> inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Instance segmentation masks
>>> instance_masks = torch.sigmoid(outputs.pred_masks)  # [batch, num_queries, H, W]

>>> # Semantic segmentation (single channel)
>>> semantic_seg = outputs.semantic_seg  # [batch, 1, H, W]

>>> print(f"Instance masks: {instance_masks.shape}")
>>> print(f"Semantic segmentation: {semantic_seg.shape}")
Efficient Multi-Prompt Inference on Single Image
When running multiple text prompts on the same image, pre-compute the vision embeddings once to avoid redundant computation:
>>> from transformers import Sam3Processor, Sam3Model
>>> import torch
>>> from PIL import Image
>>> import requests

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")

>>> # Load image
>>> image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
>>> image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

>>> # Pre-process image and compute vision embeddings once
>>> img_inputs = processor(images=image, return_tensors="pt").to(device)
>>> with torch.no_grad():
...     vision_embeds = model.get_vision_features(pixel_values=img_inputs.pixel_values)

>>> # Run multiple text prompts efficiently
>>> text_prompts = ["ear", "eye", "nose"]
>>> all_results = []

>>> for prompt in text_prompts:
...     text_inputs = processor(text=prompt, return_tensors="pt").to(device)
...     with torch.no_grad():
...         outputs = model(vision_embeds=vision_embeds, **text_inputs)
...
...     results = processor.post_process_instance_segmentation(
...         outputs,
...         threshold=0.5,
...         mask_threshold=0.5,
...         target_sizes=img_inputs.get("original_sizes").tolist()
...     )[0]
...     all_results.append({"prompt": prompt, "results": results})

>>> for item in all_results:
...     print(f"Prompt '{item['prompt']}': {len(item['results']['masks'])} objects found")
Efficient Single-Prompt Inference on Multiple Images
When running the same text prompt on multiple images, pre-compute the text embeddings once to avoid redundant computation:
>>> from transformers import Sam3Processor, Sam3Model
>>> import torch
>>> from PIL import Image
>>> import requests

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> model = Sam3Model.from_pretrained("facebook/sam3").to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3")

>>> # Pre-compute text embeddings once
>>> text_prompt = "ear"
>>> text_inputs = processor(text=text_prompt, return_tensors="pt").to(device)
>>> with torch.no_grad():
...     text_embeds = model.get_text_features(**text_inputs)

>>> # Load multiple images
>>> image_urls = [
...     "http://images.cocodataset.org/val2017/000000077595.jpg",
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in image_urls]

>>> # Run inference on each image, reusing the text embeddings
>>> # Note: attention_mask must be passed along with text_embeds for proper masking
>>> all_results = []

>>> for image in images:
...     img_inputs = processor(images=image, return_tensors="pt").to(device)
...     with torch.no_grad():
...         outputs = model(
...             pixel_values=img_inputs.pixel_values,
...             text_embeds=text_embeds,
...             attention_mask=text_inputs.attention_mask,
...         )
...
...     results = processor.post_process_instance_segmentation(
...         outputs,
...         threshold=0.5,
...         mask_threshold=0.5,
...         target_sizes=img_inputs.get("original_sizes").tolist()
...     )[0]
...     all_results.append(results)

>>> for i, results in enumerate(all_results):
...     print(f"Image {i+1}: {len(results['masks'])} '{text_prompt}' objects found")
Custom Resolution Inference
For faster inference or lower memory usage, run the model at a reduced input resolution:
>>> from transformers import Sam3Config

>>> config = Sam3Config.from_pretrained("facebook/sam3")
>>> config.image_size = 560
>>> model = Sam3Model.from_pretrained("facebook/sam3", config=config).to(device)
>>> processor = Sam3Processor.from_pretrained("facebook/sam3", size={"height": 560, "width": 560})
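The rest of the pipeline is unchanged at the lower resolution. A brief sketch, reusing the cat image and text prompt from the earlier sections:

>>> # Same processor/model calls as before, now with 560x560 inputs
>>> inputs = processor(images=image, text="ear", return_tensors="pt").to(device)
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> print(inputs.pixel_values.shape)  # should reflect the 560x560 input resolution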
Prompt Label Conventions
SAM3 uses the following label conventions; a short code sketch after the lists shows how they map onto processor inputs.
For points and boxes:

- 1: Positive prompt (include this region/object)
- 0: Negative prompt (exclude this region/object)
- -10: Padding value for batched inputs

Coordinate formats:

- Input boxes: [x1, y1, x2, y2] (xyxy format) in pixel coordinates
- Output boxes (raw): [x1, y1, x2, y2] (xyxy format), normalized to [0, 1]
- Output boxes (post-processed): [x1, y1, x2, y2] (xyxy format) in absolute pixel coordinates
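Putting the conventions together, a minimal sketch of a processor call with one positive and one negative box (coordinates reused from the examples above, purely for illustration):

>>> # One positive box (label 1) and one negative box (label 0) for the same concept
>>> input_boxes = [[[100, 150, 500, 450], [40, 183, 318, 204]]]  # xyxy, pixel coordinates
>>> input_boxes_labels = [[1, 0]]  # include the first region, exclude the second
>>> inputs = processor(
...     images=image,
...     input_boxes=input_boxes,
...     input_boxes_labels=input_boxes_labels,
...     return_tensors="pt",
... ).to(device)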
Sam3Config
[[autodoc]] Sam3Config

Sam3ViTConfig
[[autodoc]] Sam3ViTConfig

Sam3VisionConfig
[[autodoc]] Sam3VisionConfig

Sam3GeometryEncoderConfig
[[autodoc]] Sam3GeometryEncoderConfig

Sam3DETREncoderConfig
[[autodoc]] Sam3DETREncoderConfig

Sam3DETRDecoderConfig
[[autodoc]] Sam3DETRDecoderConfig

Sam3MaskDecoderConfig
[[autodoc]] Sam3MaskDecoderConfig

Sam3Processor
[[autodoc]] Sam3Processor - __call__

Sam3ImageProcessorFast
[[autodoc]] Sam3ImageProcessorFast - preprocess

Sam3ViTModel
[[autodoc]] Sam3ViTModel - forward

Sam3VisionModel
[[autodoc]] Sam3VisionModel - forward

Sam3Model
[[autodoc]] Sam3Model - forward