RT-DETRv2
This model was released on 2024-07-24 and added to Hugging Face Transformers on 2025-02-06.
Overview
The RT-DETRv2 model was proposed in RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer by Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu.
RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction, a discrete sampling operator for broader deployment compatibility, and improved training strategies like dynamic data augmentation and scale-adaptive hyperparameters. These changes enhance flexibility and practicality while maintaining real-time performance.
The abstract from the paper is the following:
In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameters customization to improve performance without loss of speed.
This model was contributed by jadechoghari. The original code can be found here.
Usage tips
This second version of RT-DETR improves how the decoder finds objects in an image.
- better sampling – adjusts offsets so the model looks at the right areas
- flexible attention – can use smooth (bilinear) or fixed (discrete) sampling (see the sketch after this list)
- optimized processing – improves how attention weights mix information
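The choice of sampling operator is controlled through RTDetrV2Config. The snippet below is a minimal sketch of overriding it before loading the model; the attribute name `decoder_method` and the value `"discrete"` are assumptions and may differ in your Transformers version, so print the loaded config first to see which sampling-related options it actually exposes.

```python
# Minimal sketch -- the attribute name `decoder_method` and the value
# "discrete" are assumptions, not guaranteed API; print the config to see
# which sampling-related options your transformers version exposes.
from transformers import RTDetrV2Config, RTDetrV2ForObjectDetection

config = RTDetrV2Config.from_pretrained("PekingU/rtdetr_v2_r18vd")
print(config)  # inspect the decoder/sampling attributes available

config.decoder_method = "discrete"  # hypothetical: discrete instead of bilinear sampling
model = RTDetrV2ForObjectDetection.from_pretrained("PekingU/rtdetr_v2_r18vd", config=config)
```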
The model is meant to be used on images resized to 640x640 with the corresponding ImageProcessor. Resizing to other dimensions will generally degrade performance.
```python
>>> import torch
>>> import requests

>>> from PIL import Image
>>> from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_v2_r18vd")
>>> model = RTDetrV2ForObjectDetection.from_pretrained("PekingU/rtdetr_v2_r18vd")

>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.5)

>>> for result in results:
...     for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
...         score, label = score.item(), label_id.item()
...         box = [round(i, 2) for i in box.tolist()]
...         print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.97 [341.14, 25.11, 639.98, 372.89]
cat: 0.96 [12.78, 56.35, 317.67, 471.34]
remote: 0.95 [39.96, 73.12, 175.65, 117.44]
sofa: 0.86 [-0.11, 2.97, 639.89, 473.62]
sofa: 0.82 [-0.12, 1.78, 639.87, 473.52]
remote: 0.79 [333.65, 76.38, 370.69, 187.48]
```
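To confirm the 640x640 preprocessing mentioned above, you can inspect the processor and the tensors it returns. This short sketch reuses `image_processor` and `inputs` from the example; the default size comes from the checkpoint's preprocessing config, so the values shown in the comments are what we expect for this checkpoint rather than a guarantee.

```python
# Sanity check (sketch): the processor's default `size` and the shape of the
# returned pixel values should both reflect 640x640 inputs.
print(image_processor.size)          # expected: {'height': 640, 'width': 640}
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 640, 640])
```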
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RT-DETRv2.
- Scripts for finetuning RTDetrV2ForObjectDetection with Trainer or Accelerate can be found here (a minimal loading sketch follows this list).
- See also: Object detection task guide.
- Notebooks for inference and fine-tuning RT-DETRv2 on a custom dataset (🌎).
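If you only need the skeleton of such a fine-tuning setup, the sketch below loads the checkpoint with a fresh detection head sized for a custom label set. The label names are placeholders; the scripts and notebooks linked above remain the reference for the full training loop.

```python
# Minimal sketch of preparing RT-DETRv2 for fine-tuning on a custom dataset.
# The labels below are placeholders -- replace them with your own classes.
from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

id2label = {0: "defect", 1: "scratch"}
label2id = {name: idx for idx, name in id2label.items()}

model = RTDetrV2ForObjectDetection.from_pretrained(
    "PekingU/rtdetr_v2_r18vd",
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initializes the detection head for the new label count
)
image_processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_v2_r18vd")
```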
RTDetrV2Config
[[autodoc]] RTDetrV2Config
RTDetrV2Model
[[autodoc]] RTDetrV2Model
    - forward
RTDetrV2ForObjectDetection
[[autodoc]] RTDetrV2ForObjectDetection
    - forward