|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
datasets: |
|
- detection-datasets/coco |
|
language: |
|
- en |
|
pipeline_tag: object-detection |
|
--- |
|
|
|
# Relation DETR model with ResNet-50 backbone |
|
|
|
## Model Details |
|
|
|
The model is not yet available in the `transformers` library. We are working on integrating Relation-DETR and will update this card as soon as the integration is released.
|
|
|
### Model Description |
|
|
|
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F66939171e3a813f3bb10e804%2FkNzBZZ2SFq6Wgk2ki_c5t.png%3C%2Fspan%3E)%3C!-- HTML_TAG_END --> |
|
|
|
> This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
> We investigate the slow convergence problem in transformers from a new perspective, suggesting that it arises from
> the self-attention that introduces no structural bias over inputs. To address this issue, we explore incorporating
> position relation prior as attention bias to augment object detection, following the verification of its statistical
> significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR,
> introduces an encoder to construct position relation embeddings for progressive attention refinement, which further
> extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to address the conflicts
> between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific
> datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a
> significant improvement (+2.0% AP compared to DINO), state-of-the-art performance (51.7% AP for 1x and 52.1% AP
> for 2x settings), and a remarkably faster convergence speed (over 40% AP with only 2 training epochs) than existing
> DETR detectors on COCO val2017. Moreover, the proposed relation encoder serves as a universal plug-in-and-play component,
> bringing clear improvements for theoretically any DETR-like methods. Furthermore, we introduce a class-agnostic detection
> dataset, SA-Det-100k. The experimental results on the dataset illustrate that the proposed explicit position relation
> achieves a clear improvement of 1.3% AP, highlighting its potential towards universal object detection.
> The code and dataset are available at [this https URL](https://github.com/xiuqhou/Relation-DETR).
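The key idea in the abstract, turning pairwise box geometry into an attention bias, can be illustrated with a small self-contained sketch. This is not the authors' implementation; the function name, the choice of geometric features, and the projection layer below are assumptions for illustration only:

```python
import torch
import torch.nn as nn


def box_relation_bias(boxes: torch.Tensor, num_heads: int = 8, eps: float = 1e-5) -> torch.Tensor:
    """Toy pairwise position-relation bias from (cx, cy, w, h) boxes.

    Returns a (num_heads, N, N) tensor that could be added to attention logits;
    the actual Relation-DETR relation encoder is more elaborate.
    """
    cx, cy, w, h = boxes.unbind(-1)
    # Log-scaled pairwise geometry, roughly translation- and scale-invariant.
    dx = torch.log1p((cx[:, None] - cx[None, :]).abs() / (w[:, None] + eps))
    dy = torch.log1p((cy[:, None] - cy[None, :]).abs() / (h[:, None] + eps))
    dw = torch.log(w[:, None] / (w[None, :] + eps) + eps)
    dh = torch.log(h[:, None] / (h[None, :] + eps) + eps)
    rel = torch.stack([dx, dy, dw, dh], dim=-1)   # (N, N, 4)
    head_proj = nn.Linear(4, num_heads)           # illustrative projection only
    return head_proj(rel).permute(2, 0, 1)        # (num_heads, N, N)


boxes = torch.rand(5, 4)         # five random (cx, cy, w, h) boxes
bias = box_relation_bias(boxes)
print(bias.shape)                # torch.Size([8, 5, 5])
```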
|
|
|
- **Developed by:** Xiuquan Hou
|
- **Shared by:** Xiuquan Hou |
|
- **Model type:** Relation DETR |
|
- **License:** Apache-2.0 |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [https://github.com/xiuqhou/Relation-DETR](https://github.com/xiuqhou/Relation-DETR) |
|
- **Paper:** [Relation DETR: Exploring Explicit Position Relation Prior for Object Detection](https://arxiv.org/abs/2407.11699) |
|
<!-- - **Demo [optional]:** [More Information Needed] --> |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python |
|
import torch |
|
import requests |
|
|
|
from PIL import Image |
|
from transformers import RelationDetrForObjectDetection, RelationDetrImageProcessor |
|
|
|
url = 'http://images.cocodataset.org/val2017/000000039769.jpg' |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
# NOTE: the checkpoint below is a placeholder (RT-DETR weights); swap in the
# Relation-DETR ResNet-50 checkpoint once it is available on the Hub.
image_processor = RelationDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RelationDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
|
|
|
inputs = image_processor(images=image, return_tensors="pt") |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3) |
|
|
|
for result in results: |
|
for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]): |
|
score, label = score.item(), label_id.item() |
|
box = [round(i, 2) for i in box.tolist()] |
|
print(f"{model.config.id2label[label]}: {score:.2f} {box}") |
|
``` |
|
|
|
This should output:
|
|
|
```text
|
cat: 0.96 [343.8, 24.9, 639.52, 371.71] |
|
cat: 0.95 [12.6, 54.34, 316.37, 471.86] |
|
remote: 0.95 [40.09, 73.49, 175.52, 118.06] |
|
remote: 0.90 [333.09, 76.71, 369.77, 187.4] |
|
couch: 0.90 [0.44, 0.53, 640.44, 475.54] |
|
``` |
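If you want to inspect the detections visually, the post-processed results can be drawn back onto the image. This optional snippet reuses `results`, `image`, and `model` from the example above:

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label_id, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    box = box.tolist()
    label = model.config.id2label[label_id.item()]
    draw.rectangle(box, outline="red", width=2)
    draw.text((box[0], box[1]), f"{label}: {score.item():.2f}", fill="red")

image.save("detections.jpg")
```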
|
|
|
## Training Details |
|
|
|
The Relation DEtection TRansformer (Relation DETR) model is trained on [COCO 2017 object detection](https://cocodataset.org/#download) (118k annotated images) for 12 epochs (the 1x schedule).
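The COCO dataset referenced in this card's metadata can be loaded with the `datasets` library if you want to reproduce the preprocessing. A minimal sketch, assuming the `detection-datasets/coco` schema (split and column names may need adjusting) and reusing `image_processor` from the example above:

```python
from datasets import load_dataset

# Stream a few validation samples instead of downloading the full dataset.
coco = load_dataset("detection-datasets/coco", split="val", streaming=True)
sample = next(iter(coco))

inputs = image_processor(images=sample["image"], return_tensors="pt")
print(inputs["pixel_values"].shape)
```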
|
|
|
## Evaluation |
|
|
|
| Model                      | Backbone                   | Epochs | mAP   | AP<sub>50</sub> | AP<sub>75</sub> | AP<sub>S</sub> | AP<sub>M</sub> | AP<sub>L</sub> |
| -------------------------- | -------------------------- | :----: | :---: | :-------------: | :-------------: | :------------: | :------------: | :------------: |
| Relation DETR              | ResNet50                   | 12     | 51.7  | 69.1            | 56.3            | 36.1           | 55.6           | 66.1           |
| Relation DETR              | Swin-L<sub>(IN-22K)</sub>  | 12     | 57.8  | 76.1            | 62.9            | 41.2           | 62.1           | 74.4           |
| Relation DETR              | ResNet50                   | 24     | 52.1  | 69.7            | 56.6            | 36.1           | 56.0           | 66.5           |
| Relation DETR              | Swin-L<sub>(IN-22K)</sub>  | 24     | 58.1  | 76.4            | 63.5            | 41.8           | 63.0           | 73.5           |
| Relation-DETR<sup>†</sup>  | Focal-L<sub>(IN-22K)</sub> | 4+24   | 63.5  | 80.8            | 69.1            | 47.2           | 66.9           | 77.0           |
|
|
|
† denotes a model fine-tuned on COCO after pretraining on Objects365.
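The table reports standard COCO metrics (mAP averaged over IoU thresholds 0.5:0.95, plus AP at fixed IoUs and per object size). A minimal sketch of scoring predictions against ground truth with `torchmetrics` (requires `torchmetrics` and `pycocotools`; this is not the authors' evaluation script, and the boxes and labels below are made up for illustration):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")

# One toy image; in practice, accumulate predictions over all of COCO val2017.
preds = [{
    "boxes": torch.tensor([[12.6, 54.3, 316.4, 471.9]]),
    "scores": torch.tensor([0.95]),
    "labels": torch.tensor([17]),   # label ids only need to match the targets
}]
targets = [{
    "boxes": torch.tensor([[10.0, 50.0, 320.0, 475.0]]),
    "labels": torch.tensor([17]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])
```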
|
|
|
## Model Architecture and Objective |
|
|
|
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F66939171e3a813f3bb10e804%2FUMtLjkxrwoDikUBlgj-Fc.png%3C%2Fspan%3E)%3C!-- HTML_TAG_END --> |
|
|
|
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F66939171e3a813f3bb10e804%2FMBbCM-zQGgUjKUmwB0yje.png%3C%2Fspan%3E)%3C!-- HTML_TAG_END --> |
|
|
|
## Citation and BibTeX |
|
|
|
```bibtex
|
@misc{hou2024relationdetrexploringexplicit, |
|
title={Relation DETR: Exploring Explicit Position Relation Prior for Object Detection}, |
|
author={Xiuquan Hou and Meiqin Liu and Senlin Zhang and Ping Wei and Badong Chen and Xuguang Lan}, |
|
year={2024}, |
|
eprint={2407.11699}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2407.11699}, |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
|
|
[xiuqhou](https://huggingface.co/xiuqhou) |
|
|