Files changed (1)
  1. README.md +134 -32
README.md CHANGED
@@ -1,63 +1,165 @@
1
  ---
2
  license: mit
3
  ---
4
 
5
- # DPT 3.1 (Swinv2 backbone)
6
 
7
  DPT (Dense Prediction Transformer) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/MiDaS/tree/master).
8
 
9
- Disclaimer: The team releasing DPT did not write a model card for this model so this model card has been written by the Hugging Face team.
10
 
11
  ## Model description
12
 
13
- This DPT model uses the [Swinv2](https://huggingface.co/docs/transformers/model_doc/swinv2) model as backbone and adds a neck + head on top for monocular depth estimation.
14
 
15
- ![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg)
16
 
17
  ## How to use
18
 
19
- Here is how to use this model for zero-shot depth estimation on an image:
20
 
 
21
  ```python
22
- from transformers import DPTImageProcessor, DPTForDepthEstimation
23
  import torch
24
- import numpy as np
25
- from PIL import Image
26
- import requests
27
-
28
- url = "http://images.cocodataset.org/val2017/000000039769.jpg"
29
- image = Image.open(requests.get(url, stream=True).raw)
30
-
31
- processor = DPTImageProcessor.from_pretrained("Intel/dpt-swinv2-large-384")
32
- model = DPTForDepthEstimation.from_pretrained("Intel/dpt-swinv2-large-384")
33
 
34
- # prepare image for the model
35
- inputs = processor(images=image, return_tensors="pt")
36
 
37
- with torch.no_grad():
38
-     outputs = model(**inputs)
38
-     predicted_depth = outputs.predicted_depth
 
40
 
41
- # interpolate to original size
42
- prediction = torch.nn.functional.interpolate(
43
-     predicted_depth.unsqueeze(1),
44
-     size=image.size[::-1],
45
-     mode="bicubic",
46
-     align_corners=False,
47
- )
48
 
49
- # visualize the prediction
50
  output = prediction.squeeze().cpu().numpy()
51
  formatted = (output * 255 / np.max(output)).astype("uint8")
52
  depth = Image.fromarray(formatted)
 
53
  ```
54
 
55
  or one can use the pipeline API:
56
-
57
- ```python
58
  from transformers import pipeline
59
 
60
  pipe = pipeline(task="depth-estimation", model="Intel/dpt-swinv2-large-384")
61
- result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
62
  result["depth"]
63
- ```
 
1
  ---
2
  license: mit
3
+ tags:
4
+ - vision
5
+ - depth-estimation
6
+
7
+ model-index:
8
+ - name: dpt-swinv2-large-384
9
+   results:
10
+   - task:
11
+       type: monocular-depth-estimation
12
+       name: Monocular Depth Estimation
13
+     dataset:
14
+       type: MIX-6
15
+       name: MIX-6
16
+     metrics:
17
+     - type: Zero-shot transfer
18
+       value: 10.82
19
+       name: Zero-shot transfer
20
+       config: Zero-shot transfer
21
+       verified: false
22
  ---
23
 
24
+ # MiDaS 3.1 DPT (Intel/dpt-swinv2-large-384 using SwinV2 backbone)
25
 
26
  DPT (Dense Prediction Transformer) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [this repository](https://github.com/isl-org/MiDaS/tree/master).
27
 
28
+ **Disclaimer:** The team releasing DPT did not write a model card for this model so this model card has been written by Intel and the Hugging Face team.
29
+
30
+
31
+ # Overview of Monocular depth estimation
32
+
33
+ The aim of monocular depth estimation is to infer detailed depth from a single image or camera view. It finds applications in fields like generative AI, 3D reconstruction, and autonomous driving. However, deriving depth from individual pixels in a single image is challenging because the problem is under-constrained. Recent progress is attributed to learning-based methods, particularly MiDaS, which leverages dataset mixing and a scale-and-shift-invariant loss. MiDaS has evolved through releases featuring more powerful backbones and lightweight variants for mobile use. With the rise of transformer architectures in computer vision, including those pioneered by models like ViT, Swin, and SwinV2, there has been a shift towards using them for depth estimation. Inspired by this, MiDaS v3.1 incorporates promising transformer-based encoders alongside traditional convolutional ones, aiming for a comprehensive investigation of depth estimation techniques. The paper focuses on describing the integration of these backbones into MiDaS, providing a thorough comparison of different v3.1 models, and offering guidance on utilizing future backbones with MiDaS.
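
To make the idea of a scale-and-shift-invariant loss more concrete, here is a minimal sketch. It assumes a simplified least-squares formulation: fit a scale `s` and shift `t` that best map the prediction onto the target, then measure the remaining error. The function names are hypothetical, and this is not necessarily the exact loss used to train this checkpoint.

```python
import numpy as np


def align_scale_shift(pred: np.ndarray, target: np.ndarray):
    """Least-squares scale s and shift t so that s * pred + t best matches target."""
    p = pred.ravel().astype(np.float64)
    y = target.ravel().astype(np.float64)
    # Each row of the design matrix is [pred_i, 1], so the solution is [s, t].
    design = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(s), float(t)


def scale_shift_invariant_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error after removing the best-fitting scale and shift."""
    s, t = align_scale_shift(pred, target)
    return float(np.mean((s * pred + t - target) ** 2))


# Illustrative data only: a prediction that differs from the target
# purely by scale and shift should give a loss of (almost) zero.
target = np.array([[1.0, 2.0], [3.0, 5.0]])
pred = 0.5 * target - 2.0
loss = scale_shift_invariant_mse(pred, target)
```

Because the loss ignores global scale and shift, the model is only asked to recover relative depth, which is one reason datasets with different depth conventions can be mixed during training.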
34
+
35
+ The Swin Transformer (the name Swin stands for Shifted window) was first described in an arXiv paper and serves as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connections.
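
The windowing scheme described above can be illustrated with a small NumPy sketch. It only shows how a feature map is split into non-overlapping windows and how a cyclic shift of half a window changes which positions share a window; the attention computation and the masking that a real Swin block applies are left out, and the function names are hypothetical.

```python
import numpy as np


def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Split an (H, W) map into non-overlapping (window, window) tiles."""
    h, w = x.shape
    if h % window or w % window:
        raise ValueError("H and W must be multiples of the window size")
    tiles = x.reshape(h // window, window, w // window, window)
    # Bring the two tile indices to the front, then flatten them.
    return tiles.transpose(0, 2, 1, 3).reshape(-1, window, window)


def shifted_window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Cyclically shift the map by half a window, then partition it."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)


feature_map = np.arange(16).reshape(4, 4)
regular = window_partition(feature_map, 2)          # 4 windows of 2x2
shifted = shifted_window_partition(feature_map, 2)  # same count, new grouping
```

In this toy case the first shifted window contains one position from each of the four regular windows, which is how information crosses window boundaries between consecutive layers.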
36
+
37
+ Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
38
+
39
+ | Input Image | Output Depth Image |
40
+ | --- | --- |
41
+ | ![input image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/PDwRwuryaO3YtuyRjraiM.jpeg) | ![Depth image](https://cdn-uploads.huggingface.co/production/uploads/63dc702662dc193e6d460f1b/ugqri6LcqJBuU9zI9aeqN.jpeg) |
42
+
43
+ # Videos
44
+
45
+ ![MiDaS Depth Estimation | Intel Technology](https://cdn-uploads.huggingface.co/production/uploads/641bd18baebaa27e0753f2c9/u-KwRFIQhMWiFraSTTBkc.png)
46
+
47
+ MiDaS Depth Estimation is a machine learning model from Intel Labs for monocular depth estimation. It was trained on up to 12 datasets and covers both indoor and outdoor scenes. Multiple MiDaS models are available, ranging from high-quality depth estimation to lightweight models for mobile downstream tasks (https://github.com/isl-org/MiDaS).
48
+
49
 
50
  ## Model description
51
 
52
+ This MiDaS 3.1 DPT model uses the [SwinV2](https://huggingface.co/docs/transformers/en/model_doc/swinv2) model as its backbone. SwinV2 takes a different approach to vision than BEiT: Swin backbones focus more on a hierarchical design.
53
+ ![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/swin_transformer_architecture.png)
54
+
55
+ The previous release, MiDaS v3.0, relies solely on the
56
+ vanilla vision transformer (ViT), whereas MiDaS v3.1 offers additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT.
57
+
58
+ # MiDaS 3.1 DPT Model (Swin backbone)
59
+ This model card covers Intel/dpt-swinv2-large-384, which is based on the SwinV2 backbone. The arXiv paper compares both BEiT and Swin backbones.
60
+ According to the paper, the highest quality depth estimation is achieved using the BEiT transformer. We provide variants such as Swin-L, SwinV2-L, SwinV2-B and SwinV2-T, where the letters denote large, base and tiny models respectively. Numbers in model names signify the training resolution, for example 384 for 384x384 or 512 for 512x512.
61
+
62
+
63
+
64
+ This model card refers specifically to the SwinV2 variant described in the paper, published as dpt-swinv2-large-384. A more recent paper from 2023, [MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation](https://arxiv.org/pdf/2307.14460.pdf), specifically discusses Swin and SwinV2.
66
+
67
+ This model card was written jointly by the Hugging Face team and Intel.
68
+
69
+ | Model Detail | Description |
70
+ | ----------- | ----------- |
71
+ | Model Authors - Company | Intel |
72
+ | Date | March 18, 2024 |
73
+ | Version | 1 |
74
+ | Type | Computer Vision - Monocular Depth Estimation |
75
+ | Paper or Other Resources | [MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation](https://arxiv.org/pdf/2307.14460.pdf) and [GitHub Repo](https://github.com/isl-org/MiDaS/blob/master/README.md) |
76
+ | License | MIT |
77
+ | Questions or Comments | [Community Tab](https://huggingface.co/Intel/dpt-swinv2-large-384/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ)|
78
+
79
+ | Intended Use | Description |
80
+ | ----------- | ----------- |
81
+ | Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the [model hub](https://huggingface.co/models?search=dpt-swinv2) to look for fine-tuned versions on a task that interests you. |
82
+ | Primary intended users | Anyone doing monocular depth estimation |
83
+ | Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.|
84
+
85
 
 
86
 
87
  ## How to use
88
 
89
+ Be sure to update PyTorch and Transformers, as version mismatches can generate errors such as: "TypeError: unsupported operand type(s) for //: 'NoneType' and 'NoneType'".
90
 
91
+ As tested by this contributor, the following versions ran correctly:
92
  ```python
 
93
  import torch
94
+ import transformers
95
+ print(torch.__version__)
96
+ print(transformers.__version__)
97
+ ```
98
+ ```bash
99
+ out: '2.2.1+cpu'
100
+ out: '4.37.2'
101
+ ```
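
A quick way to compare an installed version against the ones listed above is sketched below. Treating the tested versions as minimums is an assumption of this sketch, not a documented requirement, and the helper names are hypothetical.

```python
import re

# Versions reported as working in this model card. Using them as
# minimums is an assumption, not a documented requirement.
TESTED_VERSIONS = {"torch": "2.2.1", "transformers": "4.37.2"}


def version_tuple(version: str) -> tuple:
    """Turn '2.2.1+cpu' into (2, 2, 1), ignoring local build tags."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def is_at_least(installed: str, tested: str) -> bool:
    """Return True when the installed version is not older than the tested one."""
    return version_tuple(installed) >= version_tuple(tested)


def check_installed() -> dict:
    """Compare installed packages against TESTED_VERSIONS (needs both installed)."""
    import importlib

    report = {}
    for name, tested in TESTED_VERSIONS.items():
        module = importlib.import_module(name)
        report[name] = is_at_least(module.__version__, tested)
    return report
```

Calling `check_installed()` in an environment with both packages returns a dictionary such as `{"torch": True, "transformers": True}` when neither package is older than the tested version.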
 
102
 
103
+ ### To Install:
 
104
 
105
+ ```bash
106
+ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
107
+
108
+ ```
109
 
110
+ ### To Use:
111
+ Here is how to use this model for zero-shot depth estimation on an image:
 
 
 
 
 
112
 
113
+ ```python
114
  output = prediction.squeeze().cpu().numpy()
115
  formatted = (output * 255 / np.max(output)).astype("uint8")
116
  depth = Image.fromarray(formatted)
117
+ depth
118
  ```
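
The snippet above relies on `prediction`, `np` and `Image` being defined already. A self-contained sketch of the whole flow is given below, using the `DPTImageProcessor` and `DPTForDepthEstimation` calls from the previous version of this README. The helper names `depth_to_uint8` and `estimate_depth` are hypothetical, and clipping negative values before scaling is an added assumption.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Scale a relative depth map to the 0..255 range for display."""
    # Bicubic resizing can produce small negative values, so clip first.
    depth = np.clip(np.asarray(depth, dtype=np.float32), 0.0, None)
    peak = float(depth.max())
    if peak <= 0.0:  # avoid dividing by zero on an all-zero map
        return np.zeros(depth.shape, dtype=np.uint8)
    return (depth * 255.0 / peak).astype(np.uint8)


def estimate_depth(image, checkpoint="Intel/dpt-swinv2-large-384"):
    """Run the model on a PIL image and return an 8-bit PIL depth image."""
    # Heavy imports stay local so the helper above works without them.
    import torch
    from PIL import Image
    from transformers import DPTForDepthEstimation, DPTImageProcessor

    processor = DPTImageProcessor.from_pretrained(checkpoint)
    model = DPTForDepthEstimation.from_pretrained(checkpoint)

    # Prepare the image for the model and run inference.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted_depth = model(**inputs).predicted_depth

    # Resize the prediction back to the original (height, width).
    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )
    output = prediction.squeeze().cpu().numpy()
    return Image.fromarray(depth_to_uint8(output))
```

With an image loaded through `PIL.Image.open`, calling `estimate_depth(image)` downloads the checkpoint on first use and returns a grayscale depth image of the same size as the input.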
119
 
120
  or one can use the pipeline API:

+ ```python
121
  from transformers import pipeline
122
 
123
  pipe = pipeline(task="depth-estimation", model="Intel/dpt-swinv2-large-384")
124
+ result = pipe("http://images.cocodataset.org/val2017/000000181816.jpg")
125
  result["depth"]
126
+ ```
127
+
128
+ ## Quantitative Analyses
129
+ | Model | Square Resolution HRWSI RMSE | Square Resolution Blended MVS REL | Square Resolution ReDWeb RMSE |
130
+ | --- | --- | --- | --- |
131
+ | BEiT 384-L | 0.068 | 0.070 | 0.076 |
132
+ | Swin-L Training 1 | 0.0708 | 0.0724 | 0.0826 |
133
+ | Swin-L Training 2 | 0.0713 | 0.0720 | 0.0831 |
134
+ | ViT-L | 0.071 | 0.072 | 0.082 |
135
+ | --- | --- | --- | --- |
136
+ | Next-ViT-L-1K-6M | 0.075 | 0.073 | 0.085 |
137
+ | DeiT3-L-22K-1K | 0.070 | 0.070 | 0.080 |
138
+ | ViT-L-Hybrid | 0.075 | 0.075 | 0.085 |
139
+ | DeiT3-L | 0.077 | 0.075 | 0.087 |
140
+ | --- | --- | --- | --- |
141
+ | ConvNeXt-XL | 0.075 | 0.075 | 0.085 |
142
+ | ConvNeXt-L | 0.076 | 0.076 | 0.087 |
143
+ | EfficientNet-L2 | 0.165 | 0.277 | 0.219 |
144
+ | --- | --- | --- | --- |
145
+ | ViT-L Reversed | 0.071 | 0.073 | 0.081 |
146
+ | Swin-L Equidistant | 0.072 | 0.074 | 0.083 |
147
+ | --- | --- | --- | --- |
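
For readers unfamiliar with the column names, the sketch below shows the conventional definitions of RMSE and absolute relative error (REL). It assumes plain per-pixel formulas on already aligned maps; the paper's full evaluation protocol (alignment, valid-pixel masks, depth versus disparity space) is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean squared error between prediction and ground truth."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def abs_rel(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean of |pred - target| / target; target values must be positive."""
    return float(np.mean(np.abs(pred - target) / target))


# Illustrative values only, not taken from any benchmark.
target = np.array([1.0, 2.0, 4.0])
pred = np.array([1.0, 2.0, 5.0])
scores = {"rmse": rmse(pred, target), "rel": abs_rel(pred, target)}
```

Lower is better for both metrics.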
148
+
149
+ ### BibTeX entry and citation info
150
+
151
+ ```bibtex
152
+ @article{DBLP:journals/corr/abs-2307-14460,
153
+ author = {Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
154
+ title = {MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation},
155
+ journal = {CoRR},
156
+ volume = {abs/2307.14460},
157
+ year = {2023},
158
+ url = {https://arxiv.org/abs/2307.14460},
159
+ eprinttype = {arXiv},
160
+ eprint = {2307.14460},
161
+ timestamp = {Wed, 26 Jul 2023},
162
+ biburl = {https://dblp.org/rec/journals/corr/abs-2307.14460.bib},
163
+ bibsource = {dblp computer science bibliography, https://dblp.org}
164
+ }
165
+ ```