bconsolvo kellibelcher committed on
Commit
8fc3e11
·
verified ·
1 Parent(s): 0a85b5e

Update README.md (#3)

- Update README.md (37c6e01e37bf0215fc3fff5f8a16fe023a42624a)


Co-authored-by: Kelli B <[email protected]>

Files changed (1)
  1. README.md +93 -27
README.md CHANGED
@@ -1,44 +1,71 @@
1
  ---
2
- license: creativeml-openrail-m
3
- datasets:
4
- - laion/laion400m
5
- tags:
6
- - stable-diffusion
7
- - stable-diffusion-diffusers
8
- - text-to-image
9
  language:
10
- - en
11
  pipeline_tag: text-to-3d
 
12
  ---
13
 
14
- # LDM3D-VR model
15
 
16
- The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal.
17
 
18
- LDM3D-VR got accepted to [NeurIPS Workshop'23 on Diffusion Models][https://neurips.cc/virtual/2023/workshop/66539].
19
 
20
- This new checkpoint related to the upscaler called LDM3D-sr.
21
 
22
- # Model description
23
- The abstract from the paper is the following: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano
24
- and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
25
 
26
  ![LDM3D overview](model_overview.png)
27
- <font size="2">LDM3D overview taken from [the original paper](https://arxiv.org/abs/2305.10853)</font>
28
 
29
 
30
- ### How to use
31
 
32
- Here is how to use this model to get the features of a given text in PyTorch:
33
- ```python
34
 
 
35
  from diffusers import StableDiffusionLDM3DPipeline
36
 
37
  pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
38
- pipe.to("cuda")
39
 
 
 
40
 
41
- prompt ="360 view of a large bedroom"
 
 
 
42
  name = "bedroom_pano"
43
 
44
  output = pipe(
@@ -58,20 +85,58 @@ This is the result:
58
 
59
  ![ldm3d_results](ldm3d_pano_results.png)
60
 
 
61
 
62
- ### Finetuning
63
 
64
- This checkpoint finetunes the previous [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c) on 2 panoramic-images datasets:
65
  - [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
66
  - [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set.
67
-
68
 
69
- These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13852 training samples and 1606 validation samples.
70
 
71
- In order to generate the depth map of those samples, we used [DPT-large](https://github.com/isl-org/MiDaS) and to generate the caption we used [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
72
 
 
73
 
74
  ### BibTeX entry and citation info
 
75
  @misc{stan2023ldm3dvr,
76
  title={LDM3D-VR: Latent Diffusion Model for 3D VR},
77
  author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
@@ -79,4 +144,5 @@ In order to generate the depth map of those samples, we used [DPT-large](https:/
79
  eprint={2311.03226},
80
  archivePrefix={arXiv},
81
  primaryClass={cs.CV}
82
- }
 
 
1
  ---
 
 
 
 
 
 
 
2
  language:
3
+ - en
4
+ tags:
5
+ - stable-diffusion
6
+ - stable-diffusion-diffusers
7
+ - text-to-image
8
+ - text-to-panoramic
9
+ model-index:
10
+ - name: ldm3d-pano
11
+   results:
12
+   - task:
13
+       name: Latent Diffusion Model for 3D - Pano
14
+       type: latent-diffusion-model-for-3D-pano
15
+     dataset:
16
+       name: LAION-400M
17
+       type: laion/laion400m
18
+     metrics:
19
+     - name: FID
20
+       type: FID
21
+       value: 118.07
22
+     - name: IS
23
+       type: IS
24
+       value: 4.687
25
+     - name: CLIPsim
26
+       type: CLIPsim
27
+       value: 27.210
28
+     - name: MARE
29
+       type: MARE
30
+       value: 1.54
31
+     - name: ≤90%ile
32
+       type: ≤90%ile
33
+       value: 0.79
34
  pipeline_tag: text-to-3d
35
+ license: creativeml-openrail-m
36
  ---
37
 
38
+ # LDM3D-Pano model
39
 
40
+ The LDM3D-VR model suite was proposed in the paper [LDM3D-VR: Latent Diffusion Model for 3D VR](https://arxiv.org/pdf/2311.03226.pdf), authored by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, and Vasudev Lal.
41
 
42
+ LDM3D-VR was accepted to the [NeurIPS 2023 Workshop on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).
43
 
44
+ This new checkpoint, LDM3D-pano, extends the [LDM3D-4c](https://huggingface.co/Intel/ldm3d-4c) model to panoramic image generation.
45
 
46
+ ## Model details
47
+ The abstract from the paper is the following: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
 
48
 
49
  ![LDM3D overview](model_overview.png)
50
+ <font size="2">LDM3D overview taken from the [LDM3D paper](https://arxiv.org/abs/2305.10853).</font>
51
 
52
 
53
+ ## Usage
54
 
55
+ Here is how to use this model with PyTorch on either a CPU or a GPU:
 
56
 
57
+ ```python
58
  from diffusers import StableDiffusionLDM3DPipeline
59
 
60
  pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-pano")
 
61
 
62
+ # On CPU (choose either this or the GPU option below)
63
+ pipe.to("cpu")
64
 
65
+ # On GPU (alternative to the CPU option above)
66
+ pipe.to("cuda")
67
+
68
+ prompt = "360 view of a large bedroom"
69
  name = "bedroom_pano"
70
 
71
  output = pipe(
 
85
 
86
  ![ldm3d_results](ldm3d_pano_results.png)
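
The usage snippet moves the pipeline to a device with `pipe.to(...)`. As a minimal sketch, the device can instead be chosen at runtime so the same script runs on machines with or without a GPU. The helper names `pick_device` and `load_pipeline` are hypothetical and not part of the model card; only `StableDiffusionLDM3DPipeline.from_pretrained` and the `"Intel/ldm3d-pano"` model ID come from the README.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_pipeline(model_id: str = "Intel/ldm3d-pano"):
    """Hypothetical helper: load the pipeline and move it to the chosen device.

    Not called here, because it downloads the model weights.
    """
    from diffusers import StableDiffusionLDM3DPipeline

    pipe = StableDiffusionLDM3DPipeline.from_pretrained(model_id)
    return pipe.to(pick_device())


device = pick_device()
print(f"Selected device: {device}")
```

Calling `load_pipeline()` would then replace the separate `from_pretrained` and `pipe.to(...)` lines in the usage snippet.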
87
 
88
+ ## Training data
89
 
90
+ The LDM3D model was fine-tuned on a dataset constructed from a subset of LAION-400M, a large-scale dataset of over 400 million image-caption pairs. An additional subset of LAION Aesthetics 6+, consisting of tuples of captions, 512 x 512 images, and depth maps from DPT-BEiT-L-512, is used to fine-tune LDM3D-VR.
91
 
92
+ This checkpoint uses two panoramic-image datasets to further fine-tune the [LDM3D-4c](https://huggingface.co/Intel/ldm3d-4c):
93
  - [polyhaven](https://polyhaven.com/): 585 images for the training set, 66 images for the validation set
94
  - [ihdri](https://www.ihdri.com/hdri-skies-outdoor/): 57 outdoor images for the training set, 7 outdoor images for the validation set.
 
95
 
96
+ These datasets were augmented using [Text2Light](https://frozenburning.github.io/projects/text2light/) to create a dataset containing 13,852 training samples and 1,606 validation samples.
97
+
98
+ We generated the depth maps for these samples with [DPT-large](https://github.com/isl-org/MiDaS) and the captions with [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).
99
+
100
+ ### Finetuning
101
+
102
+ We adopt a multi-stage fine-tuning procedure. We first fine-tune the refined version of the KL-autoencoder in [LDM3D-4c](https://huggingface.co/Intel/ldm3d-4c). Subsequently, the U-Net backbone is fine-tuned based on Stable Diffusion (SD) v1.5. The U-Net is then further fine-tuned on our panoramic image dataset.
103
+
104
+ ## Evaluation results
105
+
106
+ The table below shows the quantitative results of the text-to-pano image metrics at 512 x 1024, evaluated on 332 samples from the validation set.
107
+
108
+ |Method |FID ↓ |IS ↑ |CLIPsim ↑ |
109
+ |----------|------|----------|-----------|
110
+ |Text2light|108.30|4.646±0.27|27.083±3.65|
111
+ |LDM3D-pano|118.07|4.687±0.50|27.210±3.24|
112
+
113
+ The following table shows the quantitative results of the pano depth metrics at 512 x 1024. Reference depth is from DPT-BEiT-L-512.
114
+
115
+ |Method |MARE ↓ |≤90%ile |
116
+ |----------|---------|---------|
117
+ |Joint_3D60|1.75±2.87|0.92±0.87|
118
+ |LDM3D-pano|1.54±2.55|0.79±0.77|
119
+
120
+ The results above are reported in Table 1 and Table 2 of the [LDM3D-VR paper](https://arxiv.org/pdf/2311.03226.pdf).
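
For readers unfamiliar with the depth metric, the sketch below shows one common way to compute a mean absolute relative error between predicted and reference depth values. It assumes MARE stands for mean absolute relative error, which the README does not spell out, and it is not the evaluation code used in the paper. The function name and the toy values are illustrative only.

```python
from typing import Sequence


def mean_absolute_relative_error(
    pred: Sequence[float], ref: Sequence[float], eps: float = 1e-8
) -> float:
    """Average of |pred - ref| / |ref| over entries with a non-zero reference."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    # Skip reference values that are (near) zero to avoid division by zero.
    valid = [(p, r) for p, r in zip(pred, ref) if abs(r) > eps]
    if not valid:
        raise ValueError("no valid reference values")
    return sum(abs(p - r) / abs(r) for p, r in valid) / len(valid)


# Toy depth values (illustrative, not taken from the paper).
predicted = [1.0, 2.0, 4.0]
reference = [1.0, 4.0, 2.0]
mare = mean_absolute_relative_error(predicted, reference)
print(f"MARE = {mare:.3f}")  # per-entry errors are 0.0, 0.5, 1.0
```

Lower values mean the predicted depth is closer to the reference, which matches the ↓ marker in the table header.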
121
+
122
+ ## Ethical Considerations and Limitations
123
+
124
+ For image generation, the [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion-v1-4#limitations) limitations and biases apply. For depth map generation, a first limitation is that we use DPT-large to produce the ground truth; hence, the limitations and biases of [DPT](https://huggingface.co/Intel/dpt-large) also apply.
125
+
126
+ ## Caveats and Recommendations
127
+
128
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
129
+
130
+ Here are a couple of useful links to learn more about Intel's AI software:
131
+ * [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch)
132
+ * [Intel Neural Compressor](https://github.com/intel/neural-compressor)
133
 
134
+ ## Disclaimer
135
 
136
+ The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
137
 
138
  ### BibTeX entry and citation info
139
+ ```bibtex
140
  @misc{stan2023ldm3dvr,
141
  title={LDM3D-VR: Latent Diffusion Model for 3D VR},
142
  author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
 
144
  eprint={2311.03226},
145
  archivePrefix={arXiv},
146
  primaryClass={cs.CV}
147
+ }
148
+ ```