Commit 233b9cd by patrickvonplaten and dg845 · 0 Parent(s)

Duplicate from dg845/diffusers-ct_imagenet64

Co-authored-by: Daniel Gu <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,122 @@
+ ---
+ license: mit
+ tags:
+ - generative model
+ - unconditional image generation
+ duplicated_from: dg845/diffusers-ct_imagenet64
+ ---
+ Consistency models are a new class of generative models introduced in ["Consistency Models"](https://arxiv.org/abs/2303.01469) ([paper](https://arxiv.org/pdf/2303.01469.pdf), [code](https://github.com/openai/consistency_models)) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
+ From the paper abstract:
+
+ > Diffusion models have significantly advanced the fields of image, audio, and video generation, but
+ they depend on an iterative sampling process that causes slow generation. To overcome this limitation,
+ we propose consistency models, a new family of models that generate high quality samples by directly
+ mapping noise to data. They support fast one-step generation by design, while still allowing multistep
+ sampling to trade compute for sample quality. They also support zero-shot data editing, such as image
+ inpainting, colorization, and super-resolution, without requiring explicit training on these tasks.
+ Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone
+ generative models altogether. Through extensive experiments, we demonstrate that they outperform
+ existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new
+ state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64 x 64 for one-step generation. When
+ trained in isolation, consistency models become a new family of generative models that can outperform
+ existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet
+ 64 x 64 and LSUN 256 x 256.
+
+ Intuitively, a consistency model can be thought of as a model which, when evaluated on a noisy image and timestep, returns an output image sample similar to that which would be returned by running a sampling algorithm on a diffusion model.
+ Consistency models can be parameterized by any neural network whose input has the same dimensionality as its output, such as a U-Net.
+
+ More precisely, given a teacher diffusion model and a fixed sampler, we can train ("distill") a consistency model such that, when it is given a noisy image and its corresponding timestep, its output sample is close to the sample that would result from running the sampler on the diffusion model, starting at the same noisy image and timestep.
+ The authors call this procedure "consistency distillation (CD)".
+ Consistency models can also be trained from scratch to generate clean images from a noisy image and timestep, which the authors call "consistency training (CT)".
+
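As a rough illustration of how a network can be wrapped into a consistency model, the paper describes a skip-connection parameterization f(x, t) = c_skip(t) · x + c_out(t) · F(x, t), chosen so that f is the identity at the smallest noise level. The sketch below works on plain scalars and reuses the `sigma_data` and `sigma_min` values from this repository's scheduler config. The function names are illustrative and this is not the `diffusers` implementation.

```python
import math

SIGMA_DATA = 0.5   # "sigma_data" in scheduler/scheduler_config.json
SIGMA_MIN = 0.002  # "sigma_min"; assumed here to play the role of epsilon in the paper


def c_skip(t, sigma_data=SIGMA_DATA, eps=SIGMA_MIN):
    """Weight on the noisy input x; equals 1 at t = eps."""
    return sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)


def c_out(t, sigma_data=SIGMA_DATA, eps=SIGMA_MIN):
    """Weight on the raw network output F(x, t); equals 0 at t = eps."""
    return sigma_data * (t - eps) / math.sqrt(sigma_data**2 + t**2)


def consistency_output(x, network_output, t):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t), shown for scalars."""
    return c_skip(t) * x + c_out(t) * network_output


# At the smallest noise level the model returns its input unchanged,
# whatever the network predicts.
print(consistency_output(0.3, 123.0, SIGMA_MIN))
```

At large noise levels `c_skip` approaches zero, so the output is dominated by the network's prediction rather than the noisy input.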
+ This model is a `diffusers`-compatible version of the [ct_imagenet64.pt](https://github.com/openai/consistency_models#pre-trained-models) checkpoint from the [original code and model release](https://github.com/openai/consistency_models).
+ This model was trained on the ImageNet 64x64 dataset using the consistency training (CT) algorithm.
+ See the [original model card](https://github.com/openai/consistency_models/blob/main/model-card.md) for more information.
+
+ ## Download
+
+ The original PyTorch model checkpoint can be downloaded from the [original code and model release](https://github.com/openai/consistency_models#pre-trained-models).
+
+ The `diffusers` pipeline for the `ct_imagenet64` model can be downloaded as follows:
+
+ ```python
+ from diffusers import ConsistencyModelPipeline
+
+ pipe = ConsistencyModelPipeline.from_pretrained("dg845/diffusers-ct_imagenet64")
+ ```
+
+ ## Usage
+
+ The original model checkpoint can be used with the [original consistency models codebase](https://github.com/openai/consistency_models).
+
+ Here is an example of using the `ct_imagenet64` checkpoint with `diffusers`:
+
+ ```python
+ import torch
+
+ from diffusers import ConsistencyModelPipeline
+
+ device = "cuda"
+ # Load the ct_imagenet64 checkpoint.
+ model_id_or_path = "dg845/diffusers-ct_imagenet64"
+ pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+ pipe.to(device)
+
+ # One-step sampling
+ image = pipe(num_inference_steps=1).images[0]
+ image.save("ct_imagenet64_onestep_sample.png")
+
+ # One-step sampling, class-conditional image generation
+ # ImageNet-64 class label 145 corresponds to king penguins
+ image = pipe(num_inference_steps=1, class_labels=145).images[0]
+ image.save("ct_imagenet64_onestep_sample_penguin.png")
+
+ # Multistep sampling, class-conditional image generation
+ # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo:
+ # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L80
+ image = pipe(num_inference_steps=None, timesteps=[106, 0], class_labels=145).images[0]
+ image.save("ct_imagenet64_multistep_sample_penguin.png")
+ ```
+
+ ## Model Details
+ - **Model type:** Consistency model for unconditional image generation
+ - **Dataset:** ImageNet 64x64
+ - **License:** MIT
+ - **Model Description:** This model performs unconditional image generation. Its main component is a U-Net, which parameterizes the consistency model. This model was trained by the Consistency Model authors.
+ - **Resources for more information:** [Paper](https://arxiv.org/abs/2303.01469), [GitHub Repository](https://github.com/openai/consistency_models), [Original Model Card](https://github.com/openai/consistency_models/blob/main/model-card.md)
+
+ ## Datasets
+
+ _Note: This section is taken from the ["Datasets" section of the original model card](https://github.com/openai/consistency_models/blob/main/model-card.md#datasets)_.
+
+ The models that we are making available have been trained on the [ILSVRC 2012 subset of ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) or on individual categories from [LSUN](https://arxiv.org/abs/1506.03365). Here we outline the characteristics of these datasets that influence the behavior of the models:
+
+ **ILSVRC 2012 subset of ImageNet**: This dataset was curated in 2012 and has around a million pictures, each of which belongs to one of 1,000 categories. A significant number of the categories in this dataset are animals, plants, and other naturally occurring objects. Although many photographs include humans, these humans are typically not represented by the class label (for example, the category "Tench, tinca tinca" includes many photographs of individuals holding fish).
+
+ **LSUN**: This dataset was collected in 2015 by a combination of human labeling via Amazon Mechanical Turk and automated data labeling. Both classes that we consider have more than a million images. The dataset creators discovered that when assessed by trained experts, the label accuracy was approximately 90% throughout the entire LSUN dataset. The pictures are gathered from the internet, and those in the cat class often follow a "meme" format. Occasionally, people, including faces, appear in these photographs.
+
+ ## Performance
+
+ _Note: This section is taken from the ["Performance" section of the original model card](https://github.com/openai/consistency_models/blob/main/model-card.md#performance)_.
+
+ These models are intended to generate samples consistent with their training distributions.
+ This has been measured in terms of FID, Inception Score, Precision, and Recall.
+ These metrics all rely on the representations of a [pre-trained Inception-V3 model](https://arxiv.org/abs/1512.00567),
+ which was trained on ImageNet, and so is likely to focus more on the ImageNet classes (such as animals) than on other visual features (such as human faces).
+
+ ## Intended Use
+
+ _Note: This section is taken from the ["Intended Use" section of the original model card](https://github.com/openai/consistency_models/blob/main/model-card.md#intended-use)_.
+
+ These models are intended to be used for research purposes only. In particular, they can be used as a baseline for generative modeling research, or as a starting point for advancing such research. These models are not intended to be commercially deployed. Additionally, they are not intended to be used to create propaganda or offensive imagery.
+
+ ## Limitations
+
+ _Note: This section is taken from the ["Limitations" section of the original model card](https://github.com/openai/consistency_models/blob/main/model-card.md#limitations)_.
+
+ These models sometimes produce highly unrealistic outputs, particularly when generating images containing human faces.
+ This may stem from ImageNet's emphasis on non-human objects.
+
+ In consistency distillation and training, minimizing LPIPS results in better sample quality, as evidenced by improved FID and Inception scores. However, it also carries the risk of overestimating model performance, because LPIPS uses a VGG network pre-trained on ImageNet, while FID and Inception scores also rely on convolutional neural networks (the Inception network in particular) pre-trained on the same ImageNet dataset. Although these two convolutional neural networks do not share the same architecture and we extract latents from them in substantially different ways, knowledge leakage is still plausible which can undermine the fidelity of FID and Inception scores.
+
+ Because ImageNet and LSUN contain images from the internet, they include photos of real people, and the model may have memorized some of the information contained in these photos. However, these images are already publicly available, and existing generative models trained on ImageNet have not demonstrated significant leakage of this information.
model_index.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "_class_name": "ConsistencyModelPipeline",
+   "_diffusers_version": "0.17.0.dev0",
+   "scheduler": [
+     "diffusers",
+     "CMStochasticIterativeScheduler"
+   ],
+   "unet": [
+     "diffusers",
+     "UNet2DModel"
+   ]
+ }
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_class_name": "CMStochasticIterativeScheduler",
+   "_diffusers_version": "0.17.0.dev0",
+   "clip_denoised": true,
+   "num_train_timesteps": 201,
+   "rho": 7.0,
+   "s_noise": 1.0,
+   "sigma_data": 0.5,
+   "sigma_max": 80.0,
+   "sigma_min": 0.002
+ }
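To give a feel for what `rho`, `sigma_min`, `sigma_max`, and `num_train_timesteps` control, here is a minimal sketch of the noise-level spacing proposed by Karras et al. (2022), which consistency models build on. It assumes index 0 maps to `sigma_max` and the last index to `sigma_min`. The scheduler class in `diffusers` may order or index the levels differently, so treat this as an illustration of the formula, not as the library's behavior. The function name `karras_sigmas` is made up for this example.

```python
def karras_sigmas(num_steps=201, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, interpolated in sigma**(1/rho) space.

    Defaults mirror scheduler/scheduler_config.json; the index direction is an assumption.
    """
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    return [
        (max_inv_rho + (i / (num_steps - 1)) * (min_inv_rho - max_inv_rho)) ** rho
        for i in range(num_steps)
    ]


sigmas = karras_sigmas()
# Number of levels, then the largest and smallest noise level.
print(len(sigmas), sigmas[0], sigmas[-1])
```

A larger `rho` concentrates more of the levels near `sigma_min`, so the schedule spends more steps at low noise.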
unet/config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "_class_name": "UNet2DModel",
+   "_diffusers_version": "0.17.0.dev0",
+   "act_fn": "silu",
+   "add_attention": true,
+   "attention_head_dim": 64,
+   "block_out_channels": [
+     192,
+     384,
+     576,
+     768
+   ],
+   "center_input_sample": false,
+   "class_embed_type": null,
+   "down_block_types": [
+     "ResnetDownsampleBlock2D",
+     "AttnDownsampleBlock2D",
+     "AttnDownsampleBlock2D",
+     "AttnDownsampleBlock2D"
+   ],
+   "downsample_padding": 1,
+   "flip_sin_to_cos": true,
+   "freq_shift": 0,
+   "in_channels": 3,
+   "layers_per_block": 3,
+   "mid_block_scale_factor": 1,
+   "norm_eps": 1e-05,
+   "norm_num_groups": 32,
+   "num_class_embeds": 1000,
+   "out_channels": 3,
+   "resnet_time_scale_shift": "scale_shift",
+   "sample_size": 64,
+   "time_embedding_type": "positional",
+   "up_block_types": [
+     "AttnUpsampleBlock2D",
+     "AttnUpsampleBlock2D",
+     "AttnUpsampleBlock2D",
+     "ResnetUpsampleBlock2D"
+   ]
+ }
unet/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d9b49667a9997fd6b8e30d14f78872a71165b2a0ae63b31f4741971992fc0eb
+ size 1183833415
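The three lines above are a Git LFS pointer that stands in for the actual weights file. As a hedged sketch, a downloaded `diffusion_pytorch_model.bin` could be checked against this pointer by comparing its byte size and SHA-256 digest. The helper names below are hypothetical and not part of Git LFS or `diffusers`.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file has the size and SHA-256 digest named by the pointer."""
    hasher = hashlib.sha256()  # assumes the pointer's algo is sha256
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so a ~1.2 GB checkpoint is never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0d9b49667a9997fd6b8e30d14f78872a71165b2a0ae63b31f4741971992fc0eb
size 1183833415
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
```

Calling `file_matches_pointer("unet/diffusion_pytorch_model.bin", pointer)` on a local copy would then report whether the download is intact; the path is only an example.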