---
license: mit
library_name: diffusers
---

# Stage-A-ft-HQ

`stage-a-ft-hq` is a version of [Würstchen](https://huggingface.co/warp-ai/wuerstchen)'s **Stage A** that was finetuned to have slightly-nicer-looking textures.

`stage-a-ft-hq` works with any Würstchen-derived model (including [Stable Cascade](https://huggingface.co/stabilityai/stable-cascade)).

## Example comparison

| Stable Cascade                    | Stable Cascade + `stage-a-ft-hq`   |
| --------------------------------- | ---------------------------------- |
| ![](example_baseline.png)         | ![](example_finetuned.png)         |
| ![](example_baseline_closeup.png) | ![](example_finetuned_closeup.png) |

## Explanation

Image generators like Würstchen and Stable Cascade create images via a multi-stage process.
Stage A is the final stage, responsible for rendering full-resolution, human-interpretable images from the output of the earlier stages.

The original Stage A tends to render slightly-smoothed-out images with a distinctive noise pattern on top.

`stage-a-ft-hq` was finetuned briefly on a high-quality dataset in order to reduce these artifacts.

## Suggested Settings

To generate highly detailed images, you probably want to use `stage-a-ft-hq` (which improves very fine detail) in combination with a large Stage B step count (which [improves mid-level detail](https://old.reddit.com/r/StableDiffusion/comments/1ar359h/cascade_can_generate_directly_at_1536x1536_and/kqhjtk5/)).
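
As a rough sketch of the step-count half of that suggestion: in the Diffusers pipeline, the decoder's `num_inference_steps` argument is the Stage B step count. The helper name `decoder_kwargs` and the value `40` below are hypothetical and only for illustration; I haven't tested which step count works best.

```python
def decoder_kwargs(stage_b_steps=20, negative_prompt=""):
    """Build keyword arguments for a StableCascadeDecoderPipeline call.

    Hypothetical helper (not part of diffusers). `stage_b_steps` is passed
    through as `num_inference_steps`, i.e. the Stage B step count.
    """
    if stage_b_steps < 1:
        raise ValueError("stage_b_steps must be a positive integer")
    return {
        "negative_prompt": negative_prompt,
        "guidance_scale": 0.0,
        "output_type": "pil",
        "num_inference_steps": stage_b_steps,
    }


# Same decoder settings as the usage example in this README (20 steps)
baseline = decoder_kwargs()

# Larger Stage B step count; 40 is an illustrative value, not a tested recommendation
detailed = decoder_kwargs(stage_b_steps=40)
```

With a `decoder` pipeline loaded as in the usage section, this would be applied as `decoder(image_embeddings=..., prompt=prompt, **detailed)`.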

## 🧨 Diffusers Usage

⚠️ As of 2024-02-17, the Diffusers [PR](https://github.com/huggingface/diffusers/pull/6487) that adds Stable Cascade support is still under review.
I've only tested Stable Cascade with this particular version of the PR:

```bash
pip install --upgrade --force-reinstall https://github.com/kashif/diffusers/archive/a3dc21385b7386beb3dab3a9845962ede6765887.zip
```

```py
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
from diffusers.pipelines.wuerstchen import PaellaVQModel

device = "cuda"
num_images_per_prompt = 1

# Load the Stage-A-ft-HQ model
stage_a_ft_hq = PaellaVQModel.from_pretrained("madebyollin/stage-a-ft-hq", torch_dtype=torch.float16).to(device)

# Load the normal Stable Cascade pipeline (the decoder pipeline includes the original Stage A)
prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16).to(device)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

# Swap in the Stage-A-ft-HQ model in place of the original Stage A
decoder.vqgan = stage_a_ft_hq

prompt = "Photograph of Seattle streets on a snowy winter morning"
negative_prompt = ""

prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=20
)
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=20
).images

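# display() is provided by Jupyter/IPython notebooks;
# in a plain script, use decoder_output[0].save("output.png") instead
display(decoder_output[0])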
```