Does "Hyper-SD15-1step-lora.safetensors" work for img2img?

#10
by zjysteven - opened

Hi, I'm trying to use "Hyper-SD15-1step-lora.safetensors" with StableDiffusionControlNetImg2ImgPipeline. However, the TCD scheduler seems to produce empty latents here:
https://github.com/huggingface/diffusers/blob/39215aa30e54586419fd3aa1ee467cbee2db908e/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py#L863-L864

Specifically, in the line init_latents = self.scheduler.add_noise(init_latents, noise, timestep), the input init_latents has the correct shape, say [1, 4, 64, 64], but the output init_latents (after the scheduler adds noise) somehow has shape [0, 4, 64, 64]. Is this expected, or am I doing something wrong?

If this is expected with TCDScheduler, is TCD required for using Hyper-SD15-1step-lora.safetensors? Which other schedulers are recommended? Thanks in advance.

ByteDance org

Hi, @zjysteven
Can you provide your example inference script so we can check it for you?

Please see this minimal script:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '7'

import torch

from diffusers import (
    ControlNetModel, 
    StableDiffusionControlNetImg2ImgPipeline, 
    TCDScheduler
)
from diffusers.utils import load_image, make_image_grid
from huggingface_hub import hf_hub_download

controlnet = ControlNetModel.from_pretrained(
    'lllyasviel/control_v11f1e_sd15_tile', 
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    safety_checker=None
).to('cuda')
pipe.enable_xformers_memory_efficient_attention()

pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights(hf_hub_download("ByteDance/Hyper-SD", "Hyper-SD15-1step-lora.safetensors"))

original = load_image(
    'https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png'
)
original = original.resize((512, 512))
low_res = original.resize((64, 64))

prompt = "a dog sitting on the grass, realistic, best quality, extremely detailed"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"

generator = torch.manual_seed(2)
eta = 1.0
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=low_res, 
    control_image=low_res,
    width=512,
    height=512,
    num_inference_steps=1,
    guidance_scale=0.0,
    eta=eta, 
    strength=0.8,
    generator=generator,
).images[0]

Running it will yield the following error:

Traceback (most recent call last):
  File "/home/jz288/coadp_lcm_stream/hypersd_bug.py", line 41, in <module>
    image = pipe(
  File "/home/jz288/anaconda3/envs/vc/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jz288/diffusers/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py", line 1302, in __call__
    image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
  File "/home/jz288/diffusers/src/diffusers/image_processor.py", line 603, in postprocess
    image = torch.stack(
RuntimeError: stack expects a non-empty TensorList

After investigating, the cause is what I mentioned above: the initialized latent somehow has shape [0, 4, 64, 64], so the output image has shape [0, 3, 512, 512], and torch.stack receives an empty TensorList.

ByteDance org

Hi, @zjysteven
You need to set strength=1.0 for num_inference_steps=1 to work.
The strength parameter controls how many of the denoising iterations are actually run, so no timestep remains if strength < 1: 1 * 0.8 = 0.8 < 1, which truncates to 0 steps.
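
To make the arithmetic concrete, here is a minimal sketch of how strength trims the step count. The helper name steps_after_strength is hypothetical, and the formula is an assumption about how the img2img pipelines derive their timesteps, not a copy of the diffusers source:

```python
def steps_after_strength(num_inference_steps: int, strength: float) -> int:
    """Hypothetical helper: number of denoising steps left once strength is applied.

    Assumption: the pipeline truncates num_inference_steps * strength to an
    integer and skips the remaining steps at the start of the schedule.
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


# Print the remaining step count for a few (steps, strength) combinations.
for steps, strength in [(1, 0.8), (1, 1.0), (2, 0.8), (4, 0.5)]:
    print(steps, strength, steps_after_strength(steps, strength))
```

Under that assumption, num_inference_steps=1 with strength=0.8 leaves zero steps, which would explain the empty timestep and the batch dimension of 0 in the latents. A quick check like this before calling the pipeline can catch the problem early.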

You are absolutely right. Can't imagine I missed this simple detail earlier. Thank you.

zjysteven changed discussion status to closed
