# Installation 🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+, and Flax. Follow the installation instructions below for the deep learning library you are using: - [PyTorch](https://pytorch.org/get-started/locally/) installation instructions - [Flax](https://flax.readthedocs.io/en/latest/) installation instructions ## Install with pip You should install 🤗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies. Start by creating a virtual environment in your project directory: ```bash python -m venv .env ``` Activate the virtual environment: ```bash source .env/bin/activate ``` You should also install 🤗 Transformers because 🤗 Diffusers relies on its models: Note - PyTorch only supports Python 3.8 - 3.11 on Windows. ```bash pip install diffusers["torch"] transformers ``` ## Install with conda After activating your virtual environment, with `conda` (maintained by the community): ```bash conda install -c conda-forge diffusers ``` ## Install from source Before installing 🤗 Diffusers from source, make sure you have PyTorch and 🤗 Accelerate installed. To install 🤗 Accelerate: ```bash pip install accelerate ``` Then install 🤗 Diffusers from source: ```bash pip install git+https://github.com/huggingface/diffusers ``` This command installs the bleeding edge `main` version rather than the latest `stable` version. The `main` version is useful for staying up-to-date with the latest developments. For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. However, this means the `main` version may not always be stable. We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can fix it even sooner! ## Editable install You will need an editable install if you'd like to: * Use the `main` version of the source code. * Contribute to 🤗 Diffusers and need to test changes in the code. Clone the repository and install 🤗 Diffusers with the following commands: ```bash git clone https://github.com/huggingface/diffusers.git cd diffusers ``` ```bash pip install -e ".[torch]" ``` These commands will link the folder you cloned the repository to and your Python library paths. Python will now look inside the folder you cloned to in addition to the normal library paths. For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.10/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to. You must keep the `diffusers` folder if you want to keep using the library. Now you can easily update your clone to the latest version of 🤗 Diffusers with the following command: ```bash cd ~/diffusers/ git pull ``` Your Python environment will find the `main` version of 🤗 Diffusers on the next run. ## Cache Model weights and files are downloaded from the Hub to a cache which is usually your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like `from_pretrained()`. Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously downloaded files in the cache. ```shell export HF_HUB_OFFLINE=True ``` For more details about managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide. ## Telemetry logging Our library gathers telemetry information during `from_pretrained()` requests. The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class, and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub. This usage data helps us debug issues and prioritize new features. Telemetry is only sent when loading models and pipelines from the Hub, and it is not collected if you're loading local files. We understand that not everyone wants to share additional information,and we respect your privacy. You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: On Linux/MacOS: ```bash export DISABLE_TELEMETRY=YES ``` On Windows: ```bash set DISABLE_TELEMETRY=YES ``` # Quicktour Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion generated images on the internet. 🧨 Diffusers is a library aimed at making diffusion models widely accessible to everyone. Whether you're a developer or an everyday user, this quicktour will introduce you to 🧨 Diffusers and help you get up and generating quickly! There are three main components of the library to know about: * The `DiffusionPipeline` is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. * Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. * Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. The quicktour will show you how to use the `DiffusionPipeline` for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the `DiffusionPipeline`. The quicktour is a simplified version of the introductory 🧨 Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about 🧨 Diffusers' goal, design philosophy, and additional details about its core API, check out the notebook! Before you begin, make sure you have all the necessary libraries installed: ```py # uncomment to install the necessary libraries in Colab #!pip install --upgrade diffusers accelerate transformers ``` - [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training. - [🤗 Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview). ## DiffusionPipeline The `DiffusionPipeline` is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the `DiffusionPipeline` out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧨 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. | **Task** | **Description** | **Pipeline** |------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| | Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | | Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | | Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | | Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | | Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | Start by creating an instance of a `DiffusionPipeline` and specify which pipeline checkpoint you would like to download. You can use the `DiffusionPipeline` for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint for text-to-image generation. For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧨 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. Load the model with the `from_pretrained()` method: ```python >>> from diffusers import DiffusionPipeline >>> pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) ``` The `DiffusionPipeline` downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the `UNet2DConditionModel` and `PNDMScheduler` among other things: ```py >>> pipeline StableDiffusionPipeline { "_class_name": "StableDiffusionPipeline", "_diffusers_version": "0.21.4", ..., "scheduler": [ "diffusers", "PNDMScheduler" ], ..., "unet": [ "diffusers", "UNet2DConditionModel" ], "vae": [ "diffusers", "AutoencoderKL" ] } ``` We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. You can move the generator object to a GPU, just like you would in PyTorch: ```python >>> pipeline.to("cuda") ``` Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. ```python >>> image = pipeline("An image of a squirrel in Picasso style").images[0] >>> image ```
Save the image by calling `save`: ```python >>> image.save("image_of_squirrel_painting.png") ``` ### Local pipeline You can also use the pipeline locally. The only difference is you need to download the weights first: ```bash !git lfs install !git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 ``` Then load the saved weights into the pipeline: ```python >>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) ``` Now, you can run the pipeline as you would in the section above. ### Swapping schedulers Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧨 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default `PNDMScheduler` with the `EulerDiscreteScheduler`, load it with the `from_config()` method: ```py >>> from diffusers import EulerDiscreteScheduler >>> pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) >>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) ``` Try generating an image with the new scheduler and see if you notice a difference! In the next section, you'll take a closer look at the components - the model and scheduler - that make up the `DiffusionPipeline` and learn how to use these components to generate an image of a cat. ## Models Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems. Models are initiated with the `from_pretrained()` method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the `UNet2DModel`, a basic unconditional image generation model with a checkpoint trained on cat images: ```py >>> from diffusers import UNet2DModel >>> repo_id = "google/ddpm-cat-256" >>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) ``` To access the model parameters, call `model.config`: ```py >>> model.config ``` The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference. Some of the most important parameters are: * `sample_size`: the height and width dimension of the input sample. * `in_channels`: the number of input channels of the input sample. * `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture. * `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks. * `layers_per_block`: the number of ResNet blocks present in each UNet block. To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image: ```py >>> import torch >>> torch.manual_seed(0) >>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) >>> noisy_sample.shape torch.Size([1, 3, 256, 256]) ``` For inference, pass the noisy image and a `timestep` to the model. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output: ```py >>> with torch.no_grad(): ... noisy_residual = model(sample=noisy_sample, timestep=2).sample ``` To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler. ## Schedulers Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`. 🧨 Diffusers is a toolbox for building diffusion systems. While the `DiffusionPipeline` is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. For the quicktour, you'll instantiate the `DDPMScheduler` with its `from_config()` method: ```py >>> from diffusers import DDPMScheduler >>> scheduler = DDPMScheduler.from_pretrained(repo_id) >>> scheduler DDPMScheduler { "_class_name": "DDPMScheduler", "_diffusers_version": "0.21.4", "beta_end": 0.02, "beta_schedule": "linear", "beta_start": 0.0001, "clip_sample": true, "clip_sample_range": 1.0, "dynamic_thresholding_ratio": 0.995, "num_train_timesteps": 1000, "prediction_type": "epsilon", "sample_max_value": 1.0, "steps_offset": 0, "thresholding": false, "timestep_spacing": "leading", "trained_betas": null, "variance_type": "fixed_small" } ``` 💡 Unlike a model, a scheduler does not have trainable weights and is parameter-free! Some of the most important parameters are: * `num_train_timesteps`: the length of the denoising process or, in other words, the number of timesteps required to process random Gaussian noise into a data sample. * `beta_schedule`: the type of noise schedule to use for inference and training. * `beta_start` and `beta_end`: the start and end noise values for the noise schedule. To predict a slightly less noisy image, pass the following to the scheduler's `step()` method: model output, `timestep`, and current `sample`. ```py >>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample >>> less_noisy_sample.shape torch.Size([1, 3, 256, 256]) ``` The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisy! Let's bring it all together now and visualize the entire denoising process. First, create a function that postprocesses and displays the denoised image as a `PIL.Image`: ```py >>> import PIL.Image >>> import numpy as np >>> def display_sample(sample, i): ... image_processed = sample.cpu().permute(0, 2, 3, 1) ... image_processed = (image_processed + 1.0) * 127.5 ... image_processed = image_processed.numpy().astype(np.uint8) ... image_pil = PIL.Image.fromarray(image_processed[0]) ... display(f"Image at step {i}") ... display(image_pil) ``` To speed up the denoising process, move the input and model to a GPU: ```py >>> model.to("cuda") >>> noisy_sample = noisy_sample.to("cuda") ``` Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler: ```py >>> import tqdm >>> sample = noisy_sample >>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): ... # 1. predict noise residual ... with torch.no_grad(): ... residual = model(sample, t).sample ... # 2. compute less noisy image and set x_t -> x_t-1 ... sample = scheduler.step(residual, t, sample).prev_sample ... # 3. optionally look at image ... if (i + 1) % 50 == 0: ... display_sample(sample, i + 1) ``` Sit back and watch as a cat is generated from nothing but noise! 😻
## Next steps Hopefully, you generated some cool images with 🧨 Diffusers in this quicktour! For your next steps, you can: * Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial. * See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases. * Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide. * Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide. * Dive deeper into speeding up 🧨 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx). # Community Projects Welcome to Community Projects. This space is dedicated to showcasing the incredible work and innovative applications created by our vibrant community using the `diffusers` library. This section aims to: - Highlight diverse and inspiring projects built with `diffusers` - Foster knowledge sharing within our community - Provide real-world examples of how `diffusers` can be leveraged Happy exploring, and thank you for being part of the Diffusers community!
Project Name Description
dream-textures Stable Diffusion built-in to Blender
HiDiffusion Increases the resolution and speed of your diffusion model by only adding a single line of code
IC-Light IC-Light is a project to manipulate the illumination of images
InstantID InstantID : Zero-shot Identity-Preserving Generation in Seconds
IOPaint Image inpainting tool powered by SOTA AI Model. Remove any unwanted object, defect, people from your pictures or erase and replace(powered by stable diffusion) any thing on your pictures.
Kohya Gradio GUI for Kohya's Stable Diffusion trainers
MagicAnimate MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
OOTDiffusion Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
SD.Next SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
stable-dreamfusion Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion
StoryDiffusion StoryDiffusion can create a magic story by generating consistent images and videos.
StreamDiffusion A Pipeline-Level Solution for Real-Time Interactive Generation
Stable Diffusion Server A server configured for Inpainting/Generation/img2img with one stable diffusion model
# Effective and efficient diffusion Getting the `DiffusionPipeline` to generate images in a certain style or include what you want can be tricky. Often times, you have to run the `DiffusionPipeline` several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. This tutorial walks you through how to generate faster and better with the `DiffusionPipeline`. Begin by loading the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) model: ```python from diffusers import DiffusionPipeline model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) ``` The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt: ```python prompt = "portrait photo of a old warrior chief" ``` ## Speed 💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module: ```python pipeline = pipeline.to("cuda") ``` To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reusing_seeds): ```python import torch generator = torch.Generator("cuda").manual_seed(0) ``` Now you can generate an image: ```python image = pipeline(prompt, generator=generator).images[0] image ```
This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the `DiffusionPipeline` runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. Let's start by loading the model in `float16` and generate an image: ```python import torch pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True) pipeline = pipeline.to("cuda") generator = torch.Generator("cuda").manual_seed(0) image = pipeline(prompt, generator=generator).images[0] image ```
This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! 💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the `DiffusionPipeline` by calling the `compatibles` method: ```python pipeline.scheduler.compatibles [ diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler, diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler, diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, diffusers.schedulers.scheduling_pndm.PNDMScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_ddim.DDIMScheduler, ] ``` The Stable Diffusion model uses the `PNDMScheduler` by default which usually requires ~50 inference steps, but more performant schedulers like `DPMSolverMultistepScheduler`, require only ~20 or 25 inference steps. Use the `from_config()` method to load a new scheduler: ```python from diffusers import DPMSolverMultistepScheduler pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) ``` Now set the `num_inference_steps` to 20: ```python generator = torch.Generator("cuda").manual_seed(0) image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] image ```
Great, you've managed to cut the inference time to just 4 seconds! ⚡️ ## Memory The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. ```python def get_inputs(batch_size=1): generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] prompts = batch_size * [prompt] num_inference_steps = 20 return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} ``` Start with `batch_size=4` and see how much memory you've consumed: ```python from diffusers.utils import make_image_grid images = pipeline(**get_inputs(batch_size=4)).images make_image_grid(images, 2, 2) ``` Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the `enable_attention_slicing()` function: ```python pipeline.enable_attention_slicing() ``` Now try increasing the `batch_size` to 8! ```python images = pipeline(**get_inputs(batch_size=8)).images make_image_grid(images, rows=2, cols=4) ```
Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. ## Quality In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. ### Better checkpoints The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! ### Better pipeline components You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: ```python from diffusers import AutoencoderKL vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") pipeline.vae = vae images = pipeline(**get_inputs(batch_size=8)).images make_image_grid(images, rows=2, cols=4) ```
### Better prompt engineering The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: - How is the image or similar images of the one I want to generate stored on the internet? - What additional detail can I give that steers the model towards the style I want? With this in mind, let's improve the prompt to include color and higher quality details: ```python prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" ``` Generate a batch of images with the new prompt: ```python images = pipeline(**get_inputs(batch_size=8)).images make_image_grid(images, rows=2, cols=4) ```
Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: ```python prompts = [ "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", "portrait photo of an old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", ] generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images make_image_grid(images, 2, 2) ```
## Next steps In this tutorial, you learned how to optimize a `DiffusionPipeline` for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: - Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster! - If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption. - Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).



# Diffusers 🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or want to train your own diffusion model, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](conceptual/philosophy#usability-over-performance), [simple over easy](conceptual/philosophy#simple-over-easy), and [customizability over abstractions](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction). The library has three main components: - State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers, check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the task they solve. - Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality. - Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
Tutorials

Learn the fundamental skills you need to start generating outputs, build your own diffusion system, and train a diffusion model. We recommend starting here if you're using 🤗 Diffusers for the first time!

How-to guides

Practical guides for helping you load pipelines, models, and schedulers. You'll also learn how to use pipelines for specific tasks, control how outputs are generated, optimize for inference speed, and different training techniques.

Conceptual guides

Understand why the library was designed the way it was, and learn more about the ethical guidelines and safety implementations for using the library.

Reference

Technical descriptions of how 🤗 Diffusers classes and methods work.

# Wuerstchen The [Wuerstchen](https://hf.co/papers/2306.00637) model drastically reduces computational costs by compressing the latent space by 42x, without compromising image quality and accelerating inference. During training, Wuerstchen uses two models (VQGAN + autoencoder) to compress the latents, and then a third model (text-conditioned latent diffusion model) is conditioned on this highly compressed space to generate an image. To fit the prior model into GPU memory and to speedup training, try enabling `gradient_accumulation_steps`, `gradient_checkpointing`, and `mixed_precision` respectively. This guide explores the [train_text_to_image_prior.py](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. Before running the script, make sure you install the library from source: ```bash git clone https://github.com/huggingface/diffusers cd diffusers pip install . ``` Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: ```bash cd examples/wuerstchen/text_to_image pip install -r requirements.txt ``` 🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. Initialize an 🤗 Accelerate environment: ```bash accelerate config ``` To setup a default 🤗 Accelerate environment without choosing any configurations: ```bash accelerate config default ``` Or if your environment doesn't support an interactive shell, like a notebook, you can use: ```py from accelerate.utils import write_basic_config write_basic_config() ``` Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. The following sections highlight parts of the training scripts that are important for understanding how to modify it, but it doesn't cover every aspect of the [script](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) in detail. If you're interested in learning more, feel free to read through the scripts and let us know if you have any questions or concerns. ## Script parameters The training scripts provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L192) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: ```bash accelerate launch train_text_to_image_prior.py \ --mixed_precision="fp16" ``` Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so let's dive right into the Wuerstchen training script! ## Training script The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support Wuerstchen. This guide focuses on the code that is unique to the Wuerstchen training script. The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L441) function starts by initializing the image encoder - an [EfficientNet](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/modeling_efficient_net_encoder.py) - in addition to the usual scheduler and tokenizer. ```py with ContextManagers(deepspeed_zero_init_disabled_context_manager()): pretrained_checkpoint_file = hf_hub_download("dome272/wuerstchen", filename="model_v2_stage_b.pt") state_dict = torch.load(pretrained_checkpoint_file, map_location="cpu") image_encoder = EfficientNetEncoder() image_encoder.load_state_dict(state_dict["effnet_state_dict"]) image_encoder.eval() ``` You'll also load the `WuerstchenPrior` model for optimization. ```py prior = WuerstchenPrior.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="prior") optimizer = optimizer_cls( prior.parameters(), lr=args.learning_rate, betas=(args.adam_beta1, args.adam_beta2), weight_decay=args.adam_weight_decay, eps=args.adam_epsilon, ) ``` Next, you'll apply some [transforms](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) to the images and [tokenize](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L637) the captions: ```py def preprocess_train(examples): images = [image.convert("RGB") for image in examples[image_column]] examples["effnet_pixel_values"] = [effnet_transforms(image) for image in images] examples["text_input_ids"], examples["text_mask"] = tokenize_captions(examples) return examples ``` Finally, the [training loop](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) handles compressing the images to latent space with the `EfficientNetEncoder`, adding noise to the latents, and predicting the noise residual with the `WuerstchenPrior` model. ```py pred_noise = prior(noisy_latents, timesteps, prompt_embeds) ``` If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. ## Launch the script Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! 🚀 Set the `DATASET_NAME` environment variable to the dataset name from the Hub. This guide uses the [Naruto BLIP captions](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions) dataset, but you can create and train on your own datasets as well (see the [Create a dataset for training](create_dataset) guide). To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You’ll also need to add the `--validation_prompt` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results. ```bash export DATASET_NAME="lambdalabs/naruto-blip-captions" accelerate launch train_text_to_image_prior.py \ --mixed_precision="fp16" \ --dataset_name=$DATASET_NAME \ --resolution=768 \ --train_batch_size=4 \ --gradient_accumulation_steps=4 \ --gradient_checkpointing \ --dataloader_num_workers=4 \ --max_train_steps=15000 \ --learning_rate=1e-05 \ --max_grad_norm=1 \ --checkpoints_total_limit=3 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --validation_prompts="A robot naruto, 4k photo" \ --report_to="wandb" \ --push_to_hub \ --output_dir="wuerstchen-prior-naruto-model" ``` Once training is complete, you can use your newly trained model for inference! ```py import torch from diffusers import AutoPipelineForText2Image from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS pipeline = AutoPipelineForText2Image.from_pretrained("path/to/saved/model", torch_dtype=torch.float16).to("cuda") caption = "A cute bird naruto holding a shield" images = pipeline( caption, width=1024, height=1536, prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, prior_guidance_scale=4.0, num_images_per_prompt=2, ).images ``` ## Next steps Congratulations on training a Wuerstchen model! To learn more about how to use your new model, the following may be helpful: - Take a look at the [Wuerstchen](../api/pipelines/wuerstchen#text-to-image-generation) API documentation to learn more about how to use the pipeline for text-to-image generation and its limitations. # Distributed inference On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel. This guide will show you how to use 🤗 Accelerate and PyTorch Distributed for distributed inference. ## 🤗 Accelerate 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code. To begin, create a Python file and initialize an [accelerate.PartialState](https://huggingface.co/docs/accelerate/main/en/package_reference/state#accelerate.PartialState) to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the `DiffusionPipeline` to `distributed_state.device` to assign a GPU to each process. Now use the [split_between_processes](https://huggingface.co/docs/accelerate/main/en/package_reference/state#accelerate.PartialState.split_between_processes) utility as a context manager to automatically distribute the prompts between the number of processes. ```py import torch from accelerate import PartialState from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True ) distributed_state = PartialState() pipeline.to(distributed_state.device) with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt: result = pipeline(prompt).images[0] result.save(f"result_{distributed_state.process_index}.png") ``` Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script: ```bash accelerate launch run_distributed.py --num_processes=2 ``` Refer to this minimal example [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for running inference across multiple GPUs. To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide. ## PyTorch Distributed PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism. To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a `DiffusionPipeline`: ```py import torch import torch.distributed as dist import torch.multiprocessing as mp from diffusers import DiffusionPipeline sd = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True ) ``` You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2. Move the `DiffusionPipeline` to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt: ```py def run_inference(rank, world_size): dist.init_process_group("nccl", rank=rank, world_size=world_size) sd.to(rank) if torch.distributed.get_rank() == 0: prompt = "a dog" elif torch.distributed.get_rank() == 1: prompt = "a cat" image = sd(prompt).images[0] image.save(f"./{'_'.join(prompt)}.png") ``` To run the distributed inference, call [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to run the `run_inference` function on the number of GPUs defined in `world_size`: ```py def main(): world_size = 2 mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True) if __name__ == "__main__": main() ``` Once you've completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script: ```bash torchrun run_distributed.py --nproc_per_node=2 ``` > [!TIP] > You can use `device_map` within a `DiffusionPipeline` to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more. ## Model sharding Modern diffusion systems such as [Flux](../api/pipelines/flux) are very large and have multiple models. For example, [Flux.1-Dev](https://hf.co/black-forest-labs/FLUX.1-dev) is made up of two text encoders - [T5-XXL](https://hf.co/google/t5-v1_1-xxl) and [CLIP-L](https://hf.co/openai/clip-vit-large-patch14) - a [diffusion transformer](../api/models/flux_transformer), and a [VAE](../api/models/autoencoderkl). With a model this size, it can be challenging to run inference on consumer GPUs. Model sharding is a technique that distributes models across GPUs when the models don't fit on a single GPU. The example below assumes two 16GB GPUs are available for inference. Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Use the `max_memory` parameter to allocate the maximum amount of memory for each text encoder on each GPU. > [!TIP] > **Only** load the text encoders for this step! The diffusion transformer and VAE are loaded in a later step to preserve memory. ```py from diffusers import FluxPipeline import torch prompt = "a photo of a dog with cat-like look" pipeline = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", transformer=None, vae=None, device_map="balanced", max_memory={0: "16GB", 1: "16GB"}, torch_dtype=torch.bfloat16 ) with torch.no_grad(): print("Encoding prompts.") prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt( prompt=prompt, prompt_2=None, max_sequence_length=512 ) ``` Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer. ```py import gc def flush(): gc.collect() torch.cuda.empty_cache() torch.cuda.reset_max_memory_allocated() torch.cuda.reset_peak_memory_stats() del pipeline.text_encoder del pipeline.text_encoder_2 del pipeline.tokenizer del pipeline.tokenizer_2 del pipeline flush() ``` Load the diffusion transformer next which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by [Accelerate](https://hf.co/docs/accelerate/index) and available as a part of the [Big Model Inference](https://hf.co/docs/accelerate/concept_guides/big_model_inference) feature. It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed. The trade-off of storing model parameters on slower devices is slower inference latency. ```py from diffusers import FluxTransformer2DModel import torch transformer = FluxTransformer2DModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", device_map="auto", torch_dtype=torch.bfloat16 ) ``` > [!TIP] > At any point, you can try `print(pipeline.hf_device_map)` to see how the various models are distributed across devices. This is useful for tracking the device placement of the models. You can also try `print(transformer.hf_device_map)` to see how the transformer model is sharded across devices. Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because you don't need them yet. ```py pipeline = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", text_encoder=None, text_encoder_2=None, tokenizer=None, tokenizer_2=None, vae=None, transformer=transformer, torch_dtype=torch.bfloat16 ) print("Running denoising.") height, width = 768, 1360 latents = pipeline( prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds, num_inference_steps=50, guidance_scale=3.5, height=height, width=width, output_type="latent", ).images ``` Remove the pipeline and transformer from memory as they're no longer needed. ```py del pipeline.transformer del pipeline flush() ``` Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU. ```py from diffusers import AutoencoderKL from diffusers.image_processor import VaeImageProcessor import torch vae = AutoencoderKL.from_pretrained(ckpt_id, subfolder="vae", torch_dtype=torch.bfloat16).to("cuda") vae_scale_factor = 2 ** (len(vae.config.block_out_channels)) image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor) with torch.no_grad(): print("Running decoding.") latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor) latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor image = vae.decode(latents, return_dict=False)[0] image = image_processor.postprocess(image, output_type="pil") image[0].save("split_transformer.png") ``` By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs. # CogVideoX CogVideoX is a text-to-video generation model focused on creating more coherent videos aligned with a prompt. It achieves this using several methods. - a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy. - an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos. The actual test of the video instruction dimension found that CogVideoX has good effects on consistent theme, dynamic information, consistent background, object information, smooth motion, color, scene, appearance style, and temporal style but cannot achieve good results with human action, spatial relationship, and multiple objects. Finetuning with Diffusers can help make up for these poor results. ## Data Preparation The training scripts accepts data in two formats. The first format is suited for small-scale training, and the second format uses a CSV format, which is more appropriate for streaming data for large-scale training. In the future, Diffusers will support the `