Diffusers documentation

Shap-E

You are viewing v0.18.2 version. A newer version v0.32.1 is available.

Overview

The Shap-E model was proposed in Shap-E: Generating Conditional 3D Implicit Functions by Alex Nichol and Heewoo Jun from OpenAI.

The abstract of the paper is the following:

We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.

The original codebase can be found in the openai/shap-e repository on GitHub.

Available Pipelines:

Pipeline                      Tasks
pipeline_shap_e.py            Text-to-Image Generation
pipeline_shap_e_img2img.py    Image-to-Image Generation

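Available checkpoints

  • openai/shap-e (used with ShapEPipeline in the examples on this page)
  • openai/shap-e-img2img (used with ShapEImg2ImgPipeline in the examples on this page)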

Usage Examples

In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.

Text-to-3D image generation

We can use ShapEPipeline to create a 3D object based on a text prompt. In this example, we will make a firecracker and a birthday cupcake for the 🧨 Diffusers library’s first birthday. The workflow for the Shap-E text-to-image pipeline is the same as for other text-to-image pipelines in diffusers.

import torch

from diffusers import DiffusionPipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)

guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]

images = pipe(
    prompt,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images

The output of ShapEPipeline is a list of lists of image frames, one inner list per generated object. Each list of frames can be used to create a 3D object. Let’s use the export_to_gif utility function in diffusers to make a 3D cupcake!

from diffusers.utils import export_to_gif

export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
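
Because images holds one list of frames per generated object, exporting every result can also be written as a loop. The sketch below is one way to do that, assuming one list of frames per prompt (that is, num_images_per_prompt left at 1). The names slugify_prompt and export_all are hypothetical helpers, not part of diffusers; the exporter is passed in as an argument so that export_to_gif, or any callable taking frames and a path, can be used.

```python
import re


def slugify_prompt(prompt):
    """Turn a prompt into a lowercase, filesystem-friendly name (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return slug or "output"


def export_all(images, prompts, export_fn):
    """Export one GIF per prompt, assuming one list of frames per prompt.

    export_fn is any callable taking (frames, path), for example
    diffusers.utils.export_to_gif.
    """
    if len(images) != len(prompts):
        raise ValueError("expected one list of frames per prompt")
    paths = []
    for frames, prompt in zip(images, prompts):
        path = f"{slugify_prompt(prompt)}_3d.gif"
        export_fn(frames, path)
        paths.append(path)
    return paths
```

With the variables from the example above, export_all(images, prompt, export_to_gif) would write a_firecracker_3d.gif and a_birthday_cupcake_3d.gif.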

Image-to-Image generation

You can use ShapEImg2ImgPipeline along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.

In this example, we will first generate a cheeseburger with a simple prompt, “A cheeseburger, white background”:

from diffusers import DiffusionPipeline
import torch

pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")

t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")

prompt = "A cheeseburger, white background"

image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
image = t2i_pipe(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
).images[0]

image.save("burger.png")

We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)

from PIL import Image
from diffusers.utils import export_to_gif

repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))

images = pipe(
    image,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images

gif_path = export_to_gif(images[0], "burger_3d.gif")
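
The call to resize((256, 256)) above stretches any input that is not square. If that matters for your images, one option is to center-crop to a square before resizing. The helper below is a hypothetical illustration in plain Python; it is not part of diffusers and the pipeline does not require it. It only computes a crop box in the (left, upper, right, lower) order that PIL's Image.crop expects.

```python
def center_square_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```

Assuming image is a PIL image, it could be applied as image.crop(center_square_box(*image.size)).resize((256, 256)).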

ShapEPipeline

class diffusers.ShapEPipeline

( prior: PriorTransformer, text_encoder: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, scheduler: HeunDiscreteScheduler, renderer: ShapERenderer )

Parameters

  • prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
  • text_encoder (CLIPTextModelWithProjection) — Frozen text-encoder.
  • tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
  • scheduler (HeunDiscreteScheduler) — A scheduler to be used in combination with the prior to generate the image embedding.
  • renderer (ShapERenderer) — The Shap-E renderer projects the generated latents into the parameters of an MLP that is used to create 3D objects with the NeRF rendering method.

Pipeline for generating the latent representation of a 3D asset and rendering it with the NeRF method using Shap-E.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

( prompt: str, num_images_per_prompt: int = 1, num_inference_steps: int = 25, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, guidance_scale: float = 4.0, frame_size: int = 64, output_type: typing.Optional[str] = 'pil', return_dict: bool = True ) → ShapEPipelineOutput or tuple

Parameters

  • prompt (str or List[str]) — The prompt or prompts to guide the image generation.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • frame_size (int, optional, defaults to 64) — The width and height of each image frame of the generated 3D output.
  • output_type (str, optional, defaults to "pil", as in the signature above) — The output format of the generated images, for example "pil" (PIL.Image.Image) or "np" (np.array).
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ShapEPipelineOutput instead of a plain tuple.
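
To make the role of guidance_scale concrete, the sketch below applies the usual classifier-free guidance combination to two toy predictions. It is written in plain Python on lists rather than tensors, and apply_guidance is a hypothetical name used for illustration; it is not the pipeline's internal code.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise.

    Uses the common classifier-free guidance form:
        guided = uncond + guidance_scale * (cond - uncond)
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without conditioning
cond = [1.0, 3.0]    # toy prediction with conditioning

# A scale of 1 returns the conditional prediction unchanged.
print(apply_guidance(uncond, cond, 1.0))
# Larger scales move further along the direction (cond - uncond).
print(apply_guidance(uncond, cond, 4.0))
```

Under this form, a scale of 1 reproduces the conditional prediction, which is why guidance only takes effect for values above 1.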

Returns

ShapEPipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> repo = "openai/shap-e"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)

>>> guidance_scale = 15.0
>>> prompt = "a shark"

>>> images = pipe(
...     prompt,
...     guidance_scale=guidance_scale,
...     num_inference_steps=64,
...     frame_size=256,
... ).images

>>> gif_path = export_to_gif(images[0], "shark_3d.gif")

enable_model_cpu_offload

( gpu_id = 0 )

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forward method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better because the iteratively executed model (in this pipeline, the prior) stays on the GPU for the whole denoising loop.

enable_sequential_cpu_offload

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and are then moved to torch.device('meta'); they are loaded to the GPU only when their specific submodule has its forward method called.
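
A hedged sketch of how one of the two offloading methods above might be selected at load time. The function load_shap_e and its offload argument are hypothetical names for illustration. The sketch assumes a CUDA device and an installed accelerate package, and it keeps the heavy imports inside the function so that the argument check runs without them.

```python
def load_shap_e(repo="openai/shap-e", offload="model"):
    """Load a Shap-E pipeline with an optional CPU offloading strategy.

    offload: "none" keeps the whole pipeline on the GPU,
             "model" calls enable_model_cpu_offload(),
             "sequential" calls enable_sequential_cpu_offload().
    """
    if offload not in ("none", "model", "sequential"):
        raise ValueError(f"unknown offload mode: {offload!r}")

    # Imported lazily so the argument check above needs no heavy dependencies.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
    if offload == "model":
        # Faster option: whole models are moved to the GPU one at a time.
        pipe.enable_model_cpu_offload()
    elif offload == "sequential":
        # Lowest memory option: submodules are loaded only for their forward pass.
        pipe.enable_sequential_cpu_offload()
    else:
        pipe = pipe.to("cuda")
    return pipe
```

In this sketch the pipeline is moved with to("cuda") only when no offloading is requested, since the offloading methods place the models on the GPU themselves.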

ShapEImg2ImgPipeline

class diffusers.ShapEImg2ImgPipeline

( prior: PriorTransformer, image_encoder: CLIPVisionModel, image_processor: CLIPImageProcessor, scheduler: HeunDiscreteScheduler, renderer: ShapERenderer )

Parameters

  • prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
  • image_encoder (CLIPVisionModel) — Frozen image encoder.
  • image_processor (CLIPImageProcessor) — A CLIPImageProcessor used to process the input images.
  • scheduler (HeunDiscreteScheduler) — A scheduler to be used in combination with the prior to generate the image embedding.
  • renderer (ShapERenderer) — The Shap-E renderer projects the generated latents into the parameters of an MLP that is used to create 3D objects with the NeRF rendering method.

Pipeline for generating the latent representation of a 3D asset from an image and rendering it with the NeRF method using Shap-E.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image]], num_images_per_prompt: int = 1, num_inference_steps: int = 25, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, guidance_scale: float = 4.0, frame_size: int = 64, output_type: typing.Optional[str] = 'pil', return_dict: bool = True ) → ShapEPipelineOutput or tuple

Parameters

  • image (PIL.Image.Image or List[PIL.Image.Image]) — The image or images to guide the 3D generation.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per input image.
  • num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different inputs. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
  • guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the input image, usually at the expense of lower image quality.
  • frame_size (int, optional, defaults to 64) — The width and height of each image frame of the generated 3D output.
  • output_type (str, optional, defaults to "pil", as in the signature above) — The output format of the generated images, for example "pil" (PIL.Image.Image) or "np" (np.array).
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ShapEPipelineOutput instead of a plain tuple.

Returns

ShapEPipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Examples:

>>> from PIL import Image
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif, load_image

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> repo = "openai/shap-e-img2img"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)

>>> guidance_scale = 3.0
>>> image_url = "https://hf.co/datasets/diffusers/docs-images/resolve/main/shap-e/corgi.png"
>>> image = load_image(image_url).convert("RGB")

>>> images = pipe(
...     image,
...     guidance_scale=guidance_scale,
...     num_inference_steps=64,
...     frame_size=256,
... ).images

>>> gif_path = export_to_gif(images[0], "corgi_3d.gif")

enable_sequential_cpu_offload

( gpu_id = 0 )

Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline’s models have their state dicts saved to CPU and are then moved to torch.device('meta'); they are loaded to the GPU only when their specific submodule has its forward method called.