Why 12b? Who could run that locally?

#1
by kaidu88 - opened

The model looks good for sure, but why is there only a 12b model? :(
Even with the best consumer hardware you can barely load this model into VRAM.

Any plans to make a smaller or distilled model, e.g. 2b-4b, that could run on 24GB of VRAM?

why do you people complain about open source models being awful and then expect it to be good at 1B parameters? have you ever used a 1B LLM compared to even a 7B LLM?

quantize to 8 bit -> runs on 12gb -> runs on your 3060

I have a 3090 with 24GB of VRAM. But 12b parameters in float16 are still ~24GB, and that does not include the two text encoders or the internal state of the model.
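
The arithmetic behind these numbers can be checked with a rough sketch. It counts weight storage only, using a hypothetical `weight_gb` helper, and assumes exactly 12 billion parameters and 1 GB = 10^9 bytes; the text encoders, the VAE, and activations would come on top of this.

```python
def weight_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 12e9  # assumed parameter count of the diffusion transformer

# Print the weight footprint at the precisions discussed in this thread.
for name, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_gb(PARAMS, bits):.0f} GB")
```

Real quantized checkpoints are usually somewhat larger than this estimate, since scales are stored alongside the weights and some layers may be kept at higher precision.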

quantizing does not work for image models as it does for llms. At least for all image models I tried so far. Maybe this is the only exception but I would be surprised. Besides that, llms are trained on MUCH larger datasets than image models. I doubt that a 12b image model is really better than a 4b image model - we just don't have enough training data for that. PixArt Alpha is a nice example where a 0.6b model outperforms 2b models with ease.
Besides that, even for llms we nowadays go more and more in the direction of using llms that fit into consumer hardware. So yes, people prefer 7b llms to 400b llms for most tasks, as they are more efficient, run on consumer hardware and are good enough for most tasks. I'm pretty sure there is much room for improvement from the current sota open source models like SDXL, Würstchen, PixArt and so on to a model that still fits into the vram of consumer hardware.

you can quantize to 8 bit and lose nothing dawg. they're selling a service here so why would they skimp on the vram to please people who AREN'T gonna pay? take what you're given

lmao image gen people finally know how we feel

I'm happy to try out quantization Sayak. Any idea when flux will be supported in diffusers?

The quantization we use for image models is really primitive compared to LLMs, I think because users weren't as desperate, lol. Naive FP8 rounding/quantization destroys LLMs too.
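
To illustrate why the scheme matters, here is a toy sketch. It uses integer absmax rounding as a stand-in (not the actual FP8 format, and not what any particular UI does), and the weight values are made up. It compares one shared scale for the whole tensor with one scale per small block.

```python
def quantize_dequantize(values, levels=127):
    """Round-to-nearest absmax quantization with one shared scale, then dequantize."""
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


def blockwise(values, block_size, levels=127):
    """Same rounding, but each block of `block_size` values gets its own scale."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[i:i + block_size], levels))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights: mostly small values plus two large outliers.
weights = [0.01, -0.02, 0.03, -0.015, 8.0, -6.0]
naive_err = mean_abs_error(weights, quantize_dequantize(weights))
block_err = mean_abs_error(weights, blockwise(weights, block_size=2))
print(f"per-tensor error: {naive_err:.4f}, per-block error: {block_err:.4f}")
```

With a single scale, the outliers set the step size and the small weights all round to zero; per-block scales keep them. More advanced schemes build on the same idea.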

I'm sure people will cram it in 3090s with more advanced schemes.

PR is open. Will be merged shortly.

I doubt that a 12b image model is really better than a 4b image model - we just don't have enough training data for that. PixArt Alpha is a nice example where a 0.6b model outperforms 2b models with ease.

pixart uses T5

The text encoder can be processed independently of the model - that's fine. I don't care so much about the size of the text encoders as about the size of the diffusion transformer. It seems you can quantize this to 8 bit without too much loss. I still don't see much chance that we can fine-tune it on consumer hardware; even with LoRAs it will be hard. I like the model - it would still be nice to have a smaller variant, even if it were slightly worse in terms of quality.

Wait, how are you guys having problems running this on a 3090? I'm on one and it runs FINE. I wouldn't want any less than the best so I'm glad it's 12b.

Tried the schnell version. I just did what Comfy said on his Flux example page and it works with 12gb vram without any problems. The only thing that looks scary is my system ram going up to nearly 32/32gb when it loads the model lol. I tried both the default and fp8 setting. Don't see a difference in quality honestly. But I think once there are any kind of controlnets or loras it will be too much for 12gb. At least as of now; probably some smart people will come up with something to reduce vram requirements. They always do :D I will test the dev version as well but I'm too lazy to download another model rn

Did anyone manage to run it on macOS? It looks like it's trying to use around 50GB of RAM because of bf16.

Here is my script for running it in <16gb VRAM.

https://gist.github.com/AmericanPresidentJimmyCarter/873985638e1f3541ba8b00137e7dacd9

Thanks for your service, Ser.
The only change I had to make as of now was to use pip install git+https://github.com/huggingface/diffusers.git@27637a5

I tested it on WSL2 win11 16GB VRAM

test_flux_distilled3.png
test_flux_distilled2.png
test_flux_distilled1.png
test_flux_distilled0.png

Is there any way to tell the model to load across 2x GPUs? I've got dual 3090s and it wasn't something I was able to ChatGPT easily.

Idk about splitting the whole model across two gpus but you could just put the text encoders on one 3090 and the diffusion model on the other, should be able to run it in fp16 that way

llm people: first time?

Working fine on 24GB

from transformers import T5EncoderModel
import time
import gc
import torch
import diffusers

def flush():
    gc.collect()
    torch.cuda.empty_cache()

t5_encoder = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="text_encoder_2", revision="refs/pr/7", torch_dtype=torch.bfloat16
)
text_encoder = diffusers.DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder_2=t5_encoder,
    transformer=None,
    vae=None,
    revision="refs/pr/7",
)
pipeline = diffusers.DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", 
    torch_dtype=torch.bfloat16,
    revision="refs/pr/1",
    text_encoder_2=None,
    text_encoder=None,
)
pipeline.enable_model_cpu_offload()



@torch.inference_mode()
def inference(prompt, num_inference_steps=4, guidance_scale=0.0, width=1024, height=1024):
    # Encode the prompt on the GPU, then move the text encoders back to the CPU
    # so the transformer has the VRAM to itself.
    text_encoder.to("cuda")
    start = time.time()
    (
        prompt_embeds,
        pooled_prompt_embeds,
        _,
    ) = text_encoder.encode_prompt(prompt=prompt, prompt_2=None, max_sequence_length=256)
    text_encoder.to("cpu")
    flush()
    print(f"Prompt encoding time: {time.time() - start}")
    # Run denoising and decoding from the precomputed embeddings.
    output = pipeline(
        prompt_embeds=prompt_embeds.bfloat16(),
        pooled_prompt_embeds=pooled_prompt_embeds.bfloat16(),
        width=width,
        height=height,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps
    )
    image = output.images[0]
    return image

Fp8 thanks to Kijai: https://huggingface.co/Kijai/flux-fp8

Anyone made proper FP8 for schnell like Kijai?

Kijai dev works perfect

If you switch flux1-dev to fp8_e4m3fn in the weight type it seems to work nicely with lcm or ipndm as a sampler on 4 steps.
ComfyUI_00111_.png

Yes, SwarmUI already runs it at fp8 by default.

But I am making an auto installer for a big tutorial, and people will save 11GB of file size.

By the way, I found an fp8 version, fixed the metadata, and it works :)

thanks so much for all the work!

Your workflow (embedded in the image) is using the schnell model not the dev model. The schnell model works w/ 1 to 4 steps out of the box (don't need to use LCM sampler).

If I'm missing something and you have a way to generate proper images w/ the Dev model using only 4 steps please elaborate.

OMG... you are right. This happens when Comfy opens a second window, where I set up the other one with schnell. Dev did not do that. Sorry for the wrong input! It's only a blurry mess.

Is there any way to tell the model to load across 2x GPUs? I've got dual 3090s and it wasn't something I was able to ChatGPT easily.

@freeqaz Loaded across my 2x 3090s using WSL. It seems to be using more of GPU 0

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, device_map='balanced')
# pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power

prompt = '''A dog holding up a sign with a rainbow in it, reading "OP"'''
image = pipe(
    prompt,
    height=512,
    width=512,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
# image.show()
image.save("flux-dev.png")

Example Image

Is there an fp4 for flux dev, maybe? I have an 8GB GPU.

https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c

hey sayak how can we run flux schnell in fp4? do you have a code example?

I also want to try on 8 GB VRAM. Please share code.

I have published a full tutorial and FLUX models as low as 6 GB VRAM out of box with SwarmUI

https://youtu.be/bupRePUOA18

FLUX: The First Ever Open Source txt2img Model Truly Beats Midjourney & Others - FLUX is Awaited SD3

Install files are locked behind a Patreon Paywall, unfortunately. 😥

The model links are in the post, so only the automated process is behind the Patreon post. You can download everything manually. I have shared instructions as well.

Is there an fp4 for flux dev, maybe? I have an 8GB GPU.

I think you actually can run fp8 with 8GB vram, just need to offload a bit to cpu and ram. If you use comfyui it will automatically do that.

SwarmUI lets you run as low as 6 gb vram out of box you can see my tutorial link shared above

Hi guys, here is an alternative interface forked from the Hugging Face one, but with the quantized FP8 model so you can run on 16GB VRAM. Plus some other stuff like automatic saving of the images (PNG and WEBP), image metadata, etc.

https://github.com/Neurone/flux.1-dev-fp8

@neurone hey how much ram does it use? on Comfyui I had peaks of 37GB

There are now 4bit quants that bring it down to 6.7GB. 2 weeks.

I wonder if that would be feasible, especially since I am running on 8GB of VRAM and it takes ages to render. And if it is feasible, can you please say how it would be done? And is there any way I could contribute?

@Aryanne it peaks around 30-32 GB RAM at startup, but it decreases to ~20GB once started. I don't know if that is always the case, though. Unlike VRAM, RAM data can be temporarily transferred to disk (swap partition/file) if you don't have enough, so programs often decide how much RAM to allocate based on your total amount of RAM (I have 64GB of RAM).
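
If anyone wants to report comparable RAM numbers, a minimal sketch for reading the peak resident memory from inside a Python process could look like this. It relies on the standard-library `resource` module, so it assumes Linux, WSL, or macOS (not native Windows); `peak_rss_gb` is a hypothetical helper name.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Peak resident memory of this process so far, in GB (1 GB = 1e9 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1e9


# Call this after loading the model and again after the first image.
print(f"peak RAM so far: {peak_rss_gb():.2f} GB")
```

This only covers system RAM for the current process; VRAM has to be read separately.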

About the 4-bit quantization, that sounds cool, and it would be interesting to understand how much detail you lose during inference.

Here you can see an example of the same inference parameters run with flux.1-dev and flux.1-dev-fp8: (https://github.com/Neurone/flux.1-dev-fp8?tab=readme-ov-file#model-comparison)
Flux.1-dev-fp8 still produces a beautiful image, but you can see that it lacks some "nice" details, and you can notice differences like:

  • fewer stars
  • no necklace
  • fewer feathers in the wings
  • fewer shadows in general (so some elements seem less "deep")
  • right foot shorter than the left one (flux.1-dev makes them the same size)
  • more "rigid" poses of the characters, especially the head of the central angel and the young angel on the left

If you go even lower, I assume the differences will start to be even more evident. If someone wants to try, it would be nice to post here the input parameters (prompt, seed, CFG, steps, width, height) so we can compare the results!
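
One possible way to share those parameters in a form that is easy to copy and compare is a small JSON record. This is only a sketch: the `GenerationParams` class is hypothetical and the values below are placeholders, not the settings used for the comparison image.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationParams:
    """The inputs needed to reproduce one generation."""
    prompt: str
    seed: int
    cfg: float
    steps: int
    width: int
    height: int

    def to_json(self) -> str:
        # Sorted keys keep posted records easy to diff.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "GenerationParams":
        return cls(**json.loads(text))


# Placeholder values for illustration only.
params = GenerationParams(
    prompt="an angel under a starry sky",
    seed=0,
    cfg=3.5,
    steps=50,
    width=1024,
    height=1024,
)
print(params.to_json())
```

Posting the printed line next to each image would let others rerun the same settings on a different quantization.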

@Shehab007 Here you can see the performance I'm experiencing with my configuration and the FP8 model. https://github.com/Neurone/flux.1-dev-fp8?tab=readme-ov-file#performance
It would be interesting to add some more data if you are going to try the FP4 version :)

quantizing does not work for image models as it does for llms. At least for all image models I tried so far. Maybe this is the only exception but I would be surprised. Besides that, llms are trained on MUCH larger datasets than image models. I doubt that a 12b image model is really better than a 4b image model - we just don't have enough training data for that. PixArt Alpha is a nice example where a 0.6b model outperforms 2b models with ease.
Besides that, even for llms we nowadays go more and more into the direction of using llms that fit into consumer hardware. So yes, people prefer 7b llms to 400b llms for most tasks, as they are more efficient, run on consumer hardware and are good enough for most of the tasks. I'm pretty sure there is much space for improvement from the current sota open source models like SDXL, Würstchen, PixArt and so on to a model that still fits into the vram of consumer hardware.

this didnt age well LOL turned out it works EXACTLY like it does in LLM....

When you train your character LoRA on Fal and use it in their playground, the character resembles the subject 100%, but when you download the trained LoRA file and use it in your locally installed ComfyUI with the trigger word, it looks only 50% similar with a LoRA strength of 6, and if you go higher it's a big mess. Why is this happening? Using Flux-dev with an RTX 3090.

Can I sell images generated by flux dev?
