external CLIP vs internal; VRAM utilization question

#2
by froilo - opened

a) Is the internal CLIP fp16?

b) How is so little VRAM used (only around 4 GB) when the text encoder and weights total 23 GB?
Can things be sped up if more VRAM is utilized?

pretty nice at 4 steps (no upscale)

image.png

The internal CLIP was baked at fp8 to reduce computational requirements for people with low VRAM, and the model's weights were merged in such a way as to allow generation in 4 steps.
As far as I know you cannot speed it up; it only takes 4 seconds to generate a 1024 x 1024 image on 24 GB of VRAM.

drbaph changed discussion status to closed

Hate to post in a closed topic, but I'm not sure the T5 weights in this checkpoint are actually FP8. Transformer+CLIP+T5+VAE checkpoints for Flux that are FP8 should be ~17 GB; the extra 4 GB makes me think the T5 was saved as FP16 while the transformer was saved as FP8. See https://huggingface.co/Comfy-Org/flux1-dev/tree/main as an example.
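One way to settle a question like this without loading the model is to read the checkpoint's header. The sketch below assumes the file is in safetensors format (an assumption about this checkpoint): such a file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's `dtype` and `data_offsets`. The function names are made up for illustration, and how tensor names are prefixed depends on how the checkpoint was saved.

```python
import json
import struct
from collections import defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def bytes_by_prefix_and_dtype(header, depth=1):
    """Sum tensor bytes per (name prefix, dtype) pair.

    `depth` is how many dot-separated name components form the prefix.
    """
    totals = defaultdict(int)
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        start, end = info["data_offsets"]
        prefix = ".".join(name.split(".")[:depth])
        totals[(prefix, info["dtype"])] += end - start
    return dict(totals)


def print_summary(path, depth=1):
    """Print one line per (prefix, dtype) with its size in GB (10**9 bytes)."""
    totals = bytes_by_prefix_and_dtype(read_safetensors_header(path), depth)
    for (prefix, dtype), n_bytes in sorted(totals.items()):
        print(f"{prefix:40s} {dtype:10s} {n_bytes / 1e9:8.2f} GB")
```

Calling `print_summary` on the checkpoint path would show whether the text-encoder tensors are stored as `F16` or as an FP8 dtype such as `F8_E4M3`, and how many bytes each group takes.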

OK, that corroborates my findings that generations were the same with the external fp16 CLIP and the internal one.

I'm confident I baked the t5xxl_fp8_e4m3fn transformer into this model merge.
The resulting file size depends on the type of merge [formula] and the quant map.
Every merge will have a different formula and file size. I used the formula provided by Kijai for quantization, and Comfy-Org's tip on the 4-step merge.
Prior to merging, quantizing the models with Kijai's formula resulted in two 12 GB files.
Additionally, I baked the VAE and CLIP fp8 components into the model, which weigh an extra 4.5 GB + 319 MB.
The T5 model in fp16 versus fp8 doesn't show a significant difference in generation, so many people prefer using the FP8 model for its greater computational efficiency during generation.
You could also run a test on the model and observe VRAM usage during inference to assess its efficiency on an 8 GB card vs. an external fp16 T5 CLIP.
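The size reasoning in this exchange can be checked with rough arithmetic: on-disk size is about parameter count times bytes per parameter. The sketch below is only an estimate. The parameter counts are approximate assumptions for illustration, not values read from this checkpoint, and it assumes CLIP stays at fp16 and the VAE at fp32 in both cases.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def checkpoint_size_gb(components):
    """Estimate size in GB (10**9 bytes) from (param_count, dtype_name) pairs."""
    total_bytes = sum(count * BYTES_PER_PARAM[dtype] for count, dtype in components)
    return total_bytes / 1e9


# Approximate parameter counts (assumptions, rounded).
TRANSFORMER = 12.0e9  # Flux transformer
T5_XXL = 4.7e9        # T5-XXL text encoder
CLIP_L = 0.12e9       # CLIP-L text encoder
VAE = 0.08e9          # autoencoder

# Small components kept at the same precision in both scenarios.
small = [(CLIP_L, "fp16"), (VAE, "fp32")]

t5_as_fp8 = checkpoint_size_gb([(TRANSFORMER, "fp8"), (T5_XXL, "fp8")] + small)
t5_as_fp16 = checkpoint_size_gb([(TRANSFORMER, "fp8"), (T5_XXL, "fp16")] + small)

print(f"T5 stored as fp8:  ~{t5_as_fp8:.1f} GB")
print(f"T5 stored as fp16: ~{t5_as_fp16:.1f} GB")
print(f"difference:        ~{t5_as_fp16 - t5_as_fp8:.1f} GB")
```

Under these assumptions the gap between the two cases is simply one extra byte per T5 parameter, so a checkpoint that is several GB larger than an all-FP8 one is consistent with an fp16 text encoder, though merge metadata or extra tensors could also contribute.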

Ah, I think I get it now. I didn't notice the merging tip part before.

The T5 model in fp16 versus fp8 doesn’t show a significant difference in generation, so many people prefer using the FP8 model for its greater computational efficiency during generation.
gud
You could also run a test on the model and observe VRAM usage during inference to assess its efficiency on an 8 GB card vs. an external fp16 T5 CLIP.
I did not.
