Diffusers missing config.json file

#3
by jspaun - opened

Hi,

I'm running this code in a python script and when I try to load the text encoder model, I get an error that says missing config.json file.

I tried just putting the OpenAI HF config there, as it's for the same model. It's possible that they changed something, but please try and see if it works. If there's an error about the config, I'd appreciate the log of that so I can go and figure out how to make this config. :)

Thanks for adding the config. Now I'm getting an error saying that more files are missing:
OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory

Thanks for the update. Good thing it's Friday, so I'll have time to look into how this works later on (and test it myself first). I'll give you an update once it's done (so if you see arbitrary new files here - that just means I am experimenting, and not necessarily that it works just yet).

Traceback (most recent call last):
  File "C:\Users\zer0int\4CLIP-HF\get-cos.py", line 12, in <module>
    model = CLIPModel.from_pretrained(
  File "C:\Users\zer0int\AppData\Roaming\Python\Python310\site-packages\transformers\modeling_utils.py", line 3738, in from_pretrained
    if metadata.get("format") == "pt":
AttributeError: 'NoneType' object has no attribute 'get'
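
One plausible cause of this AttributeError (an assumption, the thread does not confirm it) is that the uploaded .safetensors file has no `__metadata__` entry in its header, so `metadata` is None when from_pretrained looks for the "format" key. Here is a minimal standard-library sketch for checking that. `read_safetensors_metadata` is a hypothetical helper name; the layout it reads (an 8-byte little-endian header length followed by a JSON header) is the safetensors file format.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the header's `__metadata__` dict, or None if the file has none."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the JSON header length as a little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The header maps tensor names to dtype/shape/offsets, plus an
        # optional "__metadata__" entry holding free-form string pairs.
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__")
```

If this returns None for the model file, re-saving the weights with `safetensors.torch.save_file(tensors, path, metadata={"format": "pt"})` may be enough for from_pretrained to accept it.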

...Fine-tuning a CLIP model is really easy, compared to figuring out HuggingFace config requirements!
But, I came up with a workaround. We can just pretend we are loading the "openai/clip-vit-large-patch14" model and THEN swap out the state_dict of that model with my model.

Download the FULL model you want to use, and replace your-chosen-model.safetensors in the script with its filename. Here's an example script (it needs an image file named cat.jpg to run):

import torch
from transformers import CLIPProcessor, CLIPModel
from safetensors.torch import load_file
from PIL import Image
import torch.nn.functional as F

# Step 1: Load the pre-trained CLIP model from Hugging Face (openai version)
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Step 2: Load the custom fine-tuned weights from your local safetensors file
safetensors_path = "your-chosen-model.safetensors"
custom_weights = load_file(safetensors_path)

# Get the current state_dict from the Hugging Face model
model_state_dict = model.state_dict()

# Overwrite matching entries with the custom weights. Keys that exist only in the
# custom file are ignored; keys that exist only in the donor model keep the OpenAI values.
for key in custom_weights:
    if key in model_state_dict:
        model_state_dict[key] = custom_weights[key]  # Overwrite with custom weights

# Load the modified state_dict into the model
model.load_state_dict(model_state_dict)

# Tokenizer / Pre-processing is the exact same for all CLIP, no matter if fine-tuned or not
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("cat.jpg")
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

image_embeds = F.normalize(image_embeds, p=2, dim=-1)
text_embeds = F.normalize(text_embeds, p=2, dim=-1)

cosine_sim = torch.matmul(text_embeds, image_embeds.T)

# Just dump it raw, it's just an example anyway :-)
print(f"Cosine similarity (text vs image):\n{cosine_sim}")

I hope this works for you. Sorry I couldn't figure out a better solution at this point!
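
The overwrite loop in the script silently ignores any key that does not match, so it can be hard to tell how much of the custom model actually got loaded. A hedged variant that reports the mismatches could look like this. `merge_state_dict` is a hypothetical helper, and plain numbers stand in for tensors so the sketch runs without PyTorch.

```python
def merge_state_dict(donor_state, custom_weights):
    """Overlay custom weights onto a donor state dict and report mismatches."""
    merged = dict(donor_state)
    skipped = []  # keys in the custom file that the donor model does not have
    for key, value in custom_weights.items():
        if key in merged:
            merged[key] = value
        else:
            skipped.append(key)
    # Donor keys the custom file never supplied: these keep the donor weights.
    untouched = [key for key in donor_state if key not in custom_weights]
    return merged, skipped, untouched


# Toy example with plain values standing in for tensors.
donor = {"text.weight": 0.0, "vision.weight": 0.0, "logit_scale": 1.0}
custom = {"text.weight": 0.5, "visual.proj": 0.9}
merged, skipped, untouched = merge_state_dict(donor, custom)
print(merged)
print("skipped:", skipped)
print("untouched:", untouched)
```

With real tensors, a long `skipped` list would suggest that the custom file uses a different key naming scheme than the Hugging Face model, in which case most of the donor weights stay in place.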

Thanks for the workaround, I've been at this for days now. I'm surprised this didn't show up on Google when I searched for the error.

I get this error from your workaround. Any ideas on how to fix it?
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/worker.py", line 312, in _setup
run_setup(self._predictor)
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/predictor.py", line 89, in run_setup
predictor.setup()
File "/src/predict.py", line 230, in setup
self.txt2img_pipe = FluxPipeline.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 852, in from_pretrained
maybe_raise_or_warn(
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 263, in maybe_raise_or_warn
raise ValueError(
ValueError: CLIPTextTransformer(
(embeddings): CLIPTextEmbeddings(
(token_embedding): Embedding(49408, 768)
(position_embedding): Embedding(77, 768)
)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-11): 12 x CLIPEncoderLayer(
(self_attn): CLIPSdpaAttention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
(layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
) is of type: <class 'transformers.models.clip.modeling_clip.CLIPTextTransformer'>, but should be <class 'transformers.modeling_utils.PreTrainedModel'>
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/runner.py", line 222, in _handle_done
f.result()
File "/root/.pyenv/versions/3.11.10/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.10/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
cog.server.exceptions.FatalWorkerException: Predictor errored during setup: CLIPTextTransformer(
(embeddings): CLIPTextEmbeddings(
(token_embedding): Embedding(49408, 768)
(position_embedding): Embedding(77, 768)
)
(encoder): CLIPEncoder(
(layers): ModuleList(
(0-11): 12 x CLIPEncoderLayer(
(self_attn): CLIPSdpaAttention(
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): CLIPMLP(
(activation_fn): QuickGELUActivation()
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
(layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
) is of type: <class 'transformers.models.clip.modeling_clip.CLIPTextTransformer'>, but should be <class 'transformers.modeling_utils.PreTrainedModel'>

Here's an update: The model now works when you load it 'normally'.

from transformers import CLIPProcessor, CLIPModel

model_id = "zer0int/CLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

But: ⚠️⚠️ DO NOT USE THE MODEL! ⚠️⚠️ Here's why:

CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

Cosine similarity (image vs 'A photo of a cat'): 0.2330581396818161
Cosine similarity (image vs 'A picture of a dog'): 0.15255104005336761
Cosine similarity (image vs 'cat'): 0.21000739932060242
Cosine similarity (image vs 'dog'): 0.14514459669589996

So, that's OpenAI's CLIP working as expected.
I have re-converted my fine-tuned model to HuggingFace model.safetensors, from the original pickle file - just to make sure. And:

CLIPModel.from_pretrained("zer0int/CLIP-GmP-ViT-L-14")

Image vs 'A photo of a cat': 0.05461934581398964
Image vs 'A picture of a dog': 0.030599746853113174
Image vs 'cat': -0.0010263863950967789
Image vs 'dog': 0.004391679540276527

I have no idea what is going on with that derailed cosine similarity. 🧐

And, to make sure, I am loading the exact same model file I used for the conversion to .safetensors, but this time as the original PyTorch pickle (.pt) file, not in Hugging Face format. Doing the exact same thing, I get:

# Loading original torch.save pickle of my fine-tune.

Cosine similarity (image vs 'A photo of a cat'): 0.2086181640625
Cosine similarity (image vs 'A picture of a dog'): 0.08636474609375
Cosine similarity (image vs 'cat'): 0.1849365234375
Cosine similarity (image vs 'dog'): 0.0947265625

This is absolutely as expected. Slightly less confident than original CLIP about this being a "cat" - but absolutely SUPER confident that this is NOT a dog.
That re-organization of embeddings is why my model outperforms the original one. Working as expected.
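
To put a number on "more confident that this is not a dog", one option is to compare the gap between the cat and dog scores for each way of loading the model. This is only arithmetic on the values reported in this thread (the 'A photo of a cat' and 'A picture of a dog' prompts); the dictionary labels are made up for the example.

```python
# Cosine similarities copied from the outputs reported in this thread.
scores = {
    "openai/clip-vit-large-patch14": {"cat": 0.2330581396818161, "dog": 0.15255104005336761},
    "fine-tune (original pickle)": {"cat": 0.2086181640625, "dog": 0.08636474609375},
    "fine-tune (HF conversion)": {"cat": 0.05461934581398964, "dog": 0.030599746853113174},
}

# A larger cat-minus-dog gap means the two prompts are separated more clearly.
margins = {name: s["cat"] - s["dog"] for name, s in scores.items()}
for name, margin in margins.items():
    print(f"{name}: cat - dog = {margin:.4f}")
```

On these numbers the gap is roughly 0.08 for OpenAI's model, 0.12 for the fine-tune loaded from the pickle, and 0.02 for the converted file.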

I have no idea what is going on; I've been at this for hours non-stop. I guess I'll have to ask around on Hugging Face.

You can try loading and using the model (just for the sake of verifying it works), but again: It is completely ruined and should not be used by anybody for anything. ⚠️

Just leaving this here in case somebody loads it, then comes here to complain about the 'worst model ever', and hopefully sees this. I will update once I have news. Sorry about the wait!

The model is finally fixed! 🌟🥳

You should be able to use it in the 'normal' way with just zer0int/CLIP-GmP-ViT-L-14 🤗
(see above community thread link for tech details, if interested)

I still get an error when I try to replace Flux's CLIP:
"AttributeError: 'CLIPTextTransformer' object has no attribute 'dtype'. Did you mean: 'type'?"

from diffusers import FluxPipeline
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/CLIP-GmP-ViT-L-14"

clip_model = CLIPModel.from_pretrained(model_id)
clip_processor = CLIPProcessor.from_pretrained(model_id)

# bfl_repo (the Flux repository ID) is not shown in this post
pipe = FluxPipeline.from_pretrained(bfl_repo, tokenizer=None, text_encoder=None).to("cuda")

pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model.to("cuda")  # Replace with the CLIP text encoder

And if you use openai/clip-vit-large-patch14 instead, it works?

Thanks for the reminder. It doesn't work with that one either.
I'll check what's wrong.

I found a workaround: in modeling_clip.py of Transformers, add one line to CLIPTextTransformer.__init__:

class CLIPTextTransformer(nn.Module):
    def __init__(self, config: CLIPTextConfig):
        super().__init__()
        self.dtype = torch.bfloat16  # add this line
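
A possibly less invasive alternative to editing the installed library, offered as an untested assumption rather than a confirmed fix: both errors in this thread complain about CLIPTextTransformer, which is the inner module reached via `.text_model`. The CLIPTextModel wrapper, by contrast, is a PreTrainedModel and exposes `.dtype`. The sketch below uses a tiny, randomly initialised config (made-up sizes) so it runs without downloading any weights.

```python
from transformers import CLIPTextConfig, CLIPTextModel, PreTrainedModel

# Tiny config with made-up sizes, only so the sketch runs without a download.
config = CLIPTextConfig(
    vocab_size=100,
    hidden_size=32,
    intermediate_size=64,
    projection_dim=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    max_position_embeddings=16,
)
text_encoder = CLIPTextModel(config)

# The wrapper satisfies the type check from the earlier traceback...
is_pretrained = isinstance(text_encoder, PreTrainedModel)
# ...and has the attribute the pipeline failed to find on the inner module.
encoder_dtype = text_encoder.dtype
print(is_pretrained, encoder_dtype)
```

With real weights, the equivalent would presumably be `CLIPTextModel.from_pretrained(model_id)` passed as `text_encoder=` to `FluxPipeline.from_pretrained`, instead of assigning `clip_model.text_model` afterwards.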
