---
datasets:
- SPRIGHT-T2I/spright_coco
base_model: BeichenZhang/LongCLIP-L
---
## A fine-tune of Long-CLIP - original model: [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L)
- ❤️ this CLIP? [Help feed it](https://ko-fi.com/zer0int) if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
- Want to feed it yourself? All code for fine-tuning and much more is on [my GitHub](https://github.com/zer0int).
----
- # Note for using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion: 
- Get the ComfyUI Long-CLIP nodes here: [https://github.com/SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP)
- If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗
----
# 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀

```
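from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# Loading with the default config fails (see the error note below)
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)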
```
# ❌ Error: the model expects 248 tokens, but the Transformers library defines 77

# 👇
# Option 1 (simple & worse):
Truncate to 77 tokens
`CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`

```
# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉
```
# 👇
# Option 2, proper integration: 💖 RECOMMENDED 💖

- ### Solution for implementation of 248 tokens / thanks [@kk3dmax](https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/3) 🤗
- Obtain a full example script using this solution for Flux.1 inference on [my GitHub](https://github.com/zer0int/CLIP-txt2img-diffusers-scripts)

```
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

dtype = torch.bfloat16
model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

# Override the default 77-token limit before loading the weights
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# `pipe` is assumed to be an already-loaded diffusers pipeline (e.g. Flux.1)
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16
```

```
# Resulting Cosine Similarities for 248 tokens padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
```
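
For readers unfamiliar with the metric, here is a minimal sketch of how such scores are computed and ranked. It uses plain Python and made-up 3-dimensional vectors rather than real CLIP embeddings; the helper names `cosine_similarity` and `rank_labels` are hypothetical and not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_labels(image_embedding, text_embeddings):
    """Return (label, score) pairs sorted from best to worst match."""
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy embeddings, invented for illustration (NOT model outputs).
image = [0.9, 0.1, 0.2]
texts = {
    "photo of a cat": [0.8, 0.2, 0.1],
    "picture of a dog": [0.1, 0.9, 0.3],
}

ranking = rank_labels(image, texts)
for label, score in ranking:
    print(f"{label}: {score:.4f}")
```

A higher score for the ground-truth caption, and a larger gap to the wrong captions, is what the 248-token numbers above improve on compared to the 77-token ones.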

----
## Update 12/AUG/2024:
New *BEST* model, trained with a custom loss that uses label smoothing.
The gain is small for a large, diverse, good-quality dataset, but big relative gains are possible for an overfit-prone fine-tune (small batch size, 1 GPU, a narrow dataset of e.g. 'sneakers', etc.)!
Fine-tune your model with the provided code for GmP-Smooth: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
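
The actual custom loss lives in the linked repository. As a rough, generic illustration of what label smoothing does to a classification-style target (not the GmP-Smooth implementation itself), here is a small sketch in plain Python. The function names, the logits, and the smoothing value of 0.1 are all assumptions made for this example.

```python
import math

def smoothed_targets(num_classes, true_index, smoothing=0.1):
    """Spread `smoothing` probability mass uniformly over all classes."""
    base = smoothing / num_classes
    targets = [base] * num_classes
    targets[true_index] += 1.0 - smoothing
    return targets

def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# Hypothetical similarity logits for one image against four captions;
# index 0 is the matching caption.
logits = [4.0, 1.0, 0.5, 0.2]

hard = cross_entropy(logits, smoothed_targets(4, 0, smoothing=0.0))
soft = cross_entropy(logits, smoothed_targets(4, 0, smoothing=0.1))
print(f"hard targets: {hard:.4f}  smoothed targets: {soft:.4f}")
```

With smoothed targets, a confident correct prediction still incurs some loss, which discourages the model from pushing logits to extremes. That is one plausible reason for the larger gains on small, narrow datasets.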


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/l3FYkaicihqXv5D9wLDAF.png)

----

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)\*\*.


Made possible with Geometric Parametrization (GmP):

```

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)

```
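
The decomposition above, and the conversion back to a plain `.weight`, can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a per-row (per-output-neuron) L2 norm and non-zero rows; the function names `decompose` and `recompose` are hypothetical, and the fine-tuning code in the linked repository is the reference for how it is actually done.

```python
import math

def decompose(weight):
    """Split each row of a weight matrix into a norm r and a unit direction theta."""
    r, theta = [], []
    for row in weight:
        norm = math.sqrt(sum(w * w for w in row))  # assumes a non-zero row
        r.append(norm)
        theta.append([w / norm for w in row])
    return r, theta

def recompose(r, theta):
    """Rebuild the plain weight matrix: normalise theta, then scale by r."""
    weight = []
    for radius, direction in zip(r, theta):
        norm = math.sqrt(sum(d * d for d in direction))
        weight.append([radius * d / norm for d in direction])
    return weight

# Toy 2x2 "pre-trained" weight matrix, invented for illustration.
W = [[3.0, 4.0], [0.0, 2.0]]

r, theta = decompose(W)
W_back = recompose(r, theta)
print("r     =", r)
print("theta =", theta)
print("W'    =", W_back)
```

During fine-tuning, `r` and `theta` would be trained as separate parameters; recomposing them afterwards yields an ordinary weight matrix that any standard CLIP loader can consume.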

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/OqhNxW-D9c58mkZyUQlL_.png)

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder using [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗

** For details on training and those numbers / the eval, or for just fine-tuning the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)

```
@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)