SD1 Style Components (experimental)
Style control for Stable Diffusion 1.x anime models
What is this?
It is IP-Adapter, but for (anime) styles. Instead of CLIP image embeddings, image generation is conditioned on 30-dimensional style embeddings, which can either be extracted from one or more images or created manually.
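As a rough illustration of "extracted or created manually", the sketch below builds 30-d style vectors with NumPy. The helper names, the averaging of several reference images, and the linear blend are assumptions made for illustration, not the project's actual interface; the random array only stands in for embeddings that the style model would produce.

```python
import numpy as np

STYLE_DIM = 30  # dimensionality stated in this README


def manual_style(components):
    """Hypothetical helper: set chosen components by index, leave the rest at 0."""
    style = np.zeros(STYLE_DIM, dtype=np.float32)
    for index, value in components.items():
        style[index] = value
    return style


def average_style(per_image):
    """Hypothetical helper: combine several reference images by averaging (an assumption)."""
    per_image = np.asarray(per_image, dtype=np.float32).reshape(-1, STYLE_DIM)
    return per_image.mean(axis=0)


def blend(a, b, t):
    """Linear interpolation between two style vectors, with t in [0, 1]."""
    return (1.0 - t) * a + t * b


rng = np.random.default_rng(0)
# Placeholder for embeddings extracted from three reference images.
reference_embeddings = rng.normal(size=(3, STYLE_DIM))

extracted = average_style(reference_embeddings)
handmade = manual_style({0: 1.5, 4: -2.0})
mixed = blend(extracted, handmade, 0.5)
```

Because each style is just a vector, scaling a single component or interpolating between two styles is ordinary arithmetic.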
Why?
Currently, the main means of style control is through artist tags. This method reasonably raises the concern of style plagiarism. By breaking styles down into interpretable components that are shared across all artists, direct copying of an individual artist's style can be avoided. Furthermore, new styles can be created easily by adjusting the magnitude of the style components, which offers more control than stacking artist tags or LoRAs.
Additionally, this could be useful for general-purpose training, as training with a style condition may reduce style leakage into concepts. It also serves as a demonstration that image models can be conditioned on arbitrary tensors other than text or images. Hopefully, it helps show that conditions that are inherently numerical (aesthetic scores, dates, ...) do not need to be forced into text-form tags.
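To make the idea of conditioning on a numerical tensor concrete, here is a minimal NumPy sketch of one common approach: a learned linear projection maps the numeric condition to a few extra context tokens that cross-attention layers could attend to. The token count, the random weights, and the example condition values are all hypothetical; only the 768-wide context dimension comes from SD 1.x. This is not a description of the adapter used here.

```python
import numpy as np

CROSS_ATTENTION_DIM = 768  # context width of SD 1.x cross-attention
NUM_TOKENS = 4             # hypothetical number of extra context tokens


def to_context_tokens(condition, weight, bias):
    """Project a 1-D numeric condition to (NUM_TOKENS, CROSS_ATTENTION_DIM)."""
    flat = condition @ weight + bias
    return flat.reshape(NUM_TOKENS, CROSS_ATTENTION_DIM)


rng = np.random.default_rng(0)

# Hypothetical numeric condition: a normalized aesthetic score and a normalized date.
condition = np.array([0.65, 0.21], dtype=np.float32)

# Random stand-ins for weights that would be learned during training.
weight = rng.normal(scale=0.02, size=(condition.size, NUM_TOKENS * CROSS_ATTENTION_DIM))
bias = np.zeros(NUM_TOKENS * CROSS_ATTENTION_DIM)

tokens = to_context_tokens(condition, weight, bias)
```

The values stay numeric end to end, so nearby scores or dates map to nearby tokens instead of unrelated text tags.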
How do I use it?
Currently, a Colab notebook with a Gradio interface is available. As this is only an experimental preview, proper support for popular web UIs will not be added before the models reach a more stable state.
Technical details
First, a style embedding model is trained with Supervised Contrastive Learning on an artists dataset. Then, the first 30 principal components are extracted from the learned embeddings with PCA. Finally, a modified IP-Adapter is trained on anime-final-pruned using the same dataset, with WD1.4 tags and the projected 30-d embeddings as conditions. The training resolution is 576×576 with variable aspect ratios.
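The PCA step can be sketched as follows. This is a minimal NumPy version that assumes plain PCA via SVD on mean-centered embeddings; the 256-d input size and the random data are placeholders, since the real embedding size and any whitening or scaling are not stated here.

```python
import numpy as np


def fit_pca(embeddings, n_components=30):
    """Return the mean and the top principal directions (rows) of the embeddings."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(embeddings, mean, components):
    """Map embeddings to their coordinates along the kept principal directions."""
    return (embeddings - mean) @ components.T


rng = np.random.default_rng(0)
# Placeholder for the learned style embeddings (1000 images, hypothetical 256-d).
learned = rng.normal(size=(1000, 256))

mean, components = fit_pca(learned, n_components=30)
style = project(learned, mean, components)  # shape (1000, 30)
```

Each column of `style` is one style component, ordered so that earlier components explain more of the variance across the dataset.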
Acknowledgements
This is largely inspired by Inserting Anybody in Diffusion Models via Celeb Basis and IP-Adapter. Training and inference code is modified from IP-Adapter (license).