The first open model with a Stable Diffusion 3-like architecture is JUST out, but it is not SD3!
It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B-parameter DiT (diffusion transformer) text-to-image model, trained with multilingual CLIP + multilingual T5 text encoders for English and Chinese understanding.
- 3 text encoders: 2 CLIPs and one T5-XXL; plug-and-play: removing the larger one maintains competitiveness
- Dataset was deduplicated with SSCD, which helped reduce memorization (no more details about the dataset, though)
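To make the deduplication idea concrete, here is a minimal sketch of near-duplicate filtering on image descriptors. It assumes each image has already been turned into an embedding vector (SSCD is a copy-detection descriptor, but the vectors below are toy values, not real SSCD outputs). The greedy strategy, the `deduplicate` helper, and the 0.9 threshold are illustrative assumptions, not the procedure from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.9):
    """Greedy filter: keep an item only if its similarity to every
    already-kept item is below `threshold` (hypothetical value)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy descriptors standing in for per-image embeddings (assumed data).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

print(deduplicate(embeddings))  # indices of the images that survive: [0, 2, 3]
```

This pairwise loop is quadratic in the number of images, so a real pipeline over a large dataset would likely use an approximate nearest-neighbor index instead.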
Variants
- A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
- An Instruct Edit 2B model was trained and learned how to do text replacement
Results
- State of the art in automated evals for composition and prompt understanding
- Best win rate in human preference evaluation for prompt understanding, aesthetics, and typography (some details are missing on the number of participants and the design of the experiment)
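As a rough illustration of why the number of judgments matters for a human preference win rate, the sketch below computes a win rate and a normal-approximation standard error from a list of outcomes. The outcome labels, the sample data, and the choice to count ties as non-wins are all assumptions made for the example; none of it comes from the paper.

```python
import math
from collections import Counter


def win_rate(outcomes):
    """Fraction of comparisons labeled "win"; ties and losses count as non-wins."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    return counts["win"] / len(outcomes)


def standard_error(outcomes):
    """Normal-approximation standard error of the win rate."""
    p = win_rate(outcomes)
    return math.sqrt(p * (1 - p) / len(outcomes))


# Hypothetical judgments for one model against a baseline.
outcomes = ["win", "win", "loss", "tie", "win"]

print(win_rate(outcomes))        # 0.6
print(standard_error(outcomes))  # large, because there are only 5 judgments
```

With only a handful of judgments the standard error is wide, which is why the participant count and experiment design affect how much weight a reported win rate can carry.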