VSehwag24 committed · verified
Commit c531426 · Parent(s): 17eed10

Update README.md

Files changed (1):
README.md (+4 −1)
README.md CHANGED
@@ -24,7 +24,10 @@ We provide checkpoints of four pre-trained models. The table below provides desc
 | MicroDiT_XL_2 trained on 37M images (22M real, 15 synthetic) | Ostris-VAE (16 channel) | 13.04 | 0.40 | dit_16_channel_37M_real_and_synthetic_data.pt |
 | MicroDiT_XL_2 trained on 490M synthetic images | SDXL-VAE (4 channel) | 13.26 | **0.52** | dit_4_channel_0.5B_synthetic_data.pt |
 
-All four models are trained with nearly identical training configurations and computational budgets. The models are released under Apache 2.0 License.
+
+**Training pipeline**: All four models are trained with nearly identical training configurations and computational budgets. We progressively train each model from low resolution to high resolution. We first train the model on 256×256 resolution images for 280K steps and then fine-tune the model for 55K steps on 512×512 resolution images. The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days. Our MicroDiT models by default use a patch-mixer before the backbone transformer architecture. Using the patch-mixer significantly reduces performance degradation with masking while providing a large reduction in training time. We mask 75% of the patches after the patch mixer across both resolutions. After training with masking, we perform a follow-up fine-tuning with a mask ratio of 0 to slightly improve performance.
+
+The models are released under Apache 2.0 License.
 
 ## BibTeX
 ```bibtext
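
The training-pipeline paragraph added in this commit describes a progressive-resolution, masked-training schedule: 256×256 pre-training for 280K steps, 512×512 fine-tuning for 55K steps, 75% patch masking after the patch-mixer at both resolutions, then a short unmasked fine-tune. The snippet below is a minimal sketch of that schedule; the names (`TrainingStage`, `SCHEDULE`, `run_stage`) are hypothetical and not from the MicroDiT repository, and the step count of the unmasked fine-tuning phase is not stated in the README.

```python
# Illustrative sketch (not MicroDiT code) of the progressive-resolution,
# masked-training schedule described in the README paragraph added above.
from dataclasses import dataclass


@dataclass
class TrainingStage:
    resolution: int     # square image resolution for this stage
    steps: int | None   # optimizer steps; None = not specified in the README
    mask_ratio: float   # fraction of patches dropped after the patch-mixer


SCHEDULE = [
    # Low-resolution pre-training with 75% of patches masked after the patch-mixer.
    TrainingStage(resolution=256, steps=280_000, mask_ratio=0.75),
    # High-resolution fine-tuning with the same 75% masking.
    TrainingStage(resolution=512, steps=55_000, mask_ratio=0.75),
    # Follow-up fine-tuning without masking to slightly improve performance;
    # the README does not give a step count for this phase.
    TrainingStage(resolution=512, steps=None, mask_ratio=0.0),
]


def run_stage(stage: TrainingStage) -> None:
    # Placeholder: a real run would build a dataloader at stage.resolution and
    # train the patch-mixer + DiT backbone for stage.steps optimizer steps.
    kept = 1.0 - stage.mask_ratio
    steps = "unspecified" if stage.steps is None else f"{stage.steps:,}"
    print(f"{steps} steps @ {stage.resolution}x{stage.resolution}, "
          f"keeping {kept:.0%} of patches after the patch-mixer")


if __name__ == "__main__":
    for stage in SCHEDULE:
        run_stage(stage)
```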