VSehwag24 committed
Commit eaa27f5 · verified · 1 Parent(s): ac17a4e

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -24,7 +24,7 @@ We provide checkpoints of four pre-trained models. The table below provides desc
  | MicroDiT_XL_2 trained on 37M images (22M real, 15 synthetic) | Ostris-VAE (16 channel) | 13.04 | 0.40 | dit_16_channel_37M_real_and_synthetic_data.pt |
  | MicroDiT_XL_2 trained on 490M synthetic images | SDXL-VAE (4 channel) | 13.26 | **0.52** | dit_4_channel_0.5B_synthetic_data.pt |

- **Image generation** These checkpoints can be used with the official micro_diffusion codebase for image generation. First install the micro_diffusion as a python package `pip install git+https://github.com/SonyResearch/micro_diffusion.git
+ **Image generation:** These checkpoints can be used with the official micro_diffusion codebase for image generation. First install the micro_diffusion code as a python package `pip install git+https://github.com/SonyResearch/micro_diffusion.git

  Next use the following straightforward steps to generate images from the final model at 512×512 resolution.
  ```
@@ -35,7 +35,7 @@ gen_images = model.generate(prompt=['An elegant squirrel pirate on a ship']*4, n
  guidance_scale=5.0, seed=2024)
  ```

- **Training pipeline**: All four models are trained with nearly identical training configurations and computational budgets. We progressively train each model from low resolution to high resolution. We first train the model on 256×256 resolution images for 280K steps and then fine-tune the model for 55K steps on 512×512 resolution images. The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days. Our MicroDiT models by default use a patch-mixer before the backbone transformer architecture. Using the patch-mixer significantly reduces performance degradation with masking while providing a large reduction in training time. We mask 75% of the patches after the patch mixer across both resolutions. After training with masking, we perform a follow-up fine-tuning with a mask ratio of 0 to slightly improve performance.
+ **Training pipeline:** All four models are trained with nearly identical training configurations and computational budgets. We progressively train each model from low resolution to high resolution. We first train the model on 256×256 resolution images for 280K steps and then fine-tune the model for 55K steps on 512×512 resolution images. The estimated training time for the end-to-end model on an 8×H100 machine is 2.6 days. Our MicroDiT models by default use a patch-mixer before the backbone transformer architecture. Using the patch-mixer significantly reduces performance degradation with masking while providing a large reduction in training time. We mask 75% of the patches after the patch mixer across both resolutions. After training with masking, we perform a follow-up fine-tuning with a mask ratio of 0 to slightly improve performance.

  The models are released under Apache 2.0 License.
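
Note that the diff hunks above truncate the README's generation example: only the `gen_images = model.generate(...)` call and its `guidance_scale=5.0, seed=2024` arguments are visible. A minimal sketch of what a complete snippet could look like follows; the `create_latent_diffusion` import path and arguments, `num_inference_steps=30`, and the checkpoint filename chosen here are assumptions, not taken from this diff.

```
# Minimal sketch, not the verbatim README example. The import path, the
# create_latent_diffusion arguments, num_inference_steps=30, and the checkpoint
# filename are assumptions; only model.generate(...), guidance_scale=5.0, and
# seed=2024 appear in the diff above.
import torch
from micro_diffusion.models.model import create_latent_diffusion  # assumed import path

model = create_latent_diffusion(latent_res=64, in_channels=4).to('cuda')  # assumed loader and args (512/8 = 64 latent res)
model.dit.load_state_dict(torch.load('dit_4_channel_0.5B_synthetic_data.pt'))  # checkpoint name from the table above
gen_images = model.generate(prompt=['An elegant squirrel pirate on a ship']*4, num_inference_steps=30,
                            guidance_scale=5.0, seed=2024)
```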
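
The training-pipeline paragraph in the diff describes masking 75% of patch tokens after a lightweight patch-mixer, so the expensive backbone transformer only processes the remaining 25%. The sketch below is a generic illustration of that idea, not the authors' implementation; the use of `nn.TransformerEncoderLayer` as a stand-in mixer/backbone and all layer sizes are assumptions.

```
# Conceptual sketch of patch masking after a patch-mixer (illustration only).
import torch
import torch.nn as nn

batch, num_patches, dim, mask_ratio = 8, 1024, 768, 0.75  # assumed sizes

# Stand-ins for the real modules: a small mixer and a backbone transformer.
patch_mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=4
)

tokens = torch.randn(batch, num_patches, dim)  # patchified latent image
mixed = patch_mixer(tokens)                    # the patch-mixer sees all patches

# Keep a random 25% of patches per example and drop the rest before the
# backbone, so the backbone's cost scales with the unmasked tokens only.
keep = int(num_patches * (1 - mask_ratio))
idx = torch.argsort(torch.rand(batch, num_patches), dim=1)[:, :keep]
visible = torch.gather(mixed, 1, idx.unsqueeze(-1).repeat(1, 1, dim))

out = backbone(visible)
print(out.shape)  # torch.Size([8, 256, 768])
```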