---
datasets:
  - ILSVRC/imagenet-1k
license: mit
language:
  - en
base_model:
  - xuantonglll/ELM
---

This is the model release for the paper

Elucidating the design space of language models for image generation

See the paper on arXiv and the code on GitHub.

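We provide four Binary Autoencoder (BAE) tokenizers, following Binary Latent Diffusion, with code dimensions 16, 20, and 24. The 16-dimensional tokenizer comes in two variants, with and without Bernoulli sampling. Each tokenizer was trained for 1,000,000 iterations with a batch size of 256.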

| Code Dim | Bernoulli Sampling | Link | Size |
|---|---|---|---|
| 16 | ✅ | link | 332MB |
| 16 | ❌ | link | 332MB |
| 20 | ✅ | link | 332MB |
| 24 | ✅ | link | 332MB |
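
To make the Bernoulli Sampling column concrete, here is a minimal sketch of the two ways a binary tokenizer can turn real-valued encoder outputs into bits: drawing each bit from a Bernoulli distribution, or thresholding at 0.5. This is an illustration only, not the released tokenizer code; the function names `binarize` and `bits_to_index` and the example logits are assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binarize(logits, bernoulli=True, rng=None):
    """Turn real-valued encoder outputs into a list of 0/1 bits.

    Hypothetical helper: with bernoulli=True each bit is sampled with
    probability sigmoid(logit); otherwise the probability is thresholded.
    """
    rng = rng or random.Random()
    bits = []
    for x in logits:
        p = sigmoid(x)  # probability that this bit is 1
        if bernoulli:
            bits.append(1 if rng.random() < p else 0)  # stochastic bit
        else:
            bits.append(1 if p > 0.5 else 0)  # deterministic bit
    return bits


def bits_to_index(bits):
    """Read a bit list as a big-endian integer, i.e. a discrete token id."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


logits = [2.0, -1.5, 0.3, -4.0]  # made-up values for a 4-bit code
hard_bits = binarize(logits, bernoulli=False)
print(hard_bits, bits_to_index(hard_bits))  # [1, 0, 1, 0] 10
```

With `bernoulli=True` the same logits can produce different bits from call to call, which is why a seeded `random.Random` is useful when reproducibility matters.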

The generation model architecture is adapted from Llama 2, following LlamaGen.

| Model | Link | Size |
|---|---|---|
| AR-L | [1-16] [2-8] [2-10] [2-12] | 1.25GB~1.77GB |
| AR-XL | [1-16] [2-8] [2-10] [2-12] | 2.95GB~3.6GB |
| AR-XXL | [1-16] [2-10] [2-12] | 5.49GB~6.25GB |
| AR-2B | [2-12] | 7.64GB |
| MLM-L | [1-16] | 1.51GB |
| MLM-XL | [1-16] | 3.27GB |
| MLM-XXL | [1-16] | 5.86GB |
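
The tags in the Link column are not explained on this page. The sketch below assumes that a tag `[k-b]` means the binary code is split into `k` sub-codes of `b` bits each, so the total code dimension is `k * b` and each sub-code indexes a vocabulary of `2**b` entries. Under that assumption the tags map onto the tokenizer code dimensions listed earlier (16, 20, and 24). The helper `parse_variant` is hypothetical and only illustrates this reading.

```python
import re

# Variant tags copied from the table above.
VARIANTS = {
    "AR-L": ["1-16", "2-8", "2-10", "2-12"],
    "AR-XL": ["1-16", "2-8", "2-10", "2-12"],
    "AR-XXL": ["1-16", "2-10", "2-12"],
    "AR-2B": ["2-12"],
    "MLM-L": ["1-16"],
    "MLM-XL": ["1-16"],
    "MLM-XXL": ["1-16"],
}


def parse_variant(tag):
    """Parse a tag such as "[2-12]" or "2-12".

    Assumption (not confirmed by this page): the first number is the
    number of sub-codes and the second is the number of bits per sub-code.
    """
    match = re.fullmatch(r"\[?(\d+)-(\d+)\]?", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized variant tag: {tag!r}")
    num_subcodes = int(match.group(1))
    bits_per_subcode = int(match.group(2))
    return {
        "num_subcodes": num_subcodes,
        "bits_per_subcode": bits_per_subcode,
        "code_dim": num_subcodes * bits_per_subcode,  # matching tokenizer size
        "vocab_size": 2 ** bits_per_subcode,  # entries per sub-code
    }


for model, tags in VARIANTS.items():
    for tag in tags:
        info = parse_variant(tag)
        print(model, tag, info["code_dim"], info["vocab_size"])
```

Read this way, `[2-12]` would pair with the 24-dimensional tokenizer and a per-sub-code vocabulary of 4096, while `[1-16]` would pair with a 16-dimensional tokenizer and a single vocabulary of 65536.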