moegi161's Collections
Vision and language
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Paper • 2404.04125 • Published • 27
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Paper • 2404.03653 • Published • 33
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
Paper • 2404.02747 • Published • 11
3D Congealing: 3D-Aware Image Alignment in the Wild
Paper • 2404.02125 • Published • 7
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
Paper • 2404.04544 • Published • 20
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Paper • 2404.07987 • Published • 47
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 18
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
Paper • 2404.07199 • Published • 25
Learning to Route Among Specialized Experts for Zero-Shot Generalization
Paper • 2402.05859 • Published • 5
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Paper • 2403.00587 • Published
ReGround: Improving Textual and Spatial Grounding at No Cost
Paper • 2403.13589 • Published
FlexCap: Generating Rich, Localized, and Flexible Captions in Images
Paper • 2403.12026 • Published • 1
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper • 2403.03206 • Published • 60
Editable Image Elements for Controllable Synthesis
Paper • 2404.16029 • Published • 10
Move Anything with Layered Scene Diffusion
Paper • 2404.07178 • Published
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
Paper • 2405.21048 • Published • 13