Abstract
We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
Community
Looks really interesting!
Great work!
Incredible!
What.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning (2024)
- Compositional Generative Modeling: A Single Model is Not All You Need (2024)
- Collaboratively Self-supervised Video Representation Learning for Action Recognition (2024)
- An Interactive Agent Foundation Model (2024)
- Generative Human Motion Stylization in Latent Space (2024)
Can this model be "Lego-ed", i.e. broken apart and recombined to do other things? Especially since the latent action model (LAM) already seems so capable with a bit of additional fine-tuning (FT).
For example, I'd imagine you could predict the next optimal action (assuming the dynamics model is trained to play the game well) just by reconfiguring things a bit, perhaps without any additional training or FT:
Send a sequence of video frame tokens $(z_1, \dots, z_{t-1})$ and actions $(a_1, \dots)$ into the dynamics model to get the next video frame token $z_t$, then use the LAM encoder to annotate the transition and output the next action to take (assuming the LAM encoder can autoregressively generate the inputs).
You can run this autoregressively to "hallucinate" the gameplay, or use it as a frame-by-frame agent to play the game.
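To make the proposed loop concrete, here is a minimal Python sketch of it. Everything in it is an assumption for illustration: `toy_dynamics` and `toy_lam_encoder` are hypothetical stand-ins, not Genie's models or API, each frame is reduced to a single integer token, and `VOCAB_SIZE` and `NUM_ACTIONS` are toy values. It also assumes the dynamics stand-in can predict $z_t$ while the action for the latest transition is still unknown, which is exactly the open question in the comment above.

```python
from typing import List, Sequence, Tuple

VOCAB_SIZE = 16   # toy frame-token vocabulary (assumption of this sketch)
NUM_ACTIONS = 8   # toy latent-action vocabulary (assumption of this sketch)


def toy_dynamics(frames: Sequence[int], actions: Sequence[int]) -> int:
    """Hypothetical stand-in for the dynamics model.

    Predicts the next frame token from the history. The action for the
    latest transition is not known yet, so only earlier actions are used.
    """
    last_known_action = actions[-1] if actions else 0
    return (frames[-1] + 1 + last_known_action) % VOCAB_SIZE


def toy_lam_encoder(prev_frame: int, next_frame: int) -> int:
    """Hypothetical stand-in for the LAM encoder.

    Labels the transition between two consecutive frames with a
    discrete latent action.
    """
    return (next_frame - prev_frame) % NUM_ACTIONS


def rollout(
    frames: Sequence[int], actions: Sequence[int], steps: int
) -> Tuple[List[int], List[int]]:
    """Alternate the two models: predict a frame, then annotate the action."""
    frames, actions = list(frames), list(actions)
    for _ in range(steps):
        z_next = toy_dynamics(frames, actions)       # "hallucinate" the next frame
        a_next = toy_lam_encoder(frames[-1], z_next)  # infer the action that led to it
        frames.append(z_next)
        actions.append(a_next)
    return frames, actions


if __name__ == "__main__":
    print(rollout(frames=[3], actions=[], steps=3))
```

Running the loop for many steps corresponds to "hallucinating" gameplay, while calling it with `steps=1` on real observed frames corresponds to the frame-by-frame agent use. Whether the inferred latent actions would be good actions depends entirely on the dynamics model, which this toy version does not capture.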
Genie: Creating Interactive Worlds from Unlabeled Videos
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/