Announcing NVIDIA Cosmos World Foundation Models
TL;DR
We introduce NVIDIA Cosmos™ world foundation models (WFMs), a family of pre-trained models purpose-built for generating physics-aware videos and world states to advance physical AI development. The family includes autoregressive and diffusion models designed for Text-to-World and Video-to-World generation, providing an excellent foundation for developers to build world models for robotics, autonomous vehicles, and machines. Key use cases include:
- Policy model development and evaluation
- Predictive foresight modeling
- Integration with the NVIDIA Omniverse platform for multiverse simulation
To ensure safe use of these models, we present Cosmos guardrails, a state-of-the-art system with pre- and post-generation guards that maintain prompt integrity and output consistency.
We benchmark Cosmos WFMs for 3D consistency and physical alignment. Cosmos models consistently outperform baseline video synthesis models in these evaluations. We are also making Cosmos benchmarks openly available to advance and evaluate future world foundation models.
These open models are part of the Cosmos platform, which also includes data curation tools, tokenizers, and frameworks that enable faster and efficient fine-tuning of Cosmos world foundation models.
Quick Links
- Cosmos World Foundation Models Hugging Face Collection
- Cosmos Tokenizers on Hugging Face
- NeMo Framework GitHub repo
Developing Physical AI with Cosmos World Foundation Models (WFMs)
The Cosmos world foundation model family includes pre-trained, purpose-built models for generating physics-aware videos and world states from text, image, or video input – advancing physical AI development. These models are generalists that capture broad knowledge of real-world physics and natural behaviors.
We use two scalable deep learning paradigms, sketched below:
- Diffusion models: Break the generation problem into a sequence of denoising tasks.
- Autoregressive models: Solve the problem as a sequence of next-token prediction tasks.
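In rough pseudocode, the two training objectives differ as follows. This is a minimal sketch: `model`, the noise schedule `sigma`, and all shapes are illustrative placeholders, not the Cosmos implementation.

```python
import torch
import torch.nn.functional as F

sigma = lambda t: t  # placeholder noise schedule (an assumption, not Cosmos's)

def diffusion_loss(model, clean_latents, t):
    """Denoising objective: recover the noise mixed into the latents at level t."""
    noise = torch.randn_like(clean_latents)
    noisy = clean_latents + sigma(t) * noise   # forward (noising) process
    return F.mse_loss(model(noisy, t), noise)  # network predicts the noise

def autoregressive_loss(model, token_ids):
    """Next-token objective: predict token i+1 from tokens up to i."""
    logits = model(token_ids[:, :-1])          # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
```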
Cosmos world foundation models are trained on:
- 20 million hours of video data (equivalent to 9,000 trillion tokens)
- Using 10,000 NVIDIA H100 GPUs over three months
This data includes hand motions, object manipulation, spatial awareness, navigation, and camera movements. Pre-training curation ensures the models are optimized for advancing robotics, autonomous vehicles, and other physical AI systems.
Table 1: Map of the Cosmos world foundation model 1.0 release. We have two sets of WFMs: one based on diffusion models and the other on autoregressive models. For each family, we build two base models and two derivative models. To achieve the best generation quality, we also build a prompt upsampler for the diffusion models and a diffusion decoder for the autoregressive models.

| Family | Base models | Derivative models | Auxiliary model |
| --- | --- | --- | --- |
| Diffusion | Cosmos-1.0-Diffusion-7B-Text2World, Cosmos-1.0-Diffusion-14B-Text2World | Cosmos-1.0-Diffusion-7B-Video2World, Cosmos-1.0-Diffusion-14B-Video2World | Prompt upsampler |
| Autoregressive | Cosmos-1.0-Autoregressive-4B, Cosmos-1.0-Autoregressive-12B | Cosmos-1.0-Autoregressive-5B-Video2World, Cosmos-1.0-Autoregressive-13B-Video2World | Diffusion decoder |
Autoregressive Models
Cosmos autoregressive models predict future video frames with higher precision and speed, using input text, images, and past video frames for context.
The architecture is tailored for physical AI use cases, enhancing control in generation, reducing training loss, and minimizing visual artifacts with the help of positional embeddings. Additional cross-attention layers improve text comprehension, while normalization techniques add stability – all ensuring consistent, faster, and physics-aware realistic outputs.
Figure 1: Overall architecture of the Cosmos-1.0-Autoregressive world foundation model. The pipeline begins by encoding input video through the encoder of Cosmos-1.0-Tokenizer-DV8x16x16 to generate discrete tokens, which are transformed into learned embeddings. These embeddings are processed through repeated transformer blocks, each consisting of absolute positional embedding and 3D RoPE components that are flattened before entering the self-attention module. Each block also includes a cross-attention module that incorporates encoded text prompts (processed via a T5 text encoder), followed by a two-layer MLP. Finally, the decoder of Cosmos-1.0-Tokenizer-DV8x16x16 reconstructs the video from the output tokens.
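A minimal PyTorch sketch of one such transformer block, with the positional embeddings (absolute plus 3D RoPE) abstracted away; all dimensions and module choices here are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ARBlock(nn.Module):
    """One Figure-1-style block: self-attention over flattened video tokens,
    cross-attention into T5 text embeddings, then a two-layer MLP."""

    def __init__(self, dim: int, n_heads: int, text_dim: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, causal_mask=None):
        # x: (batch, T*H*W, dim) flattened video tokens (RoPE already applied).
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        # Cross-attention pulls in the T5-encoded text prompt.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb)[0]
        # Two-layer MLP.
        return x + self.mlp(self.norm3(x))
```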
Pre-training of Cosmos Autoregressive WFMs
The pre-training of Cosmos autoregressive world foundation models follows a structured multi-stage approach to ensure robust performance in video prediction and text-conditioned generation tasks.
- Stage 1: The model begins with a video prediction objective, trained to predict 16 future frames using the first frame as the input condition. A context length of 17 frames is used for this stage.
- Stage 1.1: The context length is increased to 34 frames using the YaRN extension for temporal RoPE, enabling the model to capture longer video dependencies.
- Stage 2: Text conditioning is introduced, incorporating text embeddings through newly added cross-attention layers. The model is trained with a 34-frame context, leveraging both image and video data for joint training. For image batches, larger batch sizes are used due to their smaller context lengths.

Model Variants in this release:
- Cosmos-1.0-Autoregressive-4B: A 4B transformer model trained on Stage 1 and Stage 1.1 objectives for next video token prediction.
- Cosmos-1.0-Autoregressive-5B-Video2World: Derived from the 4B model, it incorporates cross-attention layers and is further trained with Stage 2 for text-conditioned video generation.
- Cosmos-1.0-Autoregressive-12B: A larger 12B transformer model trained on Stage 1 and Stage 1.1, designed for advanced next video token prediction.
- Cosmos-1.0-Autoregressive-13B-Video2World: Built from the 12B model, it includes cross-attention layers and undergoes additional Stage 2 training for text-to-video tasks.
Diffusion Models
Our diffusion-based WFMs are latent diffusion models that operate within a learned latent space of a tokenizer, enabling a compact, reduced-dimensional representation of videos. This design choice offers several advantages:
- Reduces computational costs during both training and inference
- Simplifies the denoising task
To tokenize videos into latent representations, we employ Cosmos-1.0-Tokenizer-CV8x8x8.
The training of these diffusion models incorporates several advanced techniques to optimize performance and efficiency:
- 3D patchification breaks video or image data into non-overlapping 3D patches, converting them into token sequences for the network while preserving spatial and temporal relationships.
- FPS-aware 3D Rotary Position Embedding (RoPE) encodes positional information across the temporal, height, and width dimensions, handling varying video sizes, aspect ratios, and frame rates, and enabling seamless adaptation during progressive training.
- Text conditioning is achieved through cross-attention layers, which integrate semantic context from T5-XXL embeddings with visual tokens for effective text-to-video generation.
- Query-key normalization stabilizes training by normalizing attention components with Root Mean Square Normalization (RMSNorm), preventing issues like attention collapse.
- AdaLN-LoRA reduces model parameters by 36% (e.g., from 11B to 7B) through low-rank approximations of the dense layers in adaptive layer normalization, maintaining accuracy while improving efficiency (sketched below).
Together, these innovations streamline training, enhance video generation quality, and enable effective text-based control.
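A minimal sketch of the AdaLN-LoRA idea: the dense projection that maps a timestep embedding to per-block scale/shift/gate parameters is factorized into two thin matrices. The rank and layout here are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class AdaLNLoRA(nn.Module):
    """Adaptive layer norm with a low-rank modulation projection.

    A dense map would need dim * 3*dim weights per block; the low-rank
    factorization needs only dim*rank + rank*3*dim, which is where the
    parameter savings come from.
    """

    def __init__(self, dim: int, rank: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.down = nn.Linear(dim, rank, bias=False)   # dim -> rank
        self.up = nn.Linear(rank, 3 * dim, bias=True)  # rank -> scale|shift|gate

    def forward(self, x, t_emb):
        # t_emb: (batch, dim) timestep embedding; x: (batch, tokens, dim)
        scale, shift, gate = self.up(self.down(t_emb)).unsqueeze(1).chunk(3, dim=-1)
        # gate is returned so the caller can modulate the residual branch.
        return self.norm(x) * (1 + scale) + shift, gate
```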
Figure 2: Overall architecture of Cosmos-1.0-Diffusion world foundation model. The model processes an input video through the encoder of the Cosmos-1.0-Tokenizer-CV8x8x8 to obtain latent representations, which are subsequently perturbed with Gaussian noise. These representations are then transformed using a 3D patchification process. In the latent space, the architecture applies repeated blocks of self-attention, cross-attention (integrating input text), and feed-forward MLP layers, modulated by adaptive layer normalization (scale, shift, gate) for a given time step 𝑡. The decoder of Cosmos-1.0-Tokenizer-CV8x8x8 reconstructs the final video output from the refined latent representation.
Pre-training of Cosmos Diffusion WFMs
The training methodology for Cosmos diffusion WFMs is designed to handle diverse datasets, resolutions, aspect ratios, and conditioning inputs effectively:
- Joint image and video training leverages high-quality image datasets alongside video data, using a domain-specific normalization scheme to align latent distributions and improve generation quality.
- Progressive training begins with low-resolution (512p) videos and transitions to higher resolutions (720p) with increased frame counts, followed by fine-tuning on high-quality subsets.
- Multi-aspect training organizes data into aspect-ratio buckets (e.g., 1:1, 16:9) and uses reflection padding to preserve content details during resizing.
- Mixed-precision training optimizes efficiency by maintaining weights in BF16 for speed and FP32 for stability, minimizing loss spikes.
- Text conditioning integrates T5-XXL embeddings for Text2World models, ensuring strong alignment between prompts and generated visuals.
- For image and video conditioning, previous frames are concatenated with generated ones during training, with added noise to improve robustness and flexibility (sketched below).

The models are trained at a fixed resolution of 640 × 1024. After pre-training, a cooling-down phase is conducted, where the learning rate is linearly reduced to zero over 30,000 iterations while training on high-quality image-video pairs, refining the model for high-fidelity output.
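A minimal sketch of that noisy frame conditioning, assuming latents laid out as (batch, channels, time, height, width); the noise level is an assumed placeholder, not the published value.

```python
import torch

def corrupt_condition_frames(cond_latents: torch.Tensor,
                             sigma_cond: float = 0.1) -> torch.Tensor:
    """Perturb the conditioning (past-frame) latents with small Gaussian
    noise so the model tolerates imperfect context at inference time."""
    return cond_latents + sigma_cond * torch.randn_like(cond_latents)

def build_training_input(cond_latents: torch.Tensor,
                         noisy_future_latents: torch.Tensor) -> torch.Tensor:
    # Concatenate corrupted past frames with the noised future frames
    # along the temporal axis of a (B, C, T, H, W) latent tensor.
    return torch.cat([corrupt_condition_frames(cond_latents),
                      noisy_future_latents], dim=2)
```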
Model Variants in this release:
- Cosmos-1.0-Diffusion-7B-Text2World and Cosmos-1.0-Diffusion-14B-Text2World: Generate a 121-frame video from a text description.
- Cosmos-1.0-Diffusion-7B-Video2World and Cosmos-1.0-Diffusion-14B-Video2World: Generate the next 120 frames from a text description and an initial image frame.
Ensuring Safety with Cosmos Guardrails
We are openly releasing Cosmos guardrails to encourage safe and trustworthy AI in the physical AI developer community. Cosmos guardrails operate in two stages:
- Pre-guard: Scans prompts for unsafe content, using blocklist checks and fine-tuned Aegis AI Content Safety models.
- Post-guard: Evaluates video outputs frame by frame, rejecting unsafe videos. Human faces are blurred for privacy and bias reduction.
Figure 3: Cosmos guardrails include pre-guard for text prompt-based safety and post-guard ensuring safe video outputs.
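To make the two-stage flow concrete, here is a sketch of the control logic. Every component name (`blocklist`, `prompt_classifier`, `frame_classifier`, `blur_faces`) is a hypothetical stand-in for the corresponding Cosmos guardrail component, not its actual API.

```python
from dataclasses import dataclass

@dataclass
class GuardedResult:
    ok: bool
    video: object = None
    reason: str = ""

def generate_with_guardrails(prompt, blocklist, prompt_classifier,
                             wfm, frame_classifier, blur_faces):
    # Pre-guard: reject unsafe prompts before any generation happens.
    if any(term in prompt.lower() for term in blocklist):
        return GuardedResult(ok=False, reason="blocklisted term")
    if not prompt_classifier(prompt):  # e.g. an Aegis-style safety model
        return GuardedResult(ok=False, reason="unsafe prompt")

    video = wfm(prompt)  # world foundation model generation

    # Post-guard: check every frame; any failure rejects the whole video.
    if not all(frame_classifier(frame) for frame in video):
        return GuardedResult(ok=False, reason="unsafe frame")
    return GuardedResult(ok=True, video=blur_faces(video))
```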
How to Use Cosmos World Foundation Models to Develop Downstream Physical AI Models or Custom World Models
Cosmos uses a two-stage training approach to build versatile world models:
- Generalist Models: Trained on diverse datasets of real-world physics and environments, Cosmos world foundation models are generalists that can handle a wide range of scenarios, from natural dynamics to robotic interactions, providing a strong foundation for physical AI tasks.
- Specialist Models: Developers can fine-tune generalist models with smaller, targeted datasets to create specialists tailored for applications like autonomous driving, humanoid robotics, or custom scenarios such as night scenes with emergency vehicles or industrial robotics. Fine-tuning reduces both data and training time compared to training models from scratch.
NVIDIA Cosmos is a world foundation model development platform that streamlines training with efficient video processing, high-performance tokenizers, and advanced frameworks, enabling developers to address complex operational needs quickly and effectively.
Accelerated Data Processing with NVIDIA NeMo Curator
High-quality data is critical for training models but can be time-consuming to prepare. Cosmos integrates NVIDIA NeMo Curator, optimized for NVIDIA GPUs, to process massive datasets efficiently. For example, 20 million hours of video can be processed in just 14 days on NVIDIA Blackwell GPUs compared to 3.4 years on CPU pipelines.
Key Benefits:
- 89x Faster Curation: Reduces processing time dramatically.
- Scalability: Handles datasets exceeding 100 PB.
- High Throughput: Maintains quality with advanced filtering, captioning, and embedding.
High-Fidelity Compression with Cosmos Tokenizer
Cosmos Tokenizer is a suite of visual tokenizers for images and videos that delivers various compression rates while maintaining high reconstruction quality. Cosmos Tokenizer can serve as an effective and efficient building block in both diffusion-based and autoregressive models for image and video generation.
- Autoregressive Models: Achieve 8x time and 16x16 space compression, processing up to 49 frames at once.
- Diffusion Models: Enable 8x time and 8x8 space compression, handling up to 121 frames.
This reduces costs and complexity while preserving visual quality, ensuring models can process large datasets efficiently.
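A sketch of encoding and decoding a video with a released tokenizer, adapted from the Cosmos Tokenizer repository's example; the checkpoint paths, exact model name, and latent shape here are assumptions to verify against the repo.

```python
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer  # Cosmos Tokenizer repo

model_name = "Cosmos-Tokenizer-CV8x8x8"  # continuous video tokenizer, 8x8x8 compression

# (batch, channels, frames, height, width); 17 frames -> 3 latent frames
video = torch.randn(1, 3, 17, 512, 512).to("cuda").to(torch.bfloat16)

encoder = CausalVideoTokenizer(
    checkpoint_enc=f"pretrained_ckpts/{model_name}/encoder.jit")
(latent,) = encoder.encode(video)  # expected roughly (1, 16, 3, 64, 64)

decoder = CausalVideoTokenizer(
    checkpoint_dec=f"pretrained_ckpts/{model_name}/decoder.jit")
reconstructed = decoder.decode(latent)  # back to (1, 3, 17, 512, 512)
```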
Fine-Tuning with NVIDIA NeMo Framework
Cosmos models can be fine-tuned using the open NVIDIA NeMo Framework, which accelerates training on GPU-powered systems for both existing and new models. The framework:
- Shards large datasets to reduce I/O overhead.
- Saves and loads datasets deterministically to minimize repetition and compute waste.
- Optimizes communications to reduce network bandwidth usage.
These tools make fine-tuning faster and more efficient, whether on-premises or in the cloud.
Real-time Inference Performance:
The Cosmos-1.0-Autoregressive-4B model delivers efficient inference performance on 8x NVIDIA H100 GPUs. Using a 320x512-resolution, 10-FPS video setup, the model processes 9 input frames (0.9 seconds, 1280 tokens) to generate 24 future frames (2.4 seconds, 1920 tokens) at a throughput of 806 tokens per second, completing the task in just 2.38 seconds. Key development efforts include a low-resolution tokenizer, pre-trained on general data and fine-tuned for physical AI domains like AV and robotics, and fine-tuning the autoregressive model with this tokenizer. Additionally, speculative decoding was optimized with Medusa, fine-tuned on AV data from the Alpamayo dataset. This setup ensures high efficiency and precision for physical AI applications.
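These figures are internally consistent. A quick sanity check, assuming the tokenizer's causal temporal compression maps T frames to 1 + (T - 1)/8 latent frames (which matches the numbers above):

```python
H, W = 320, 512
tokens_per_latent_frame = (H // 16) * (W // 16)  # 20 * 32 = 640 spatial tokens

def latent_frames(t: int) -> int:
    """Causal 8x temporal compression: T frames -> 1 + (T - 1) // 8."""
    return 1 + (t - 1) // 8

input_tokens = latent_frames(9) * tokens_per_latent_frame       # 2 * 640 = 1280
total_tokens = latent_frames(9 + 24) * tokens_per_latent_frame  # 5 * 640 = 3200
generated_tokens = total_tokens - input_tokens                  # 1920

print(generated_tokens / 806)  # ~2.38 s of generation at 806 tokens/s
```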
Evaluating World Foundation Models with Cosmos Benchmarks
We are openly releasing Cosmos benchmarks, developed by NVIDIA Research with Stanford University and the University of Toronto, to help the physical AI community evaluate world foundation models. Cosmos benchmarks evaluate models on the 3D consistency and physical alignment necessary for robotics and autonomous vehicle development. Our first-generation models outperform baseline VideoLDM world models on these measures.
3D Consistency

Evaluates geometric accuracy using the Sampson error, a first-order approximation of the distance between an interest point and its corresponding epipolar line in another view. Lower Sampson error indicates better geometric understanding. Cosmos diffusion and autoregressive models achieve lower Sampson error and higher pose estimation success rates compared to the baseline model, VideoLDM, demonstrating superior geometric understanding.
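For reference, the standard first-order Sampson distance for a point correspondence (x, x') under a fundamental matrix F (subscripts index vector components) is:

```latex
d_{\mathrm{Sampson}}(x, x') =
  \frac{\left( x'^{\top} F \, x \right)^{2}}
       {(Fx)_{1}^{2} + (Fx)_{2}^{2} + (F^{\top} x')_{1}^{2} + (F^{\top} x')_{2}^{2}}
```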
Figure 4: Both Cosmos diffusion and autoregressive models demonstrate lower geometric error and higher pose estimation success rates than the baseline model, highlighting their superior geometric understanding.

Visual Consistency and Fidelity

Assessed using metrics like Peak Signal-to-Noise Ratio (PSNR), which measures the ratio between signal power and noise in an image. Higher PSNR values indicate better visual quality. Cosmos models consistently outperform the baseline, delivering enhanced temporal consistency.
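For reference, PSNR for a peak pixel value MAX (255 for 8-bit content) and mean squared error MSE is:

```latex
\mathrm{PSNR} = 10 \log_{10} \!\left( \frac{\mathrm{MAX}^{2}}{\mathrm{MSE}} \right)
```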
Physical Alignment

While training and evaluating for physical alignment remains a complex task, we continue to improve our models and test them against physical laws, using a system that evaluates intuitive physics through pixel-, object-, and feature-level metrics. Cosmos WFMs show improved object kinematics prediction when conditioned on more frames: models conditioned on a prompt plus 9 frames outperform those conditioned on a prompt plus 1 frame in pixel-level accuracy and visual quality, underscoring the value of data curation and of model designs that can be conditioned on richer context for better physical alignment.
Future Work
We are achieving improved results with continued training and richer conditioning, steadily improving performance on these benchmarks. Looking ahead, we aim to integrate NVIDIA Cosmos with extended physical AI platforms like NVIDIA Omniverse to address real-world challenges in robotics, autonomous vehicles, and machines.
While Cosmos world foundation models are evolving toward better physical alignment, we recognize that the journey toward true real-world physical understanding is ongoing. We remain committed to strengthening model capabilities by enhancing data curation and improving model design.
Get started with NVIDIA Cosmos, and tune into the AI Podcast featuring NVIDIA Vice President of Research Ming-Yu Liu, airing January 7.