|
--- |
|
datasets: |
|
- HuggingFaceFV/finevideo |
|
- LanguageBind/Open-Sora-Plan-v1.0.0 |
|
language: |
|
- ja |
|
- en |
|
library_name: diffusers |
|
license: apache-2.0 |
|
pipeline_tag: text-to-video |
|
tags: |
|
- art |
|
--- |
|
|
|
# Model Card for CommonVideo |
|
|
|
This is a text-to-video model trained on images and videos released under CC-BY, CC-0, and similar permissive licenses.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
At AIdeaLab, we develop AI technology through active dialogue with creators, aiming for mutual understanding and cooperation. |
|
We strive to solve challenges faced by creators and grow together. |
|
One of these challenges is that some creators and fans who want to use image generation cannot, because permission to use certain images for training has not been obtained.
|
To address this issue, we have developed CommonVideo. |
|
|
|
#### Features of CommonVideo |
|
|
|
- Trained principally on images and videos for which training permission has been obtained
|
- Understands both Japanese and English text inputs directly |
|
- Minimizes the risk of exact reproduction of training images |
|
- Utilizes cutting-edge technology for high quality and efficiency |
|
|
|
### Misc. |
|
|
|
- **Developed by:** [alfredplpl](https://huggingface.co/alfredplpl), [maty0505](https://huggingface.co/maty0505) |
|
- **Funded by:** AIdeaLab, Inc. |
|
- **Shared by:** AIdeaLab, Inc. |
|
- **Model type:** Rectified Flow Transformer |
|
- **Language(s) (NLP):** Japanese, English |
|
- **License:** Apache-2.0 |
|
|
|
### Model Sources |
|
|
|
- **Repository:** TBA |
|
- **Paper:** [blog](https://note.com/aidealab/n/n677018ea1953)
|
|
|
## How to Get Started with the Model |
|
|
|
- Using `diffusers`:
|
|
|
1. Install the required libraries:
|
|
|
```bash |
|
pip install torch torchvision transformers diffusers
|
``` |
|
|
|
2. Run the following script:
|
|
|
```python |
|
from diffusers.utils import export_to_video
import tqdm
from torchvision.transforms import ToPILImage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from diffusers import CogVideoXTransformer3DModel, AutoencoderKLCogVideoX

# Example prompt (Japanese): "Tulips, rapeseed blossoms, and flowers of every color fill an
# endlessly stretching field, coloring it like a patchwork. Soft morning light shines through
# the petals, bringing out delicate gradations. Flowers swaying in the wind are captured in
# slow motion, petals dancing gracefully, with a cinematic feel. In the background, distant
# mountain ranges, a blue sky, and floating white clouds add a sense of depth."
prompt = "チューリップや菜の花、色とりどりの花が果てしなく続く畑を埋め尽くし、まるでパッチワークのようにカラフルに彩る。朝の柔らかな光が花びらを透かし、淡いグラデーションが映える。風に揺れる花々をスローモーションで捉え、花びらが優雅に舞う姿を映画のような演出で撮影。背景には遠くに連なる山並みや青い空、浮かぶ白い雲が立体感を引き立てる。"

device = "cuda"
# Latent shape: (batch, frames / temporal compression, latent channels, height / 8, width / 8)
shape = (1, 48 // 4, 16, 256 // 8, 256 // 8)
sample_N = 25                  # number of Euler steps
torch_dtype = torch.bfloat16
eps = 1                        # smallest pseudo-timestep passed to the transformer
cfg = 2.5                      # classifier-free guidance scale

# Encode the prompt with the llm-jp text encoder.
tokenizer = AutoTokenizer.from_pretrained(
    "llm-jp/llm-jp-3-1.8b"
)

text_encoder = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-3-1.8b",
    torch_dtype=torch_dtype
)
text_encoder = text_encoder.to(device)

text_inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=512,
    truncation=True,
    add_special_tokens=True,
    return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
prompt_embeds = text_encoder(text_input_ids.to(device), output_hidden_states=True, attention_mask=text_inputs.attention_mask.to(device)).hidden_states[-1]
prompt_embeds = prompt_embeds.to(dtype=torch_dtype, device=device)

# Encode the empty prompt for classifier-free guidance.
null_text_inputs = tokenizer(
    "",
    padding="max_length",
    max_length=512,
    truncation=True,
    add_special_tokens=True,
    return_tensors="pt",
)
null_text_input_ids = null_text_inputs.input_ids
null_prompt_embeds = text_encoder(null_text_input_ids.to(device), output_hidden_states=True, attention_mask=null_text_inputs.attention_mask.to(device)).hidden_states[-1]
null_prompt_embeds = null_prompt_embeds.to(dtype=torch_dtype, device=device)

# Free VRAM: the text encoder is no longer needed.
del text_encoder

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "aidealab/commonvideo",
    torch_dtype=torch_dtype
)
transformer = transformer.to(device)

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="vae"
)
vae = vae.to(dtype=torch_dtype, device=device)
vae.enable_slicing()
vae.enable_tiling()

# Euler discrete sampler with classifier-free guidance (CFG).
z0 = torch.randn(shape, device=device)
latents = z0.detach().clone().to(torch_dtype)

dt = 1.0 / sample_N
with torch.no_grad():
    for i in tqdm.tqdm(range(sample_N)):
        num_t = i / sample_N
        t = torch.ones(shape[0], device=device) * num_t
        # Map t in [0, 1) to the 1000-step timestep range expected by the transformer.
        pseudo_t = (1000 - eps) * (1 - t) + eps
        positive_conditional = transformer(hidden_states=latents, timestep=pseudo_t, encoder_hidden_states=prompt_embeds, image_rotary_emb=None)
        null_conditional = transformer(hidden_states=latents, timestep=pseudo_t, encoder_hidden_states=null_prompt_embeds, image_rotary_emb=None)
        pred = null_conditional.sample + cfg * (positive_conditional.sample - null_conditional.sample)
        # One Euler step along the predicted velocity field.
        latents = latents.detach().clone() + dt * pred.detach().clone()

# Free VRAM: the transformer is no longer needed.
del transformer

# Decode the latents into frames.
latents = latents / vae.config.scaling_factor
latents = latents.permute(0, 2, 1, 3, 4)  # [B, F, C, H, W] -> [B, C, F, H, W] for the VAE
x = vae.decode(latents).sample
x = x / 2 + 0.5
x = x.clamp(0, 1)
x = x.permute(0, 2, 1, 3, 4).to(torch.float32)  # back to [B, F, C, H, W]
print(x.shape)
x = [ToPILImage()(frame) for frame in x[0]]

export_to_video(x, "output.mp4", fps=24)
|
``` |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
- Assistance in creating illustrations, manga, and anime |
|
- For both commercial and non-commercial purposes |
|
- Communication with creators when making requests |
|
- Commercial provision of image generation services |
|
- Please be cautious when handling generated content |
|
- Self-expression |
|
- Using this AI to express "your" uniqueness |
|
- Research and development |
|
- Fine-tuning (also known as additional training) such as LoRA |
|
- Merging with other models |
|
- Examining the performance of this model using metrics like FID (a rough sketch follows this list)
|
- Education |
|
- Graduation projects for art school or vocational school students |
|
- University students' graduation theses or project assignments |
|
- Teachers demonstrating the current state of image generation AI |
|
- Uses described in the Hugging Face Community |
|
- Please ask questions in Japanese or English |
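
The FID item above is only a pointer. As a rough, hypothetical sketch (not part of the original card) of how frame-level FID could be computed with `torchmetrics` (install `torchmetrics[image]`), using random tensors as stand-ins for real and generated frames:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FrechetInceptionDistance with normalize=True expects float images in [0, 1],
# shaped [N, 3, H, W]. The tensors below are placeholders only; in practice,
# use many frames sampled from real videos and from CommonVideo outputs.
real_frames = torch.rand(16, 3, 256, 256)       # stand-in for frames from real videos
generated_frames = torch.rand(16, 3, 256, 256)  # stand-in for frames decoded from the model

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_frames, real=True)
fid.update(generated_frames, real=False)
print(float(fid.compute()))
```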
|
|
|
### Out-of-Scope Use |
|
|
|
- Generating misinformation or disinformation
|
|
|
## Bias, Risks, and Limitations |
|
|
|
TBA |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
We used the following datasets to train the transformer:
|
|
|
- [Pixabay videos (via Open-Sora-Plan v1.0.0)](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0)
|
- [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo) |
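
As a convenience (not part of the original card), the contents of these dataset repositories can be listed with `huggingface_hub`. Note that FineVideo may require accepting its access terms and authenticating first.

```python
from huggingface_hub import list_repo_files

# List the files in each training-data repository without downloading them.
for repo_id in ["HuggingFaceFV/finevideo", "LanguageBind/Open-Sora-Plan-v1.0.0"]:
    files = list_repo_files(repo_id, repo_type="dataset")
    print(f"{repo_id}: {len(files)} files, e.g. {files[:3]}")
```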
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
#### Model Architecture
|
|
|
[CogVideoX based architecture](https://github.com/THUDM/CogVideo) |
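
A minimal sketch, assuming the `aidealab/commonvideo` transformer checkpoint used in the script above, for inspecting the CogVideoX-style configuration with `diffusers`:

```python
from diffusers import CogVideoXTransformer3DModel

# Load the transformer (CPU is fine for inspection) and print its configuration:
# number of layers, attention heads, patch size, and so on.
transformer = CogVideoXTransformer3DModel.from_pretrained("aidealab/commonvideo")
print(transformer.config)
```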
|
|
|
#### Objective
|
|
|
[Rectified Flow](https://github.com/gnobitab/RectifiedFlow) |
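
For reference, a rough sketch of the standard rectified-flow objective in our own notation (not copied from the blog): the transformer is trained as a velocity field v_theta along straight paths between Gaussian noise x0 and a clean video latent x1.

```latex
% Straight-line interpolation between noise x_0 ~ N(0, I) and data x_1:
%   x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]
% Training objective: predict the constant velocity x_1 - x_0.
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0,\; x_1,\; t}
    \left[ \left\lVert v_\theta(x_t, t) - (x_1 - x_0) \right\rVert_2^2 \right]
```

At inference, the ODE dx_t/dt = v_theta(x_t, t) is integrated from noise toward data; the sampling loop in the script above approximates this with `sample_N` Euler steps of size `dt = 1 / sample_N`.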
|
|
|
#### Software |
|
|
|
[Finetrainers based code](https://github.com/a-r-r-o-w/finetrainers) |
|
|
|
## Model Card Contact |
|
|
|
- [Contact page](https://aidealab.com/contact) |
|
|
|
## Acknowledgements

We appreciate the video providers. We are truly **standing on the shoulders of giants**.