|
--- |
|
license: mit |
|
library_name: peft |
|
base_model: meta-llama/Meta-Llama-3-8B-Instruct |
|
datasets: |
|
- chenjoya/videollm-online-chat-ego4d-134k |
|
language: |
|
- en |
|
tags: |
|
- llama |
|
- llama-3 |
|
- multimodal |
|
- llm |
|
- video stream |
|
- online video understanding |
|
- video understanding |
|
pipeline_tag: video-text-to-text |
|
--- |
|
|
|
# Model Card for VideoLLM-online-8B-v1+
|
|
|
Project page: https://showlab.github.io/videollm-online/
|
|
|
## Model Details |
|
|
|
* LLM: meta-llama/Meta-Llama-3-8B-Instruct |
|
* Vision Strategy:

  * Frame Encoder: google/siglip-large-patch16-384

  * Frame Tokens: CLS Token + Avg Pooled 3x3 Tokens (see the sketch after this list)

  * Frame FPS: 2 for training, 2~10 for inference

  * Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio
|
* Video Length: 10 minutes |
|
* Training Data: Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K |
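
The following sketch illustrates the frame-token strategy listed above: each frame is encoded with SigLIP and reduced to one pooled token plus a 3x3 grid of average-pooled patch tokens (10 tokens per frame). It is an illustrative approximation, not the repository's code; in particular, SigLIP has no literal CLS token, so its attention-pooled output stands in for it here.

```python
# Illustrative sketch only (not the repository's exact code): reduce one frame
# to 1 + 3x3 = 10 tokens with SigLIP, as described in the list above.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

ckpt = "google/siglip-large-patch16-384"
processor = SiglipImageProcessor.from_pretrained(ckpt)
encoder = SiglipVisionModel.from_pretrained(ckpt)

frame = Image.new("RGB", (384, 384))  # placeholder; use a real video frame here
inputs = processor(images=frame, return_tensors="pt")

with torch.no_grad():
    out = encoder(**inputs)

# SigLIP has no literal CLS token; its attention-pooled output plays that role here.
cls_like = out.pooler_output                        # (1, hidden)
patches = out.last_hidden_state                     # (1, 24*24, hidden) at 384px, patch 16
side = int(patches.shape[1] ** 0.5)
grid = patches.transpose(1, 2).reshape(1, -1, side, side)
pooled = F.adaptive_avg_pool2d(grid, 3)             # average-pool the 24x24 patch grid to 3x3
frame_tokens = torch.cat(
    [cls_like.unsqueeze(1), pooled.flatten(2).transpose(1, 2)], dim=1
)
print(frame_tokens.shape)                           # (1, 10, hidden): 1 + 9 tokens per frame
```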
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** https://github.com/showlab/videollm-online |
|
- **Paper:** https://arxiv.org/abs/2406.11816 |
|
|
|
## Uses |
|
|
|
- First, clone the GitHub repository and follow the installation instructions:
|
|
|
```sh |
|
git clone https://github.com/showlab/videollm-online |
|
``` |
|
|
|
Ensure you have Miniconda and Python version >= 3.10 installed, then run: |
|
```sh |
|
conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia |
|
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit |
|
pip install flash-attn --no-build-isolation |
|
``` |
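
After installation, an optional sanity check (a minimal sketch, assuming a CUDA machine) confirms that PyTorch sees the GPU and that flash-attn imports cleanly:

```python
# Optional sanity check: verify CUDA-enabled PyTorch and flash-attn are usable.
import torch
import flash_attn

print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn", flash_attn.__version__)
```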
|
|
|
The PyTorch install pulls in an ffmpeg build, but it is an old version that usually yields very low-quality preprocessing. Please install the latest static ffmpeg build as follows:
|
```sh |
|
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz |
|
tar xvf ffmpeg-release-amd64-static.tar.xz |
|
rm ffmpeg-release-amd64-static.tar.xz |
|
mv ffmpeg-7.0.1-amd64-static ffmpeg |
|
``` |
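
To confirm which build you ended up with, you can print the version of the static binary unpacked above (path assumed from the commands in this section):

```python
# Print the version line of the static ffmpeg binary unpacked above.
import subprocess

out = subprocess.run(["./ffmpeg/ffmpeg", "-version"], capture_output=True, text=True)
print(out.stdout.splitlines()[0])
```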
|
|
|
If you want to try our model with audio in real-time streaming, please also install the dependencies below and clone ChatTTS:
|
|
|
```sh |
|
pip install omegaconf vocos vector_quantize_pytorch cython |
|
git clone https://github.com/2noise/ChatTTS
|
mv ChatTTS demo/rendering/ |
|
``` |
|
|
|
- Launch the Gradio demo locally with:
|
```sh |
|
python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus |
|
``` |
|
|
|
- Or launch the CLI locally with: |
|
```sh |
|
python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus |
|
``` |
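
The demo and CLI above are the intended entry points; they rebuild the full streaming model (vision encoder, projector, and LoRA weights) from this checkpoint. If you only want to attach the published LoRA adapter to the base LLM, e.g. to inspect its weights, a minimal peft sketch might look like the following, assuming the checkpoint exposes a standard PEFT adapter config; it does not include the vision encoder or any streaming logic:

```python
# Minimal sketch: attach the LoRA adapter to the base LLM with peft.
# Assumption: this checkpoint exposes a standard PEFT adapter config at its root.
# This does NOT reproduce the streaming video pipeline from the repository.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "chenjoya/videollm-online-8b-v1plus")
model.eval()
```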
|
|
|
## Citation |
|
|
|
``` |
|
@inproceedings{videollm-online, |
|
author = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou}, |
|
title = {VideoLLM-online: Online Video Large Language Model for Streaming Video}, |
|
booktitle = {CVPR}, |
|
year = {2024}, |
|
} |
|
``` |