---
license: apache-2.0
---
# Omni-VideoAssistant

Omni-VideoAssistant is a large language model for video question answering.

[Code base](https://github.com/wanghao-cst/Omni-VideoAssistant)
## 🔥 Updates

* **[2023.12.09]** 🤗 An improved model, **V6.1**, is available now on [Hugging Face](https://huggingface.co/harvey2333/omni_video_assistant_6_1)! Welcome to **watch** this repository for the latest updates.

* **[2023.12.06]** Gradio and CLI **inference demos** are available now.

* **[2023.12.01]** 🤗 The **preview model** is available now on [Hugging Face](https://huggingface.co/harvey2333/omni_video_assistant_5_3)!
<details open><summary>💡 I also have other video-language projects that may interest you ✨.</summary><p>
> [**OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation**](https://arxiv.org/abs/2308.04126) <br>
> Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An <br>
[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/shajiayu1/MVCE/) [![arXiv](https://img.shields.io/badge/Arxiv-2308.04126-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2308.04126) <br></p></details>
## 🔨 Preparation

```bash
git clone https://github.com/wanghao-cst/Omni-VideoAssistant
cd Omni-VideoAssistant
```
```bash
conda create -n omni python=3.10 -y
conda activate omni
pip install --upgrade pip
pip install -e .
```
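
After installing, you can optionally sanity-check the environment. This check is a suggestion rather than part of the original instructions; it assumes PyTorch is pulled in as a dependency by `pip install -e .`:

```bash
# Optional sanity check (assumes torch was installed as a dependency):
# prints the PyTorch version and whether a CUDA GPU is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```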
## 🚀 Start here

### Download Omni Preview Model

Downloading the checkpoint manually is needed for CLI inference only; the Gradio web UI will download it automatically.

[Omni Preview Model 6.1](https://huggingface.co/harvey2333/omni_video_assistant_6_1)
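
One way to fetch the checkpoint for CLI use is to clone the model repository with Git LFS. This is a minimal sketch, assuming `git-lfs` is installed; any Hugging Face download method works equally well:

```bash
# Sketch: download the V6.1 checkpoint (requires git-lfs).
# The files land in ./omni_video_assistant_6_1, the default clone directory.
git lfs install
git clone https://huggingface.co/harvey2333/omni_video_assistant_6_1
```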
### Inference in Gradio Web UI

```bash
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_demo
```
<p align="left">
<img src="assets/gradio_demo.png" width=100%>
</p>
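
To reach the demo from another machine or change the port, Gradio's standard environment variables should work; this is a sketch assuming the demo uses Gradio's default server settings:

```bash
# Sketch: bind to all interfaces on port 7860 using Gradio's
# standard environment variables (assumes Gradio defaults).
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 \
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_demo
```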
### Inference in CLI

```bash
# Image understanding
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --image-file "llava/serve/examples/extreme_ironing.jpg" \
    --query "What is unusual about this image?"

# Video understanding
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --video-file "llava/serve/examples/0A8CF.mp4" \
    --query "Describe the activity in the video"
```
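
For example, with the checkpoint cloned as in the download step above, a concrete video query could look like the following; the model path here is an assumption based on the default clone directory, so adjust it to wherever your checkpoint lives:

```bash
# Sketch: video inference against the cloned V6.1 checkpoint
# (./omni_video_assistant_6_1 is the default clone target; adjust as needed).
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "./omni_video_assistant_6_1" \
    --video-file "llava/serve/examples/0A8CF.mp4" \
    --query "Describe the activity in the video"
```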
## 🔥 Results Comparison (based on model 5.3; evaluation of 6.1 is in progress)

### Image understanding

<p align="left">
<img src="assets/val_img.png" width=100%>
</p>
### Video understanding

<p align="left">
<img src="assets/val_vid.png" width=100%>
</p>
## 🙏 Acknowledgment

This work builds on [MVCE](https://github.com/shajiayu1/MVCE/) for unlimited training data generation and [LLaVA](https://github.com/haotian-liu/LLaVA/) for the pretrained model.