---
license: apache-2.0
---

# Omni-VideoAssistant

This is the code base for a Video Question Answering Large Language Model.

## 📝 Updates

- [2023.12.09] 🤗 A better model, V6.1, is available now on Hugging Face! Watch this repository for the latest updates.
- [2023.12.06] Gradio & CLI inference demos are available now.
- [2023.12.01] 🤗 The preview model is available now on Hugging Face!
💡 I also have other video-language projects that may interest you ✨.

**OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation**
Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An
github | arXiv

## 🔨 Preparation

```bash
git clone https://github.com/wanghao-cst/Omni-VideoAssistant
cd Omni-VideoAssistant
conda create -n omni python=3.10 -y
conda activate omni
pip install --upgrade pip
pip install -e .
```
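
After installation, a quick sanity check helps confirm the environment is set up correctly. This is a minimal sketch; it assumes PyTorch is pulled in as a dependency by `pip install -e .`:

```bash
# Confirm the package imports and that a CUDA device is visible
# (assumes torch was installed as a dependency)
python -c "import llava, torch; print('CUDA available:', torch.cuda.is_available())"
```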

## 🌟 Start here

### Download Omni Preview Model

Download the checkpoint manually only for CLI inference; the Gradio web UI will download it automatically: Omni Preview Model 6.1
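
If you prefer to fetch the checkpoint from the command line instead of through the browser, something like the following should work with the Hugging Face CLI. The repo id below is a placeholder, not the confirmed name; substitute the actual Omni Preview Model 6.1 repository:

```bash
# Download the checkpoint into a local directory
# (repo id is a placeholder; replace it with the real Omni Preview Model 6.1 repo)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <omni-preview-6.1-repo-id> --local-dir ./checkpoints/omni-6.1
```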

### Inference in Gradio Web UI

```bash
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.gradio_demo
```

### Inference in CLI

```bash
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --image-file "llava/serve/examples/extreme_ironing.jpg" \
    --query "What is unusual about this image?"
CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
    --model-path "path to omni checkpoints" \
    --video-file "llava/serve/examples/0A8CF.mp4" \
    --query "Describe the activity in the video"
```
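
To run inference over several videos in one pass, a plain shell loop over the same entry point works. This is a sketch; the glob and the query below are illustrative:

```bash
# Batch inference: ask the same question about every .mp4 in a directory
for f in llava/serve/examples/*.mp4; do
    CUDA_VISIBLE_DEVICES=0 python -m llava.eval.run_omni \
        --model-path "path to omni checkpoints" \
        --video-file "$f" \
        --query "Describe the activity in the video"
done
```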

## 🔥 Results Comparison (based on model 5.3; evaluation on 6.1 is in progress)

### Image understanding

### Video understanding

## 😊 Acknowledgment

This work is based on MVCE for unlimited training data generation and LLaVA for the pretrained model.