ZhangYuanhan committed
Commit 60221b9 · verified · 1 Parent(s): 293d882

Update README.md

Files changed (1)
  1. README.md +54 -6
README.md CHANGED
@@ -5,9 +5,39 @@ license: apache-2.0
5
 
6
  <br>
7
 
8
- # LLaVA-Next-Video Model Card
9
 
10
- ## Model details
11
 
12
  **Model type:**
13
  <br>
@@ -23,14 +53,14 @@ LLaVA-NeXT-Video-32B-Qwen was trained in June 2024.
23
  <br>
24
  https://github.com/LLaVA-VL/LLaVA-NeXT
25
 
26
- ## License
27
  [Qwen/Qwen1.5-32B](https://huggingface.co/Qwen/Qwen1.5-32B) license.
28
 
29
 
30
- ## Where to send questions or comments about the model
31
  https://github.com/LLaVA-VL/LLaVA-NeXT/issues
32
 
33
- ## Intended use
34
  **Primary intended uses:**
35
  <br>
36
  The primary use of LLaVA is research on large multimodal models and chatbots.
@@ -39,7 +69,7 @@ The primary use of LLaVA is research on large multimodal models and chatbots.
39
  <br>
40
  The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
41
 
42
- ## Training dataset
43
 
44
  ### Image
45
  - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
@@ -50,3 +80,21 @@ The primary intended users of the model are researchers and hobbyists in compute
50
  ### Video
51
  - 830k data
52
 
5
 
6
  <br>
7
 
8
+ ## LLaVA-NeXT-Video is upgraded 🚀
9
 
10
+ In our [LLaVA-Video blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/), released in April 2024, we shared two key observations:
11
+ - 🎬 AnyRes provides a shared and flexible representation between images and videos, and thus accommodates capability transfer between the two most common vision signals. Therefore, stronger image LMMs can naturally lead to stronger zero-shot video LMMs.
12
+ - 🗂️ High-quality language-video data, including video instruction-following data, is scarce, so naive tuning on the public data available at the time degraded performance. Therefore, there is an urgent need to build high-quality video caption and QA datasets to train LMMs for improved video performance.
13
+
14
+ Based on these insights, the new LLaVA-NeXT-Video in this release improves in two aspects:
15
+
16
+ - 🎬 A stronger image LMM ([LLaVA-NeXT-32B-Qwen](https://huggingface.co/lmms-lab/llava-next-qwen-32b)), built by initializing from the Qwen-1.5 32B LLM. We then initialize our video training from this image checkpoint.
17
+ - 🗂️ A new high-quality video dataset with 830k samples. It is combined with the LLaVA-1.6 image training data, and applying the same image-video mixed training procedure yields the new video model.
18
+ The new model achieves the best open-source performance on several video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard).
19
+
20
+ ### Resources
21
+ - **Inference Script**:
22
+ ```bash
23
+ bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-32B-Qwen 32 2 average after grid True playground/demo/xU25MMA2N4aVtYay.mp4
24
+ ```
25
+
26
+ ### Evaluation Results
27
+ | Model | NextQA-MC | Video-MME (overall) | | EgoSchema | Perception Test (val) |
28
+ |-----------------------------|-----------|--------------------|--------|----------|------------------------|
29
+ | | | w/o subs | w subs | | |
30
+ | **Proprietary** | | | | | |
31
+ | GPT-4o | - | 71.9 | 77.2 | 72.2 | - |
32
+ | Gemini 1.5 Pro | - | 75.0 | 81.3 | 72.2 | - |
33
+ | **Open-Source** | | | | | |
34
+ | VideoLLaMA 2 (8x7B) | 76.3* | 47.9 | 50.3 | 53.3 | 51.2* |
35
+ | VILA-1.5-34B | 67.89* | 60.1 | 61.1 | 58.04* | 54 |
36
+ | LLaVA-NeXT-Video (Qwen-32B) | 77.31 | 60.2 | 63.0 | 60.85 | 59.38 |
37
+
38
+ _*Results reproduced with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). Please refer to lmms-eval to reproduce them._
39
+
40
+ ### Model details
41
 
42
  **Model type:**
43
  <br>
 
53
  <br>
54
  https://github.com/LLaVA-VL/LLaVA-NeXT
55
 
56
+ ### License
57
  [Qwen/Qwen1.5-32B](https://huggingface.co/Qwen/Qwen1.5-32B) license.
58
 
59
 
60
+ ### Where to send questions or comments about the model
61
  https://github.com/LLaVA-VL/LLaVA-NeXT/issues
62
 
63
+ ### Intended use
64
  **Primary intended uses:**
65
  <br>
66
  The primary use of LLaVA is research on large multimodal models and chatbots.
 
69
  <br>
70
  The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
71
 
72
+ ### Training dataset
73
 
74
  ### Image
75
  - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
 
80
  ### Video
81
  - 830k data
82
 
83
+ ### Citations
84
+ ```bibtex
85
+ @misc{zhang2024llavanextvideo,
86
+ title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
87
+ url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
88
+ author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
89
+ month={April},
90
+ year={2024}
91
+ }
92
+
93
+ @misc{li2024llavanext-interleave,
94
+ title={LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models},
95
+ url={https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/},
96
+ author={Li, Feng and Zhang, Renrui and Zhang, Hao and Zhang, Yuanhan and Li, Bo and Li, Wei and Ma, Zejun and Li, Chunyuan},
97
+ month={June},
98
+ year={2024}
99
+ }
100
+ ```