Update README.md
README.md
CHANGED
@@ -21,24 +21,23 @@ pipeline_tag: visual-question-answering
 and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
 a large video evaluation dataset with multi-aspect human scores.
 
-- VideoScore can reach 75+ Spearman correlation with humans on
-
-
+- VideoScore can reach 75+ Spearman correlation with humans on VideoFeedback-test, surpassing all the MLLM-prompting methods and feature-based metrics.
+VideoScore also beats the best baselines on the other three benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+For the data details of the four benchmarks, please refer to [VideoScore-Bench](https://huggingface.co/datasets/TIGER-Lab/VideoScore-Bench).
 
 - **This is the regression version of VideoScore**
 
 ## Evaluation Results
 
-We test our video evaluation model VideoScore on
+We test our video evaluation model series VideoScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
 For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
 averaged over all the evaluation aspects, as the indicator.
 For GenAI-Bench and VBench, which include human preference data among two or more videos,
 we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
 
-- We use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset
-for VideoFeedback-test set, while for other three benchmarks.
+- For the VideoFeedback-test benchmark, we use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore), trained on the entire VideoFeedback dataset.
 
-- We use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on VideoFeedback dataset
+- For the other three benchmarks, GenAI-Bench, VBench and EvalCrafter, we use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only), trained on the VideoFeedback dataset
 excluding the real videos.
 
 The evaluation results are shown below:
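As a reading aid for the protocol in the hunk above, here is a minimal sketch of the two metrics: per-aspect Spearman correlation averaged over aspects (VideoFeedback-test, EvalCrafter) and pairwise accuracy of preferences predicted from scalar scores (GenAI-Bench, VBench). The array shapes, tie handling and names are illustrative assumptions, not the authors' released evaluation code; the 75+ figure quoted above reads as the averaged correlation scaled by 100.

```python
# Illustrative sketch only: shapes, tie handling and variable names are
# assumptions, not the authors' released evaluation code.
import numpy as np
from scipy.stats import spearmanr


def averaged_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    """Per-aspect Spearman correlation, averaged over all aspects.

    Both arrays have shape (num_videos, num_aspects), e.g. one column per
    VideoFeedback evaluation aspect.
    """
    corrs = [
        spearmanr(model_scores[:, a], human_scores[:, a]).correlation
        for a in range(model_scores.shape[1])
    ]
    return float(np.mean(corrs))


def pairwise_accuracy(score_pairs, human_choices) -> float:
    """Accuracy of preferences predicted from scalar scores.

    `score_pairs` holds (score_a, score_b) for each video pair and
    `human_choices` the human label: "a", "b" or "tie".
    """
    correct = 0
    for (s_a, s_b), choice in zip(score_pairs, human_choices):
        pred = "a" if s_a > s_b else "b" if s_b > s_a else "tie"
        correct += pred == choice
    return correct / len(human_choices)


if __name__ == "__main__":
    # Tiny smoke test on synthetic ratings: noisy copies of the human
    # scores should yield a high averaged Spearman correlation.
    rng = np.random.default_rng(0)
    human = rng.integers(1, 5, size=(100, 5)).astype(float)
    model = human + rng.normal(scale=0.5, size=human.shape)
    print(f"averaged Spearman: {averaged_spearman(model, human):.3f}")
```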