hexuan21 committed on
Commit 3e4da21 · verified · 1 Parent(s): e6b6996

Update README.md

Files changed (1): README.md (+6 -7)
README.md CHANGED
@@ -21,24 +21,23 @@ pipeline_tag: visual-question-answering
 and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
 a large video evaluation dataset with multi-aspect human scores.
 
-- VideoScore can reach 75+ Spearman correlation with humans on VideoEval-test, surpassing all the MLLM-prompting methods and feature-based metrics.
-
-- VideoScore also beat the best baselines on other three benchmarks EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+- VideoScore can reach 75+ Spearman correlation with humans on VideoFeedback-test, surpassing all the MLLM-prompting methods and feature-based metrics.
+VideoScore also beats the best baselines on the other three benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+For the data details of the four benchmarks, please refer to [VideoScore-Bench](https://huggingface.co/datasets/TIGER-Lab/VideoScore-Bench).
 
 - **This is the regression version of VideoScore**
 
 ## Evaluation Results
 
-We test our video evaluation model VideoScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
+We test our video evaluation model series VideoScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
 For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
 averaged over all the evaluation aspects, as the indicator.
 For GenAI-Bench and VBench, which include human preference data among two or more videos,
 we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
 
-- We use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset
-for VideoFeedback-test set, while for other three benchmarks.
+- For the benchmark VideoFeedback-test, we use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset.
 
-- We use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on VideoFeedback dataset
+- For the other three benchmarks, GenAI-Bench, VBench and EvalCrafter, we use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on the VideoFeedback dataset
 excluding the real videos.
 
 The evaluation results are shown below:
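The two evaluation protocols described in the README (Spearman correlation against human ratings, and pairwise accuracy on human preference pairs) can be sketched in pure Python. This is an illustrative sketch, not the VideoScore evaluation code; the function names and data layout are assumptions.

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def pairwise_accuracy(model_scores, human_prefs):
    """human_prefs: (i, j) pairs meaning humans preferred video i over video j.
    Counts how often the model's score agrees with the human preference."""
    correct = sum(model_scores[i] > model_scores[j] for i, j in human_prefs)
    return correct / len(human_prefs)
```

For VideoFeedback-test and EvalCrafter one would compute `spearman` per evaluation aspect and average the results; for GenAI-Bench and VBench, `pairwise_accuracy` over the annotated preference pairs.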