hexuan21 committed on
Commit 3e4da21 · verified · 1 Parent(s): e6b6996

Update README.md

Files changed (1): README.md (+6 -7)
README.md CHANGED
@@ -21,24 +21,23 @@ pipeline_tag: visual-question-answering
 and trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
 a large video evaluation dataset with multi-aspect human scores.
 
-- VideoScore can reach 75+ Spearman correlation with humans on VideoEval-test, surpassing all the MLLM-prompting methods and feature-based metrics.
-
-- VideoScore also beat the best baselines on other three benchmarks EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+- VideoScore can reach 75+ Spearman correlation with humans on VideoFeedback-test, surpassing all the MLLM-prompting methods and feature-based metrics.
+VideoScore also beats the best baselines on the other three benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+For the data details of the four benchmarks, please refer to [VideoScore-Bench](https://huggingface.co/datasets/TIGER-Lab/VideoScore-Bench).
 
 - **This is the regression version of VideoScore**
 
 ## Evaluation Results
 
-We test our video evaluation model VideoScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
+We test our video evaluation model series VideoScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
 For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
 averaged over all the evaluation aspects, as the indicator.
 For GenAI-Bench and VBench, which include human preference data among two or more videos,
 we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
 
-- We use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset
-for VideoFeedback-test set, while for other three benchmarks.
+- For the benchmark VideoFeedback-test, we use [VideoScore](https://huggingface.co/TIGER-Lab/VideoScore) trained on the entire VideoFeedback dataset.
 
-- We use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on VideoFeedback dataset
+- For the other three benchmarks, GenAI-Bench, VBench and EvalCrafter, we use [VideoScore-anno-only](https://huggingface.co/TIGER-Lab/VideoScore-anno-only) trained on the VideoFeedback dataset
 excluding the real videos.
 
 The evaluation results are shown below:
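The two evaluation protocols described in the README (Spearman correlation against human ratings, and pairwise accuracy on human preference pairs) can be sketched in pure Python. This is an illustrative sketch, not the VideoScore evaluation code; the function names and data layout are assumptions.

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def pairwise_accuracy(model_scores, human_prefs):
    """human_prefs: (i, j) pairs meaning humans preferred video i over video j.
    Counts how often the model's score agrees with the human preference."""
    correct = sum(model_scores[i] > model_scores[j] for i, j in human_prefs)
    return correct / len(human_prefs)
```

For VideoFeedback-test and EvalCrafter one would compute `spearman` per evaluation aspect and average the results; for GenAI-Bench and VBench, `pairwise_accuracy` over the annotated preference pairs.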