TIGER-Lab/VideoScore

@@ -19,7 +19,52 @@ pipeline_tag: visual-question-answering
 ![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)
 ## Introduction
-- MantisScore is a video quality evaluation model, trained on VideoEval[VideoEval](https://huggingface.co/datasets/TIGER-Lab/VideoEval),
 a large video evaluation dataset with multi-aspect human scores.
-- MantisScore trained on

 ![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)
 ## Introduction
+- MantisScore is a video quality evaluation model that uses [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model
+and is trained on [VideoEval](https://huggingface.co/datasets/TIGER-Lab/VideoEval),
 a large video evaluation dataset with multi-aspect human scores.
+- MantisScore reaches a Spearman correlation of 75+ (×100) with human ratings on VideoEval-test, surpassing all MLLM-prompting methods and feature-based metrics.
+- MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.
+## Performance
+### Evaluation Results on 4 Benchmarks
+We test our video evaluation model MantisScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
+For the first two benchmarks, we report the Spearman correlation between the model's output and human ratings,
+averaged over all evaluation aspects.
+For GenAI-Bench and VBench, which contain human preference data over two or more videos,
+we use the model's output to predict preferences and report pairwise accuracy.
+| Method           | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
+|------------------|----------------:|---------------:|------------:|------------:|-------:|
+| MantisScore      |                 |                |             |             |        |
+| Gemini-1.5-Pro   |           158.8 |           22.1 |        22.9 |        60.9 |   52.9 |
+| Gemini-1.5-Flash |           157.5 |           20.8 |        17.3 |        67.1 |   52.3 |
+| GPT-4o           |           155.4 |           23.1 |        28.7 |        52.0 |   51.7 |
+| CLIP-sim         |           126.8 |            8.9 |        36.2 |        34.2 |   47.4 |
+| DINO-sim         |           121.3 |            7.5 |        32.1 |        38.5 |   43.3 |
+| SSIM-sim         |           118.0 |           13.4 |        26.9 |        34.1 |   43.5 |
+| CLIP-Score       |           114.4 |           -7.2 |        21.7 |        45.0 |   54.9 |
+| LLaVA-1.5-7B     |           108.3 |            8.5 |        10.5 |        49.9 |   39.4 |
+| LLaVA-1.6-7B     |            93.3 |           -3.1 |        13.2 |        44.5 |   38.7 |
+| X-CLIP-Score     |            92.9 |           -1.9 |        13.3 |        41.4 |   40.1 |
+| PIQE             |            78.3 |          -10.1 |        -1.2 |        34.5 |   55.1 |
+| BRISQUE          |            75.9 |          -20.3 |         3.9 |        38.5 |   53.7 |
+| SSIM-dyn         |            42.5 |           -5.5 |       -17.0 |        28.4 |   36.5 |
+| MSE-dyn          |            36.7 |          -12.9 |       -26.4 |        31.4 |   44.5 |
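
The two indicators described above (Spearman correlation averaged over aspects, and pairwise accuracy on preference data) can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation script used for the table: the aspect names, the toy scores, and the assumption that each video is reduced to a single scalar before predicting preferences are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_spearman_over_aspects(model_scores, human_scores):
    """Average the per-aspect correlation; inputs map aspect -> list of scores."""
    return mean(spearman(model_scores[a], human_scores[a]) for a in human_scores)


def pairwise_accuracy(video_scores, human_prefs):
    """video_scores: video id -> scalar score (assumed aggregation).

    human_prefs: list of (preferred_id, other_id) pairs from annotators.
    """
    correct = sum(video_scores[win] > video_scores[lose] for win, lose in human_prefs)
    return correct / len(human_prefs)


# Toy data with hypothetical aspect names and video ids.
toy_model = {
    "visual quality": [3.1, 2.0, 3.8, 1.2],
    "temporal consistency": [2.5, 2.9, 1.0, 3.3],
}
toy_human = {
    "visual quality": [3, 2, 4, 1],
    "temporal consistency": [3, 2, 1, 4],
}
toy_video_scores = {"a": 2.7, "b": 1.9, "c": 3.4}
toy_prefs = [("a", "b"), ("c", "a"), ("b", "c")]

print(mean_spearman_over_aspects(toy_model, toy_human))
print(pairwise_accuracy(toy_video_scores, toy_prefs))
```

Multiplying either result by 100 gives numbers on the same scale as the table. A library routine such as `scipy.stats.spearmanr` could replace the hand-written `spearman` helper.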
+## Usage
+### Installation
+```bash
+pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
+```
+### Inference
+### Training
+MantisScore is trained on
+### Evaluation
+## Citation