tttoaster committed on
Commit
d0b41f3
·
1 Parent(s): 04a0fb2

Upload 15 files

Files changed (1)
  1. constants.py +2 -2
constants.py CHANGED
@@ -34,7 +34,7 @@ LEADERBORAD_INTRODUCTION = """# SEED-Bench Leaderboard
  SEED-Bench-1 consists of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both the spatial and temporal understanding.
  Please refer to [SEED-Bench-1 paper](https://arxiv.org/abs/2307.16125) for more details.
 
- SEED-Bench-2 comprises 24K multiple-choice questions with accurate human anno- tations, which spans 27 dimensions, including the evalu- ation of both text and image generation.
  Please refer to [SEED-Bench-2 paper](https://arxiv.org/abs/2311.17092) for more details.
  """
 
@@ -104,7 +104,7 @@ TABLE_INTRODUCTION = """In the table below, we summarize each task performance o
  LEADERBORAD_INFO = """
  Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
  [SEED-Bench-1](https://arxiv.org/abs/2307.16125) consists of 19K multiple choice questions with accurate human annotations (x6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality.
- [SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human anno- tations, which spans 27 dimensions, including the evalu- ation of both text and image generation.
  We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes.
  Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation.
  By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research.
 
  SEED-Bench-1 consists of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both the spatial and temporal understanding.
  Please refer to [SEED-Bench-1 paper](https://arxiv.org/abs/2307.16125) for more details.
 
+ SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
  Please refer to [SEED-Bench-2 paper](https://arxiv.org/abs/2311.17092) for more details.
  """
 
 
  LEADERBORAD_INFO = """
  Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
  [SEED-Bench-1](https://arxiv.org/abs/2307.16125) consists of 19K multiple choice questions with accurate human annotations (x6 larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality.
+ [SEED-Bench-2](https://arxiv.org/abs/2311.17092) comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation.
  We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes.
  Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation.
  By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research.