Commit 175efb2 by xuanricheng · 1 Parent(s): f97f2b7

update about

Files changed (3)
  1. README.md +1 -1
  2. src/display/about.py +23 -111
  3. src/display/utils.py +6 -6
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Chinese Open LLM Leaderboard
3
  emoji: 🏆
4
  colorFrom: green
5
  colorTo: indigo
 
1
  ---
2
+ title: Open Chinese LLM Leaderboard
3
  emoji: 🏆
4
  colorFrom: green
5
  colorTo: indigo
src/display/about.py CHANGED
@@ -1,29 +1,37 @@
1
  from src.display.utils import ModelType
2
 
3
- TITLE = """<h1 align="center" id="space-title">🤗 Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
- 📐 The 🤗 Open Chinese LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
7
- This leaderboard is subset of the [FlagEval](https://flageval.baai.ac.cn/)
 
 
 
 
 
8
 
9
- 🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
10
- The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
11
  """
12
 
13
  LLM_BENCHMARKS_TEXT = f"""
14
  # Context
15
- With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.
 
 
 
 
16
 
17
  ## How it works
18
 
19
  📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
20
 
21
- - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
22
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
23
- - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
24
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually, at a minimum, a 6-shot task, as 6 examples are systematically prepended even when it is launched with 0 few-shot examples.
25
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
26
  - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
 
 
27
 
28
  For all these evaluations, a higher score is a better score.
29
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
@@ -43,12 +51,13 @@ The total batch size we get for models which fit on one A100 node is 8 (8 GPUs *
43
  *You can expect results to vary slightly for different batch sizes because of padding.*
44
 
45
  The tasks and few shots parameters are:
46
- - ARC: 25-shot, *arc-challenge* (`acc_norm`)
47
- - HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
48
- - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
49
- - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
50
- - Winogrande: 5-shot, *winogrande* (`acc`)
51
- - GSM8k: 5-shot, *gsm8k* (`acc`)
 
52
 
53
  Side note on the baseline scores:
54
  - for log-likelihood evaluation, we select the random baseline
@@ -63,14 +72,9 @@ If there is no icon, we have not uploaded the information on the model yet, feel
63
 
64
  "Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
65
 
66
- ## Quantization
67
- To get more information about quantization, see:
68
- - 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
69
- - 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
70
 
71
  ## Useful links
72
  - [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
73
- - [Collection of best models](https://huggingface.co/collections/open-cn-llm-leaderboard/chinese-llm-leaderboard-best-models-65b0d4511dbd85fd0c3ad9cd)
74
  """
75
 
76
  FAQ_TEXT = """
@@ -170,96 +174,4 @@ If everything is done, check you can launch the EleutherAIHarness on your model
170
 
171
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
172
  CITATION_BUTTON_TEXT = r"""
173
- @misc{open-llm-leaderboard,
174
- author = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
175
- title = {Open LLM Leaderboard},
176
- year = {2023},
177
- publisher = {Hugging Face},
178
- howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
179
- }
180
- @software{eval-harness,
181
- author = {Gao, Leo and
182
- Tow, Jonathan and
183
- Biderman, Stella and
184
- Black, Sid and
185
- DiPofi, Anthony and
186
- Foster, Charles and
187
- Golding, Laurence and
188
- Hsu, Jeffrey and
189
- McDonell, Kyle and
190
- Muennighoff, Niklas and
191
- Phang, Jason and
192
- Reynolds, Laria and
193
- Tang, Eric and
194
- Thite, Anish and
195
- Wang, Ben and
196
- Wang, Kevin and
197
- Zou, Andy},
198
- title = {A framework for few-shot language model evaluation},
199
- month = sep,
200
- year = 2021,
201
- publisher = {Zenodo},
202
- version = {v0.0.1},
203
- doi = {10.5281/zenodo.5371628},
204
- url = {https://doi.org/10.5281/zenodo.5371628}
205
- }
206
- @misc{clark2018think,
207
- title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
208
- author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
209
- year={2018},
210
- eprint={1803.05457},
211
- archivePrefix={arXiv},
212
- primaryClass={cs.AI}
213
- }
214
- @misc{zellers2019hellaswag,
215
- title={HellaSwag: Can a Machine Really Finish Your Sentence?},
216
- author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
217
- year={2019},
218
- eprint={1905.07830},
219
- archivePrefix={arXiv},
220
- primaryClass={cs.CL}
221
- }
222
- @misc{hendrycks2021measuring,
223
- title={Measuring Massive Multitask Language Understanding},
224
- author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
225
- year={2021},
226
- eprint={2009.03300},
227
- archivePrefix={arXiv},
228
- primaryClass={cs.CY}
229
- }
230
- @misc{lin2022truthfulqa,
231
- title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
232
- author={Stephanie Lin and Jacob Hilton and Owain Evans},
233
- year={2022},
234
- eprint={2109.07958},
235
- archivePrefix={arXiv},
236
- primaryClass={cs.CL}
237
- }
238
- @misc{DBLP:journals/corr/abs-1907-10641,
239
- title={{WINOGRANDE:} An Adversarial Winograd Schema Challenge at Scale},
240
- author={Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
241
- year={2019},
242
- eprint={1907.10641},
243
- archivePrefix={arXiv},
244
- primaryClass={cs.CL}
245
- }
246
- @misc{DBLP:journals/corr/abs-2110-14168,
247
- title={Training Verifiers to Solve Math Word Problems},
248
- author={Karl Cobbe and
249
- Vineet Kosaraju and
250
- Mohammad Bavarian and
251
- Mark Chen and
252
- Heewoo Jun and
253
- Lukasz Kaiser and
254
- Matthias Plappert and
255
- Jerry Tworek and
256
- Jacob Hilton and
257
- Reiichiro Nakano and
258
- Christopher Hesse and
259
- John Schulman},
260
- year={2021},
261
- eprint={2110.14168},
262
- archivePrefix={arXiv},
263
- primaryClass={cs.CL}
264
- }
265
  """
 
1
  from src.display.utils import ModelType
2
 
3
+ TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
+ Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由FlagEval平台提供相应算力和运行环境。
7
+ 评估数据集是全部都是中文数据集以评估中文能力如需查看详情信息,请查阅‘关于’页面。
8
+ 如需对模型进行更全面的评测,可以登录FlagEval平台,体验更加完善的模型评测功能。
9
+
10
+ The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). The leaderboard runs on the [FlagEval](https://flageval.baai.ac.cn/) platform, which provides the computing resources and runtime environment.
11
+ The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
12
+ For a more comprehensive evaluation of a model, log in to the [FlagEval](https://flageval.baai.ac.cn/) platform, which offers more complete evaluation features.
13
 
 
 
14
  """
15
 
16
  LLM_BENCHMARKS_TEXT = f"""
17
  # Context
18
+ Open Chinese LLM Leaderboard是中文大语言排行榜,我们希望能够推动更加开放的生态,让中文大语言模型开发者参与进来,为推动中文的大语言模型进步做出相应的贡献。
19
+ 为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
20
+
21
+ The Open Chinese LLM Leaderboard is a leaderboard for Chinese large language models. We hope to foster a more open ecosystem and invite developers of Chinese LLMs to take part and contribute to the advancement of the field.
22
+ To keep the comparison fair, all models are evaluated on the FlagEval platform using standardized GPUs and a uniform environment.
23
 
24
  ## How it works
25
 
26
  📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
27
 
28
+ - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
29
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 
30
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually, at a minimum, a 6-shot task, as 6 examples are systematically prepended even when it is launched with 0 few-shot examples.
31
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
32
  - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
33
+ - <a href="https://flageval.baai.ac.cn/#/taskIntro?t=zh_qa" target="_blank"> C-SEM </a> (5-shot) - Semantic understanding is seen as a key cornerstone in the research and application of natural language processing. However, there is still a lack of publicly available benchmarks that approach from a linguistic perspective in the field of evaluating large Chinese language models.
34
+ - <a href="https://arxiv.org/abs/2306.09212" target="_blank"> CMMLU </a> (5-shot) - CMMLU is a comprehensive evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels.
35
 
36
  For all these evaluations, a higher score is a better score.
37
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 
51
  *You can expect results to vary slightly for different batch sizes because of padding.*
52
 
53
  The tasks and few shots parameters are:
54
+ - C-ARC: 25-shot, *arc-challenge* (`acc_norm`)
55
+ - C-HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
56
+ - C-TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
57
+ - C-Winogrande: 5-shot, *winogrande* (`acc`)
58
+ - C-GSM8k: 5-shot, *gsm8k* (`acc`)
59
+ - C-SEM-V2: 5-shot, *c-sem-v2* (`acc`)
60
+ - CMMLU: 5-shot, *cmmlu* (`acc`)
61
 
62
  Side note on the baseline scores:
63
  - for log-likelihood evaluation, we select the random baseline
 
72
 
73
  "Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
74
 
 
 
 
 
75
 
76
  ## Useful links
77
  - [Community resources](https://huggingface.co/spaces/BAAI/open_cn_llm_leaderboard/discussions/174)
 
78
  """
79
 
80
  FAQ_TEXT = """
 
174
 
175
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
176
  CITATION_BUTTON_TEXT = r"""
177
  """
src/display/utils.py CHANGED
@@ -14,13 +14,13 @@ class Task:
14
  col_name: str
15
 
16
  class Tasks(Enum):
17
- arc = Task("arc:challenge", "acc_norm", "C-ARC")
18
- hellaswag = Task("hellaswag", "acc_norm", "C-HellaSwag")
19
- truthfulqa = Task("truthfulqa:mc", "mc2", "C-TruthfulQA")
20
- winogrande = Task("winogrande", "acc", "C-Winogrande")
21
- gsm8k = Task("gsm8k", "acc", "C-GSM8K")
22
  c_sem = Task("c-sem-v2", "acc", "C-SEM")
23
- mmlu = Task("cmmlu", "acc", "C-MMLU")
24
 
25
  # These classes are for user facing column names,
26
  # to avoid having to change them all around the code
 
14
  col_name: str
15
 
16
  class Tasks(Enum):
17
+ arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
18
+ hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
19
+ truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
20
+ winogrande = Task("c_winogrande", "acc", "C-Winogrande")
21
+ gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
22
  c_sem = Task("c-sem-v2", "acc", "C-SEM")
23
+ mmlu = Task("cmmlu", "acc_norm", "C-MMLU")
24
 
25
  # These classes are for user facing column names,
26
  # to avoid having to change them all around the code
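
As a rough illustration of how the updated `Tasks` enum could drive score collection, here is a minimal sketch. The diff only shows the `col_name` field of `Task`; the field names `benchmark` and `metric`, the helpers `collect_scores` and `average_score`, and the shape of the results dictionary are all assumptions made for this example and are not taken from the repository.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Task:
    # Only `col_name` is visible in the diff; `benchmark` and `metric`
    # are assumed names for the first two positional fields.
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    # Values copied from the new side of the src/display/utils.py diff.
    arc = Task("c_arc_challenge", "acc_norm", "C-ARC")
    hellaswag = Task("c_hellaswag", "acc_norm", "C-HellaSwag")
    truthfulqa = Task("c_truthfulqa_mc", "mc2", "C-TruthfulQA")
    winogrande = Task("c_winogrande", "acc", "C-Winogrande")
    gsm8k = Task("c_gsm8k", "acc", "C-GSM8K")
    c_sem = Task("c-sem-v2", "acc", "C-SEM")
    mmlu = Task("cmmlu", "acc_norm", "C-MMLU")


def collect_scores(raw_results):
    """Map each display column name to its score as a percentage.

    `raw_results` is a hypothetical dict of the form
    {benchmark_key: {metric_name: value_between_0_and_1}}.
    """
    scores = {}
    for task in Tasks:
        t = task.value
        scores[t.col_name] = raw_results[t.benchmark][t.metric] * 100.0
    return scores


def average_score(scores):
    """Unweighted mean over all benchmark columns."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Fabricated placeholder input, used only to exercise the helpers.
    fake = {t.value.benchmark: {t.value.metric: 0.5} for t in Tasks}
    scores = collect_scores(fake)
    print(scores["C-ARC"], average_score(scores))
```

Because each lookup uses the benchmark key and metric name stored in the enum, a renamed key (such as `arc:challenge` becoming `c_arc_challenge`) or a changed metric (such as `acc` becoming `acc_norm` for `cmmlu`) only needs to be updated in one place.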