Add HellaSwag Benchmark
Ludwig Stumpp committed on Commit · e1aeb72 · 1 Parent(s): 5e1e4f6
README.md CHANGED
@@ -8,52 +8,53 @@ https://llm-leaderboard.streamlit.app/
|
|
8 |
|
9 |
## Leaderboard
|
10 |
|
11 |
-
| Model Name | Commercial Use? | Chatbot Arena Elo | HumanEval-Python (pass@1)
|
12 |
-
| -------------------------------------------------------------------------------------- | --------------- | ------------------------------------------------ |
|
13 |
-
| [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) |
|
14 |
-
| [bloom-176b](https://huggingface.co/bigscience/bloom) | yes | | [0.155](https://huggingface.co/bigscience/bloom#results)
|
15 |
-
| [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | yes | |
|
16 |
-
| [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | yes | |
|
17 |
-
| [chatglm-6b](https://chatglm.cn/blog) | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) |
|
18 |
-
| [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | no | |
|
19 |
-
| [code-cushman-001](https://arxiv.org/abs/2107.03374) | no | | [0.335](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
20 |
-
| [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | yes | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | |
|
21 |
-
| [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | yes | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
22 |
-
| [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | yes | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
23 |
-
| [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | no | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
24 |
-
| [codex-12b](https://arxiv.org/abs/2107.03374v2) | no | | [0.288](https://arxiv.org/abs/2107.03374v2) | | | [0.685](https://arxiv.org/abs/2301.12652v2) | |
|
25 |
-
| [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) |
|
26 |
-
| [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | yes | |
|
27 |
-
| [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | yes | |
|
28 |
-
| [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) |
|
29 |
-
| [gal-120b](https://arxiv.org/abs/2211.09085v1) | no | |
|
30 |
-
| [gpt-3-175b](https://arxiv.org/abs/2005.14165) | no | |
|
31 |
-
| [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | yes | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | [0.700](https://arxiv.org/abs/2303.08774v3) | |
|
32 |
-
| [gpt-4](https://arxiv.org/abs/2303.08774v3) | yes | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | [0.864](https://arxiv.org/abs/2303.08774v3) | |
|
33 |
-
| [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | yes | |
|
34 |
-
| [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | yes | |
|
35 |
-
| [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) |
|
36 |
-
| [llama-7b](https://arxiv.org/abs/2302.13971) | no | | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
|
37 |
-
| [llama-13b](https://arxiv.org/abs/2302.13971) | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
38 |
-
| [llama-33b](https://arxiv.org/abs/2302.13971) | no | | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
39 |
-
| [llama-65b](https://arxiv.org/abs/2302.13971) | no | | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | [0.634](https://arxiv.org/abs/2302.13971v1) | |
|
40 |
-
| [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | yes | |
|
41 |
-
| [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) |
|
42 |
-
| [opt-7b](https://huggingface.co/facebook/opt-6.7b) | no | |
|
43 |
-
| [opt-13b](https://huggingface.co/facebook/opt-13b) | no | |
|
44 |
-
| [palm-540b](https://arxiv.org/abs/2204.02311v5) | no | | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | [0.693](https://arxiv.org/abs/2204.02311v5) | |
|
45 |
-
| [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | yes | |
|
46 |
-
| [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | no | [858](https://lmsys.org/blog/2023-05-03-arena/) |
|
47 |
-
| [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | yes | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
48 |
-
| [starcoder-16b](https://huggingface.co/bigcode/starcoder) | yes | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
49 |
-
| [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | yes | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
50 |
-
| [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) |
|
51 |
|
52 |
## Benchmarks
|
53 |
|
54 |
| Benchmark Name | Author | Link | Description |
|
55 |
| ----------------- | ---------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
56 |
| Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
|
|
|
57 |
| HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
|
58 |
| LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
|
59 |
| MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
|
|
|
8 |
|
9 |
## Leaderboard
|
10 |
|
11 |
+
| Model Name | Commercial Use? | Chatbot Arena Elo | HellaSwag (zero-shot) | HumanEval-Python (pass@1) | LAMBADA (zero-shot) | MMLU (zero-shot) | MMLU (few-shot) | TriviaQA (zero-shot) |
|
12 |
+
| -------------------------------------------------------------------------------------- | --------------- | ------------------------------------------------ | --------------------------------------------- | ------------------------------------------------------------------------------- | --------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------- | --------------------------------------------- |
|
13 |
+
| [alpaca-13b](https://crfm.stanford.edu/2023/03/13/alpaca.html) | no | [1008](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
14 |
+
| [bloom-176b](https://huggingface.co/bigscience/bloom) | yes | | | [0.155](https://huggingface.co/bigscience/bloom#results) | | | | |
|
15 |
+
| [cerebras-gpt-7b](https://huggingface.co/cerebras/Cerebras-GPT-6.7B) | yes | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | | [0.636](https://www.mosaicml.com/blog/mpt-7b) | [0.259](https://www.mosaicml.com/blog/mpt-7b) | | [0.141](https://www.mosaicml.com/blog/mpt-7b) |
|
16 |
+
| [cerebras-gpt-13b](https://huggingface.co/cerebras/Cerebras-GPT-13B) | yes | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | | [0.635](https://www.mosaicml.com/blog/mpt-7b) | [0.258](https://www.mosaicml.com/blog/mpt-7b) | | [0.146](https://www.mosaicml.com/blog/mpt-7b) |
|
17 |
+
| [chatglm-6b](https://chatglm.cn/blog) | yes | [985](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
18 |
+
| [chinchilla-70b](https://arxiv.org/abs/2203.15556v1) | no | | [0.808](https://arxiv.org/abs/2203.15556v1) | | [0.774](https://arxiv.org/abs/2203.15556v1) | | [0.675](https://arxiv.org/abs/2203.15556v1) | |
|
19 |
+
| [code-cushman-001](https://arxiv.org/abs/2107.03374) | no | | | [0.335](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
20 |
+
| [code-davinci-002](https://arxiv.org/abs/2207.10397v2) | yes | | | [0.658](https://arxiv.org/abs/2207.10397v2) | | | | |
|
21 |
+
| [codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | yes | | | [0.293](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
22 |
+
| [codegen-16B-multi](https://huggingface.co/Salesforce/codegen-16B-multi) | yes | | | [0.183](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
23 |
+
| [codegx-13b](http://keg.cs.tsinghua.edu.cn/codegeex/) | no | | | [0.229](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
24 |
+
| [codex-12b](https://arxiv.org/abs/2107.03374v2) | no | | | [0.288](https://arxiv.org/abs/2107.03374v2) | | | [0.685](https://arxiv.org/abs/2301.12652v2) | |
|
25 |
+
| [dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) | yes | [944](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
26 |
+
| [eleuther-pythia-7b](https://huggingface.co/EleutherAI/pythia-6.9b) | yes | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | | [0.667](https://www.mosaicml.com/blog/mpt-7b) | [0.265](https://www.mosaicml.com/blog/mpt-7b) | | [0.198](https://www.mosaicml.com/blog/mpt-7b) |
|
27 |
+
| [eleuther-pythia-12b](https://huggingface.co/EleutherAI/pythia-12b) | yes | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | | [0.704](https://www.mosaicml.com/blog/mpt-7b) | [0.253](https://www.mosaicml.com/blog/mpt-7b) | | [0.233](https://www.mosaicml.com/blog/mpt-7b) |
|
28 |
+
| [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) | yes | [951](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
29 |
+
| [gal-120b](https://arxiv.org/abs/2211.09085v1) | no | | | | | [0.526](https://paperswithcode.com/paper/galactica-a-large-language-model-for-science-1) | | |
|
30 |
+
| [gpt-3-175b](https://arxiv.org/abs/2005.14165) | no | | [0.789](https://arxiv.org/abs/2005.14165) | | | | [0.439](https://arxiv.org/abs/2005.14165) | |
|
31 |
+
| [gpt-3.5-175b](https://arxiv.org/abs/2303.08774v3) | yes | | | [0.481](https://arxiv.org/abs/2303.08774v3) | [0.762](https://arxiv.org/abs/2303.08774v3) | | [0.700](https://arxiv.org/abs/2303.08774v3) | |
|
32 |
+
| [gpt-4](https://arxiv.org/abs/2303.08774v3) | yes | | | [0.670](https://arxiv.org/abs/2303.08774v3) | | | [0.864](https://arxiv.org/abs/2303.08774v3) | |
|
33 |
+
| [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) | yes | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | | [0.719](https://www.mosaicml.com/blog/mpt-7b) | [0.269](https://www.mosaicml.com/blog/mpt-7b) | [0.336](https://arxiv.org/abs/2204.06745v1) | [0.347](https://www.mosaicml.com/blog/mpt-7b) |
|
34 |
+
| [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) | yes | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | | [0.683](https://www.mosaicml.com/blog/mpt-7b) | [0.261](https://www.mosaicml.com/blog/mpt-7b) | | [0.234](https://www.mosaicml.com/blog/mpt-7b) |
|
35 |
+
| [koala-13b](https://bair.berkeley.edu/blog/2023/04/03/koala/) | no | [1082](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
36 |
+
| [llama-7b](https://arxiv.org/abs/2302.13971) | no | | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.105](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.738](https://www.mosaicml.com/blog/mpt-7b) | [0.302](https://www.mosaicml.com/blog/mpt-7b) | | [0.443](https://www.mosaicml.com/blog/mpt-7b) |
|
37 |
+
| [llama-13b](https://arxiv.org/abs/2302.13971) | no | [932](https://lmsys.org/blog/2023-05-03-arena/) | | [0.158](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
38 |
+
| [llama-33b](https://arxiv.org/abs/2302.13971) | no | | | [0.217](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
39 |
+
| [llama-65b](https://arxiv.org/abs/2302.13971) | no | | | [0.237](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | [0.634](https://arxiv.org/abs/2302.13971v1) | |
|
40 |
+
| [mpt-7b](https://huggingface.co/mosaicml/mpt-7b) | yes | | [0.761](https://www.mosaicml.com/blog/mpt-7b) | | [0.702](https://www.mosaicml.com/blog/mpt-7b) | [0.296](https://www.mosaicml.com/blog/mpt-7b) | | [0.343](https://www.mosaicml.com/blog/mpt-7b) |
|
41 |
+
| [oasst-pythia-12b](https://huggingface.co/OpenAssistant/pythia-12b-pre-v8-12.5k-steps) | yes | [1065](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
42 |
+
| [opt-7b](https://huggingface.co/facebook/opt-6.7b) | no | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | | [0.677](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.227](https://www.mosaicml.com/blog/mpt-7b) |
|
43 |
+
| [opt-13b](https://huggingface.co/facebook/opt-13b) | no | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | | [0.692](https://www.mosaicml.com/blog/mpt-7b) | [0.257](https://www.mosaicml.com/blog/mpt-7b) | | [0.282](https://www.mosaicml.com/blog/mpt-7b) |
|
44 |
+
| [palm-540b](https://arxiv.org/abs/2204.02311v5) | no | | [0.834](https://arxiv.org/abs/2204.02311v5) | [0.262](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | [0.779](https://arxiv.org/abs/2204.02311v5) | | [0.693](https://arxiv.org/abs/2204.02311v5) | |
|
45 |
+
| [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) | yes | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | | [0.533](https://www.mosaicml.com/blog/mpt-7b) | [0.251](https://www.mosaicml.com/blog/mpt-7b) | | [0.049](https://www.mosaicml.com/blog/mpt-7b) |
|
46 |
+
| [stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b) | no | [858](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
47 |
+
| [starcoder-base-16b](https://huggingface.co/bigcode/starcoderbase) | yes | | | [0.304](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
48 |
+
| [starcoder-16b](https://huggingface.co/bigcode/starcoder) | yes | | | [0.336](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
49 |
+
| [starcoder-16b (prompted)](https://huggingface.co/bigcode/starcoder) | yes | | | [0.408](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) | | | | |
|
50 |
+
| [vicuna-13b](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | no | [1169](https://lmsys.org/blog/2023-05-03-arena/) | | | | | | |
|
51 |
|
52 |
## Benchmarks
|
53 |
|
54 |
| Benchmark Name | Author | Link | Description |
|
55 |
| ----------------- | ---------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
56 |
| Chatbot Arena Elo | LMSYS | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
|
57 |
+
| HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
|
58 |
| HumanEval | Chen et al. | https://arxiv.org/abs/2107.03374v2 | "It used to measure functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions." (Source: https://paperswithcode.com/dataset/humaneval) |
|
59 |
| LAMBADA | Paperno et al. | https://arxiv.org/abs/1606.06031 | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
|
60 |
| MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots." (Source: "https://paperswithcode.com/dataset/mmlu") |
|
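The Chatbot Arena Elo column relies on the Elo rating system quoted in the Benchmarks table. As a rough illustration only, the sketch below applies the textbook Elo formulas to two ratings taken from the leaderboard. The function names and the K-factor of 32 are assumptions made for this example; LMSYS may compute its published ratings with different settings.

```python
# Minimal sketch of the textbook Elo formulas (not the LMSYS implementation).
# Function names and the default K-factor are illustrative assumptions.


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return both ratings after one battle.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k is an assumed K-factor that controls how far ratings move per battle.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Ratings copied from the leaderboard rows for vicuna-13b and koala-13b.
    vicuna, koala = 1169.0, 1082.0
    print(f"Expected score for vicuna-13b: {expected_score(vicuna, koala):.3f}")
    print("Ratings after a koala-13b win:", update_ratings(vicuna, koala, 0.0))
```

Because the expected score depends only on the rating difference, an 87-point gap such as the one between these two rows implies the higher-rated model is favored but far from certain to win any single battle.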