Jae-Won Chung committed
Commit d846882 · 1 Parent(s): 511ed5e
Better About tab
LEADERBOARD.md CHANGED +16 -12
@@ -1,14 +1,23 @@
 The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.
 
+The code for the leaderboard, backing data, and scripts for benchmarking are all open-source in our [repository](https://github.com/ml-energy/leaderboard).
+We'll see you at the [Discussion board](https://github.com/ml-energy/leaderboard/discussions), where you can ask questions, suggest improvement ideas, or just discuss leaderboard results!
+
 ## Columns
 
-- `gpu`: NVIDIA GPU model name.
+- `gpu`: NVIDIA GPU model name.
 - `task`: Name of the task. See *Tasks* below for details.
 - `energy` (J): The average GPU energy consumed by the model to generate a response.
 - `throughput` (token/s): The average number of tokens generated per second.
 - `latency` (s): The average time it took for the model to generate a response.
 - `response_length` (token): The average number of tokens in the model's response.
 - `parameters`: The number of parameters the model has, in units of billion.
+- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset. Measures capability to do grade-school level question answering, 25 shot.
+- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag). Measuring grounded commonsense, 10 shot.
+- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958). Measuring truthfulness against questions that elicit common falsehoods, 0 shot.
+
+NLP evaluation metrics (`arc`, `hellaswag`, and `truthfulqa`) were only run once each on A40 GPUs because their results do not depend on the GPU type.
+Hence, all GPU model rows for the same model share the same NLP evaluation numbers.
 
 ## Tasks
 
@@ -39,6 +48,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 
 - NVIDIA A40 GPU
 - NVIDIA A100 GPU
+- NVIDIA V100 GPU
 
 ### Parameters
 
@@ -50,17 +60,11 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 - Temperature 0.7
 - Repetition penalty 1.0
 
-
+### Data
 
 We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
 See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
 
-## NLP evaluation metrics
-
-- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
-- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
-- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
-
 ## Limitations
 
 Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.
@@ -68,18 +72,18 @@ Hence, absolute latency, throughput, and energy numbers should not be used to es
 
 ## Upcoming
 
-- Within the Summer, we'll add an
+- Within the Summer, we'll add an online text generation interface for real time energy consumption measurement!
 - More optimized inference runtimes, like TensorRT.
 - Larger models with distributed inference, like Falcon 40B.
 - More models, like RWKV.
 
-
+## License
 
 This leaderboard is a research preview intended for non-commercial use only.
 Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses.
 The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
 Please direct inquiries/reports of potential violation to Jae-Won Chung.
 
-
+## Acknowledgements
 
-We thank [Chameleon Cloud](https://www.chameleoncloud.org/)
+We thank [Chameleon Cloud](https://www.chameleoncloud.org/) and [CloudLab](https://cloudlab.us/) for the GPU nodes.