ABOUT_TEXT = """
We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
A win is counted when the reward model scores the chosen response higher than the rejected response.

### Subset summary

| Subset                 | Num. Samples (pre-filtering, post-filtering) | Description                                                       |
| :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
| alpacaeval-easy        |                     805, 100                     | Great model vs poor model                                         |
| alpacaeval-length      |                     805, 95                     | Good model vs low-quality model, equal length                     |
| alpacaeval-hard        |                     805, 95                     | Great model vs baseline model                                     |
| mt-bench-easy          |                  28, 28                    | MT Bench 10s vs 1s                                                |
| mt-bench-medium        |                  45, 40                    | MT Bench 9s vs 2-5s                                               |
| mt-bench-hard          |                  45, 37                    | MT Bench 7-8 vs 5-6                                               |
| refusals-dangerous     |                     505, 100                     | Dangerous response vs no response                                 |
| refusals-offensive     |                     704, 100                     | Offensive response vs no response                                 |
| llmbar-natural         |                     100                     | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor  |                     134                     | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst   |                     92                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut    |                     47                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual    |                     46                      | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| XSTest | 450, 404         | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263))        |        
| do not answer | 939, 136         | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer)        |       
| hep-cpp | 164         | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124))        |        
| hep-go | 164         | Go code revisions |
| hep-java | 164         | Java code revisions |
| hep-js | 164         | JavaScript code revisions |
| hep-python | 164         | Python code revisions |
| hep-rust | 164         | Rust code revisions |


For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
"""