weiqipedia committed
Commit 19d6eec · verified · 1 Parent(s): 0b7422c

Update README.md

Files changed (1): README.md (+1, -4)

README.md CHANGED
@@ -102,12 +102,10 @@ IFEval evaluates a model's ability to adhere to constraints provided in the prompt.
 | Meta-Llama-3-8B-Instruct | 0.27 | 0.21 | 0.80 |
 | Sailor-7B-Chat | 0.26 | 0.25 | 0.42 |
 
-Note: Scores are the language normalized accuracies ie. models are penalized when they respond in the incorrect language even if they may follow the instructions correctly.
-
 
 **MT-Bench**
 
-MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the win rate against the baseline model. A tie is given a score of 0.5.
+MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model, i.e. the average of the win rates across the categories (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction). A tie is given a score of 0.5.
 
 | **Model** | **Indonesian** | **Vietnamese** | **English** |
 |---------------------------------|:---------------------:|:---------------------:|:----------------------:|
@@ -121,7 +119,6 @@ MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs.
 | Mistral-7B-Instruct-v0.3 | 0.347 | 0.202 | 0.524 |
 | Sailor-7B-Chat | 0.290 | 0.314 | 0.190 |
 
-Note: Scores are the Weighted Win Rate across reasoning, stem, math, humanities, extraction, writing, roleplay.
 
 ### Usage
 SEA-LION can be run using the 🤗 Transformers library
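
The weighted win rate that the updated README describes can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: it assumes per-category judge verdicts in which a win counts 1, a tie counts 0.5 (as stated above), and a loss counts 0, and that "weighted" means each category contributes equally regardless of how many questions it holds.

```python
# Illustrative sketch of the MT-Bench weighted win rate described above.
# Category names come from the README; verdict encoding is an assumption.

CATEGORIES = ["math", "reasoning", "stem", "humanities",
              "roleplay", "writing", "extraction"]

def win_rate(verdicts):
    """verdicts: list of 'win' | 'tie' | 'loss' against the baseline model.
    A win scores 1, a tie scores 0.5, a loss scores 0."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[v] for v in verdicts) / len(verdicts)

def weighted_win_rate(per_category):
    """per_category: dict mapping each category to its list of verdicts.
    Averages the per-category win rates so every category has equal
    weight, regardless of how many questions it contains."""
    rates = [win_rate(per_category[c]) for c in CATEGORIES]
    return sum(rates) / len(rates)
```

Averaging per-category rates (rather than pooling all verdicts) keeps a category with many questions from dominating the overall score.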