Llama-3-8B-UltraMedical has achieved top average scores across several popular medical benchmarks.
In these benchmarks, Llama-3-8B-UltraMedical significantly outperforms Flan-PaLM, OpenBioLM-8B, Gemini-1.0, GPT-3.5, and Meditron-70b.
We extend our gratitude to Meta for the Llama model, which provided an excellent foundation for our fine-tuning efforts.

## Usage

### Chat Template

This model utilizes the Llama-3 default chat template without a system prompt.
Below, we provide input examples for multi-choice QA, PubMedQA, and open-ended questions.
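
As a minimal sketch of how such inputs are built (the repo id and the question text are illustrative assumptions), the stock Llama-3 template renders a single user turn like this:

```python
from transformers import AutoTokenizer

# Assumed repo id, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("TsinghuaC3I/Llama-3-8B-UltraMedical")

# One user turn, no system message; for multi-choice QA, the options are
# appended to the question text inside the same turn.
messages = [{"role": "user", "content": "Which enzyme is deficient in classic phenylketonuria?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>
#
# Which enzyme is deficient in classic phenylketonuria?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```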

### Inference with vLLM

```python
from transformers import AutoTokenizer
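from vllm import LLM, SamplingParams

# What follows is a minimal sketch rather than a definitive snippet: the
# repo id is assumed, and the sampling settings mirror the ensembling setup
# described below (temperature=0.7, top_p=0.9).
model_id = "TsinghuaC3I/Llama-3-8B-UltraMedical"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    stop=["<|eot_id|>"],  # Llama-3 end-of-turn token
)

messages = [{"role": "user", "content": "What are the first-line treatments for type 2 diabetes?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```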

In the table above:

- For MedQA, we use the 4 options from the US set. For MedMCQA, we use the Dev split. For PubMedQA, we use the reasoning-required set.
- For MMLU, we include Clinical Knowledge (CK), Medical Genetics (MG), Anatomy (An), Professional Medicine (PM), College Biology (CB), and College Medicine (CM) to maintain consistency with previous studies.
- Greedy search is our default decoding strategy. We denote ensemble scores with self-consistency as `(Ensemble)`: we run 10 decoding trials (temperature=0.7, top_p=0.9) and take the final decision by majority vote, as sketched after this list.
- Partial results for 7B pre-trained models are sourced from the [Open Medical-LLM Leaderboard](https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard).
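
The ensembling step can be sketched as follows; `generate_answer` is a hypothetical wrapper around the vLLM call above, and the answer-extraction regex assumes completions that end with a phrase like "the answer is C":

```python
import re
from collections import Counter

def extract_choice(completion: str):
    """Pull the final answer letter (A-D) out of a completion, if present."""
    match = re.search(r"answer is \(?([A-D])\)?", completion, re.IGNORECASE)
    return match.group(1) if match else None

def ensemble_answer(question: str, n_trials: int = 10) -> str:
    """Self-consistency: sample n_trials completions and majority-vote."""
    votes = []
    for _ in range(n_trials):
        # Hypothetical helper; samples with temperature=0.7, top_p=0.9.
        completion = generate_answer(question, temperature=0.7, top_p=0.9)
        choice = extract_choice(completion)
        if choice is not None:
            votes.append(choice)
    return Counter(votes).most_common(1)[0][0]
```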

## Training Details

This model was trained with full-parameter fine-tuning using the Fully Sharded Data Parallel (FSDP) framework.
Training ran on 8 x A6000 GPUs for about 50 hours.

Hyperparameters:

- torch dtype: bfloat16
- epochs: 3
- learning rate: 2e-5
- learning rate scheduler type: cosine
- warmup ratio: 0.04
- max length: 1024
- global batch size: 128
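
As a loose sketch (not the authors' actual script), these hyperparameters might map onto `transformers.TrainingArguments` as follows; the per-device batch size, gradient accumulation split, and FSDP wrap policy are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-ultramedical",
    bf16=True,                      # torch dtype: bfloat16
    num_train_epochs=3,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.04,
    per_device_train_batch_size=4,  # 4 x 4 grad accumulation x 8 GPUs = 128 global
    gradient_accumulation_steps=4,
    fsdp="full_shard auto_wrap",    # Fully Sharded Data Parallel
)
# The max length of 1024 is applied when tokenizing, not via TrainingArguments.
```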

- **License:** [Meta Llama-3 License](https://llama.meta.com/llama3/license/)
- **Finetuned from model:** [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
- **Finetuned on data:** [UltraMedical](https://github.com/TsinghuaC3I/UltraMedical)

## Limitations & Safe Use

While our model offers promising capabilities, it is crucial to exercise caution when using it in real-world clinical settings because of its potential to hallucinate. Hallucinations, in which the model generates incorrect or misleading information, can pose significant risks in clinical decision-making. Users are advised to validate the model's outputs against trusted medical sources and seek expert consultation to ensure safety and accuracy.
|