Llama-3-8B-UltraMedical has achieved top average scores across several popular medical benchmarks.
In these benchmarks, Llama-3-8B-UltraMedical significantly outperforms Flan-PaLM, OpenBioLM-8B, Gemini-1.0, GPT-3.5, and Meditron-70b.
We extend our gratitude to Meta for the Llama model, which provided an excellent foundation for our fine-tuning efforts.

## Usage

### Chat Template

This model utilizes the Llama-3 default chat template without a system prompt.
Below, we provide input examples for multi-choice QA, PubMedQA, and open-ended questions.
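
As a minimal sketch of how such inputs are built (the repo id and the question text are illustrative assumptions), the stock Llama-3 template renders a single user turn like this:

```python
from transformers import AutoTokenizer

# Assumed repo id, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("TsinghuaC3I/Llama-3-8B-UltraMedical")

# One user turn, no system message; for multi-choice QA, the options are
# appended to the question text inside the same turn.
messages = [{"role": "user", "content": "Which enzyme is deficient in classic phenylketonuria?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>
#
# Which enzyme is deficient in classic phenylketonuria?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```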

### Inference with vLLM

```python
from transformers import AutoTokenizer
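from vllm import LLM, SamplingParams

# What follows is a minimal sketch rather than a definitive snippet: the
# repo id is assumed, and the sampling settings mirror the ensembling setup
# described below (temperature=0.7, top_p=0.9).
model_id = "TsinghuaC3I/Llama-3-8B-UltraMedical"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    stop=["<|eot_id|>"],  # Llama-3 end-of-turn token
)

messages = [{"role": "user", "content": "What are the first-line treatments for type 2 diabetes?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```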

In the table above:

- For MedQA, we use the 4 options from the US set. For MedMCQA, we use the Dev split. For PubMedQA, we use the reasoning-required set.
- For MMLU, we include Clinical Knowledge (CK), Medical Genetics (MG), Anatomy (An), Professional Medicine (PM), College Biology (CB), and College Medicine (CM) to maintain consistency with previous studies.
- Greedy search is our default decoding strategy. We denote ensemble scores with self-consistency as `(Ensemble)`: we run 10 decoding trials (temperature=0.7, top_p=0.9) and take the final decision by majority vote, as sketched after this list.
- Partial results for 7B pre-trained models are sourced from the [Open Medical-LLM Leaderboard](https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard).
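
The ensembling step can be sketched as follows; `generate_answer` is a hypothetical wrapper around the vLLM call above, and the answer-extraction regex assumes completions that end with a phrase like "the answer is C":

```python
import re
from collections import Counter

def extract_choice(completion: str):
    """Pull the final answer letter (A-D) out of a completion, if present."""
    match = re.search(r"answer is \(?([A-D])\)?", completion, re.IGNORECASE)
    return match.group(1) if match else None

def ensemble_answer(question: str, n_trials: int = 10) -> str:
    """Self-consistency: sample n_trials completions and majority-vote."""
    votes = []
    for _ in range(n_trials):
        # Hypothetical helper; samples with temperature=0.7, top_p=0.9.
        completion = generate_answer(question, temperature=0.7, top_p=0.9)
        choice = extract_choice(completion)
        if choice is not None:
            votes.append(choice)
    return Counter(votes).most_common(1)[0][0]
```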

## Training Details

This model was trained with full-parameter fine-tuning using the Fully Sharded Data Parallel (FSDP) framework.
Training ran on 8 x A6000 GPUs for about 50 hours.

Hyperparameters:

- torch dtype: bfloat16
- epochs: 3
- learning rate: 2e-5
- learning rate scheduler type: cosine
- warmup ratio: 0.04
- max length: 1024
- global batch size: 128
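
As a loose sketch (not the authors' actual script), these hyperparameters might map onto `transformers.TrainingArguments` as follows; the per-device batch size, gradient accumulation split, and FSDP wrap policy are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-ultramedical",
    bf16=True,                      # torch dtype: bfloat16
    num_train_epochs=3,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.04,
    per_device_train_batch_size=4,  # 4 x 4 grad accumulation x 8 GPUs = 128 global
    gradient_accumulation_steps=4,
    fsdp="full_shard auto_wrap",    # Fully Sharded Data Parallel
)
# The max length of 1024 is applied when tokenizing, not via TrainingArguments.
```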

- **License:** [Meta Llama-3 License](https://llama.meta.com/llama3/license/)
- **Finetuned from model:** [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
- **Finetuned on data:** [UltraMedical](https://github.com/TsinghuaC3I/UltraMedical)

## Limitations & Safe Use

While our model offers promising capabilities, it is crucial to exercise caution when using it in real-world clinical settings because of its potential to hallucinate. Hallucinations, in which the model generates incorrect or misleading information, can pose significant risks in clinical decision-making. Users are advised to validate the model's outputs against trusted medical sources and seek expert consultation to ensure safety and accuracy.
|