---
base_model: internistai/base-7b-v0.2
datasets:
- omi-health/medical-dialogue-to-soap-summary
language:
- en
license: apache-2.0
metrics:
- accuracy
tags:
- medical
- mlx
pipeline_tag: text-generation
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/651d96a3e8c4c2ebaafc1e7d/uyiryuBhU4y62f4CRxabO.png)
The model [cogbuji/MrGrammaticaOntology-internistai-SCT-DRIFT-clinical-problem-0.6.5](https://huggingface.co/cogbuji/MrGrammaticaOntology-internistai-SCT-DRIFT-clinical-problem-0.6.5) was converted to MLX format from [internistai/base-7b-v0.2](https://huggingface.co/internistai/base-7b-v0.2) using mlx-lm version **0.16.0**.
The name of the model is a homage to Fela Kuti's song __Mr Grammarticalogy-Lisationalism Is The Boss__, released on the B-side of his 1976 LP [Excuse O](https://www.discogs.com/release/3149841-Fela-And-The-Africa-70-Excuse-O).
It is an experimental model intended for non-production environments, inspired by explorations into how large language models can be trained to be more conversant in medical terminology and concepts, and how they can be used in various medical informatics scenarios.
It is a LoRA finetune of [internistai/base-7b-v0.2](https://huggingface.co/internistai/base-7b-v0.2) using controlled natural language (CNL) phrases generated from the September 2023 release of [SNOMED CT United States Edition](https://www.snomed.org/snomed-ct/Use-SNOMED-CT). The general idea is described in [Reference Domain Ontologies and Large Medical Language Models](https://www.slideshare.net/slideshow/reference-domain-ontologies-and-large-medical-language-modelspptx/267024290).
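To make the approach concrete, the sketch below shows one hypothetical way a CNL phrase could be rendered from a SNOMED CT IS-A relationship. The `verbalize_isa` helper, the phrase template, and the concept pair are illustrative assumptions, not the actual DRIFT generation code or dataset content.

```python
# Hypothetical sketch of CNL verbalization from a SNOMED CT IS-A pair.
# The template and concept names are illustrative only; the real DRIFT
# phrase-generation pipeline is not reproduced here.

def verbalize_isa(child: str, parent: str) -> str:
    """Render a SNOMED CT IS-A relationship as an English CNL phrase."""
    return f"Every {child.lower()} is a kind of {parent.lower()}."

print(verbalize_isa("Myocardial infarction", "Ischemic heart disease"))
# -> "Every myocardial infarction is a kind of ischemic heart disease."
```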
During training, LoRA was applied to all linear layers using a dataset comprising 318,798 SNOMED CT DRIFT phrases from the SNOMED CT [concept hierarchies](https://nhsengland.kahootz.com/gf2.ti/f/762498/152743141.1/PDF/-/SNOMED%20Implementation_User%20Guide_Hierarchies.pdf) relevant to medical problems (findings, morphologic abnormalities, situations with explicit context, and disorders) and 7,400 records from the Synthetic Medical Dialogues and SOAP Summaries [dataset](https://huggingface.co/datasets/omi-health/medical-dialogue-to-soap-summary). The training ran for two days, 13 hours, and 55 minutes using [mlx-tuning-fork](https://github.com/chimezie/mlx-tuning-fork), a framework for parameterized (Q)LoRA fine-tuning of large language models on Apple Metal.
Below is a snippet of the configuration used (the format has changed over time):
```yaml
lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.v_proj", "self_attn.k_proj", "self_attn.o_proj"]
  rank: 32
  alpha: 32
  dropout: 0.3205
  scale: 10.0
epochs: 2
learning_schedule:
  type: "cosine_w_warmup"
  warmup_proportion: .1
  min_lr: 1e-7
  cycle_length: -1
  min_cos_lr: 7e-6
```
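For intuition, here is a minimal sketch of how a cosine-with-warmup schedule of the kind configured above is commonly computed. It is a generic illustration, not mlx-tuning-fork's actual implementation, and the `base_lr` value is an assumption since it does not appear in the snippet.

```python
import math

def cosine_w_warmup(step, total_steps, base_lr, min_lr=1e-7,
                    min_cos_lr=7e-6, warmup_proportion=0.1):
    """Generic cosine schedule with linear warmup (illustrative only)."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Linear ramp from min_lr up to base_lr during warmup.
        return min_lr + (base_lr - min_lr) * step / max(warmup_steps, 1)
    # Cosine decay from base_lr down to min_cos_lr after warmup.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_cos_lr + 0.5 * (base_lr - min_cos_lr) * (1 + math.cos(math.pi * progress))

# e.g., the 79,700 total iterations reported in the wandb summary below;
# base_lr here is a placeholder, not a value from the training config.
print(cosine_w_warmup(step=0, total_steps=79_700, base_lr=2e-5))
```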
A summary from the Weights & Biases (wandb) log is below:
> 79,700 iterations (2 epochs × 39,850 iterations per epoch) over a dataset of 318,798 records, with a batch size of 8.
## MMLU-SR benchmarks
Below are before-and-after [MMLU-SR benchmark](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlusr) scores for the MMLU medical topics listed below. MMLU-SR is a dataset used by the LM Evaluation Harness for rigorous benchmarking of true model comprehension.
### Before (unquantized internistai/base-7b-v0.2, lm-eval run on Apple Metal)
> hf (pretrained=internistai/base-7b-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge | 0|none | 0|acc |↑ |0.5019|± |0.0308|
|professional medicine| 0|none | 0|acc |↑ |0.5441|± |0.0303|
### After (unquantized fine-tuned model, lm-eval run on Apple Metal)
> hf (pretrained=../raw_models/outbox/MrGrammaticaOntology-internistai-SCT-DRIFT-clinical-problem-0.6.5,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge | 0|none | 0|acc |↑ |0.5208|± |0.0307|
|professional medicine| 0|none | 0|acc |↑ |0.5625|± |0.0301|
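The runs above can be reproduced with the LM Evaluation Harness's Python API along the lines of the sketch below. The MMLU-SR task identifier used here is an assumption; check `lm-eval --tasks list` for the exact names in your installed version of the harness.

```python
import lm_eval

# Evaluate the base model on MMLU-SR tasks (task name is an assumption;
# verify it against your harness version's task list).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=internistai/base-7b-v0.2,dtype=float",
    tasks=["mmlusr"],
    batch_size=64,
)
print(results["results"])
```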
## Use with mlx
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the converted MLX weights and tokenizer from the Hugging Face Hub
model, tokenizer = load("cogbuji/MrGrammaticaOntology-internistai-SCT-DRIFT-clinical-problem-0.6.5")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```
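If the bundled tokenizer provides a chat template (a reasonable assumption for this instruction-tuned lineage, but worth verifying), prompts can be wrapped in it before generation. The example message below is illustrative.

```python
from mlx_lm import load, generate

model, tokenizer = load("cogbuji/MrGrammaticaOntology-internistai-SCT-DRIFT-clinical-problem-0.6.5")

# Wrap the user message in the model's chat template, if one is defined.
messages = [{"role": "user", "content": "Summarize this encounter as a SOAP note."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```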