Quantization made by Richard Erkhov.
[Github](https://github.com/RichardErkhov)
[Discord](https://discord.gg/pvy7H8DZMG)
[Request more models](https://github.com/RichardErkhov/quant_request)
Memphis-CoT-3B - GGUF
- Model creator: https://huggingface.co/euclaise/
- Original model: https://huggingface.co/euclaise/Memphis-CoT-3B/
| Name | Quant method | Size |
| ---- | ---- | ---- |
| [Memphis-CoT-3B.Q2_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q2_K.gguf) | Q2_K | 1.01GB |
| [Memphis-CoT-3B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_XS.gguf) | IQ3_XS | 1.11GB |
| [Memphis-CoT-3B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_S.gguf) | IQ3_S | 1.17GB |
| [Memphis-CoT-3B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_S.gguf) | Q3_K_S | 1.17GB |
| [Memphis-CoT-3B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_M.gguf) | IQ3_M | 1.23GB |
| [Memphis-CoT-3B.Q3_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K.gguf) | Q3_K | 1.3GB |
| [Memphis-CoT-3B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_M.gguf) | Q3_K_M | 1.3GB |
| [Memphis-CoT-3B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_L.gguf) | Q3_K_L | 1.4GB |
| [Memphis-CoT-3B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ4_XS.gguf) | IQ4_XS | 1.43GB |
| [Memphis-CoT-3B.Q4_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_0.gguf) | Q4_0 | 1.5GB |
| [Memphis-CoT-3B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ4_NL.gguf) | IQ4_NL | 1.51GB |
| [Memphis-CoT-3B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K_S.gguf) | Q4_K_S | 1.51GB |
| [Memphis-CoT-3B.Q4_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K.gguf) | Q4_K | 1.59GB |
| [Memphis-CoT-3B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K_M.gguf) | Q4_K_M | 1.59GB |
| [Memphis-CoT-3B.Q4_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_1.gguf) | Q4_1 | 1.65GB |
| [Memphis-CoT-3B.Q5_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_0.gguf) | Q5_0 | 1.81GB |
| [Memphis-CoT-3B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K_S.gguf) | Q5_K_S | 1.81GB |
| [Memphis-CoT-3B.Q5_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K.gguf) | Q5_K | 1.86GB |
| [Memphis-CoT-3B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K_M.gguf) | Q5_K_M | 1.86GB |
| [Memphis-CoT-3B.Q5_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_1.gguf) | Q5_1 | 1.96GB |
| [Memphis-CoT-3B.Q6_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q6_K.gguf) | Q6_K | 2.14GB |
| [Memphis-CoT-3B.Q8_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q8_0.gguf) | Q8_0 | 2.77GB |
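To help choose among the files above, here is a minimal sketch that picks the largest quant fitting a given size budget and builds a download URL for it. The helper names (`pick_quant`, `download_url`) are hypothetical, the size dictionary is a hand-copied subset of the table, and swapping `blob` for `resolve` in the link is the usual Hugging Face direct-download path rather than something stated in this card. File size is only a rough proxy for memory use, since the context cache needs additional room.

```python
# File sizes in GB, copied by hand from the table above (a subset; extend as needed).
QUANT_SIZES_GB = {
    "Q2_K": 1.01,
    "Q3_K_M": 1.3,
    "Q4_K_M": 1.59,
    "Q5_K_M": 1.86,
    "Q6_K": 2.14,
    "Q8_0": 2.77,
}

REPO = "RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf"


def pick_quant(budget_gb, sizes=QUANT_SIZES_GB):
    """Return the name of the largest quant whose file fits in budget_gb, or None."""
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def download_url(quant, repo=REPO):
    """Build a direct-download URL that follows the file naming used in the table."""
    return f"https://huggingface.co/{repo}/resolve/main/Memphis-CoT-3B.{quant}.gguf"


if __name__ == "__main__":
    choice = pick_quant(1.6)
    print(choice, download_url(choice))
```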
Original model description:
---
license: cc-by-sa-3.0
library_name: transformers
tags:
- supertrainer2000
- human-data
datasets:
- euclaise/TinyCoT
- euclaise/reddit-instruct
- sablo/oasst2_curated
- euclaise/SciCoT
metrics:
- accuracy
base_model: stabilityai/stablelm-3b-4e1t
---
*Now with a training bug fixed!*
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)
Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), [SciCoT](https://huggingface.co/datasets/euclaise/SciCoT), [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title), and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).
**Memphis was trained *only* on human data! No GPT generations here.**
Finetuning was performed using my [supertrainer2000](https://github.com/euclaise/supertrainer2000) framework, using my Adalite optimizer.
## Training Procedure
I finetuned the model using an iterative rationale-bootstrapping procedure inspired by [STaR](https://research.google/pubs/star-self-taught-reasoner-bootstrapping-reasoning-with-reasoning/) and [SPIN](https://arxiv.org/abs/2401.01335).
First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914), for 2 epochs.
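For readers unfamiliar with NEFTune, the sketch below shows the noise it adds to token embeddings during training: uniform noise in [-1, 1] scaled by alpha / sqrt(seq_len * dim), as described in the NEFTune paper. It is a plain-Python illustration on nested lists, not the supertrainer2000 implementation, and the function name `neftune_noise` is hypothetical.

```python
import math
import random


def neftune_noise(embeddings, alpha=10.0, rng=None):
    """Return a noisy copy of a [seq_len][dim] list of embedding vectors.

    Each element receives uniform noise in [-1, 1] scaled by
    alpha / sqrt(seq_len * dim). This is applied at training time only.
    """
    rng = rng or random.Random()
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


if __name__ == "__main__":
    zeros = [[0.0] * 8 for _ in range(4)]
    noisy = neftune_noise(zeros, alpha=10.0, rng=random.Random(0))
    print(max(abs(x) for row in noisy for x in row))
```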
I then performed the following steps 3 times:
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra values are discarded, such that each correct and incorrect response is unique.
2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response. Additionally, a standard CE loss over the chosen completion was included.
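Step 1 can be sketched as follows. This is one possible reading of the pairing rule, assuming that duplicates are dropped and that each response is used in at most one pair; the function name `build_pairs` and the `is_correct` checker are hypothetical, and the actual supertrainer2000 code may differ.

```python
def build_pairs(responses, is_correct):
    """Pair unique correct and incorrect responses generated for one question.

    responses: list of generated strings.
    is_correct: callable str -> bool (a hypothetical answer checker).
    """
    correct, incorrect = [], []
    seen = set()
    for response in responses:
        if response in seen:
            continue  # drop exact duplicates
        seen.add(response)
        (correct if is_correct(response) else incorrect).append(response)
    # zip stops at the shorter list, so surplus responses are discarded.
    return list(zip(correct, incorrect))


if __name__ == "__main__":
    samples = ["so 4", "thus 4", "so 5", "so 4"]
    print(build_pairs(samples, lambda r: r.endswith("4")))
```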
This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).
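The objective in step 2 might look roughly like the sketch below. It assumes a two-candidate softmax ranking term over length-normalized log-probabilities plus a cross-entropy term on the chosen completion, combined with the rank loss weight of 0.25 listed under Hyperparameters. The exact form used in training is not given here, so the function names and the way the terms are combined are assumptions.

```python
import math


def length_normalized_logprob(token_logprobs):
    """Mean per-token log-probability of a sequence."""
    return sum(token_logprobs) / len(token_logprobs)


def rank_loss(chosen_logprobs, rejected_logprobs):
    """-log softmax of the chosen score over the two candidates."""
    s_chosen = length_normalized_logprob(chosen_logprobs)
    s_rejected = length_normalized_logprob(rejected_logprobs)
    diff = s_rejected - s_chosen
    # Numerically stable softplus: log(1 + exp(diff)).
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def total_loss(chosen_logprobs, rejected_logprobs, rank_weight=0.25):
    """Cross-entropy on the chosen completion plus the weighted ranking term."""
    ce = -sum(chosen_logprobs) / len(chosen_logprobs)
    return ce + rank_weight * rank_loss(chosen_logprobs, rejected_logprobs)


if __name__ == "__main__":
    print(total_loss([-0.1, -0.2, -0.1], [-1.5, -2.0]))
```

Because the scores are length-normalized, a longer response is not penalized merely for containing more tokens.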
To prevent excessive drift, I kept the model weights as a moving average: After each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.
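A minimal sketch of SLERP between two flattened weight vectors is shown below. Whether the factor of 0.99 weights the previous or the updated weights, and whether interpolation is done per tensor or over the whole model, is not specified above; in this sketch `t=0` returns the first vector and `t=1` the second, and both vectors are assumed to be non-zero.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero flat vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


if __name__ == "__main__":
    previous = [1.0, 0.0]
    updated = [0.0, 1.0]
    print(slerp(previous, updated, 0.5))
```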
## Prompt formats
The format for reddit-instruct and oasst2 was:
```
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```
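A small helper for building prompts in this format might look like the following. The function name `format_chat` is hypothetical, and the exact whitespace (single newlines, no blank lines between turns) is an assumption based on the template as shown.

```python
def format_chat(turns, add_generation_prompt=True):
    """Render (role, text) turns in the "### User:" / "### Assistant:" format.

    turns: list of (role, text) tuples, with role "User" or "Assistant".
    add_generation_prompt: if True, end with an open "### Assistant:" header.
    """
    lines = []
    for role, text in turns:
        if role not in ("User", "Assistant"):
            raise ValueError(f"unknown role: {role!r}")
        lines.append(f"### {role}:")
        lines.append(text)
    if add_generation_prompt:
        lines.append("### Assistant:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_chat([("User", "What is the capital of France?")]))
```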
The format for TinyCoT was:
```
### User:
[insert instruction here]
### Rationale:
[insert reasoning here]
### Answer:
[insert direct answer here]
```
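To split a completion in this format into its rationale and answer, a sketch like the one below could be used. It returns `None` when either part is missing, which mirrors the rule of discarding generations without a valid rationale and answer. The function name `parse_cot` and the exact whitespace handling are assumptions.

```python
import re

# Matches the rationale and answer sections of a TinyCoT-style completion.
_COT_PATTERN = re.compile(
    r"### Rationale:\n(?P<rationale>.*?)\n### Answer:\n(?P<answer>.*)",
    re.DOTALL,
)


def parse_cot(completion):
    """Return (rationale, answer), or None if either part is missing or empty."""
    match = _COT_PATTERN.search(completion)
    if match is None:
        return None
    rationale = match.group("rationale").strip()
    answer = match.group("answer").strip()
    if not rationale or not answer:
        return None
    return rationale, answer


if __name__ == "__main__":
    text = "### Rationale:\n2 + 2 equals 4.\n### Answer:\n4"
    print(parse_cot(text))
```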
## Benchmarks
| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------ |
| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t) | 3B | Base | Base | 2.05% | 25.14% | 36.75% |
| [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% | **37.28%** |
| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
| [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | possibly contaminated (45.72%) | **33.31%** | 0.91% |
| [LIMA LLaMA 2 7B](https://huggingface.co/heegyu/LIMA2-7b-hf) | **7B** | **Human** | SFT | 4.55% | 24.55% | 36.29% |
| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **18.8%** | *27.22%* | *36.92%* |
*5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0
Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its size, and trades blows with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
Note that BBH results have wide SEs, sometimes even exceeding 16%.
It is unclear why Zephyr performs so poorly on BBH. Perhaps it is overfit, or maybe there was an issue with vllm.
Notes:
- Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), using the `vllm` model.
- I tried to find human-data-trained StableLM models, but couldn't find any. I did find a few OpenLLaMA models, but they wouldn't load with LM Eval Harness and vllm. (I believe this can be fixed by changing the xformers backend, but I'm too lazy for that)
- OpenLLaMA 7B v2 open-instruct is a particularly relevant comparison, as it was trained on a *very* similar dataset.
## Hyperparameters
For the initial supervised finetuning step:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda (Adalite's analogue to weight decay, see [here](https://arxiv.org/abs/2103.06583) for details) of 0.01
- LR of 1e-5
- MixCE ratio of 0.75
- Sequence length of 4096
- Cosine decay with a 20% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10
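The learning-rate schedule for this step can be illustrated with the sketch below, which assumes a linear warmup to the peak LR followed by cosine decay to zero. The warmup shape, the final LR, and the function name `lr_at` are assumptions; only the peak LR of 1e-5 and the 20% warmup fraction come from the list above.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.2, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 19, 20, 60, 100):
        print(step, lr_at(step, total_steps=100))
```

The same function covers the rank finetuning schedule by passing `peak_lr=5e-7` and `warmup_frac=0.1`.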
For the generations:
- Generated using the current git version of `vllm`
- N=8
- Temperature of 0.5
- `top_p` of 0.8
- Maximum of 512 generated tokens, discarding responses that do not have a valid rationale and answer
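The sampling settings above can be collected as a configuration like the one below. The keys match keyword arguments accepted by `vllm.SamplingParams` (`n`, `temperature`, `top_p`, `max_tokens`); the helper `make_sampling_params` is hypothetical and simply returns the plain dictionary when `vllm` is not installed.

```python
# Sampling settings taken from the list above.
GENERATION_CONFIG = {
    "n": 8,             # responses sampled per question
    "temperature": 0.5,
    "top_p": 0.8,
    "max_tokens": 512,  # cap on generated tokens per response
}


def make_sampling_params(config=GENERATION_CONFIG):
    """Return vllm.SamplingParams if vllm is available, else a copy of the dict."""
    try:
        from vllm import SamplingParams
    except ImportError:
        return dict(config)
    return SamplingParams(**config)


if __name__ == "__main__":
    print(make_sampling_params())
```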
For the rank finetuning:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda of 0.01
- LR of 5e-7
- Rank loss weight of 0.25
- Sequence length of 1024
- Cosine schedule with 10% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10
Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more quants, at much higher speed, than I would otherwise be able to.