Twice-KoSOLAR-16.1B-test

Model Details

Model Developers Kyujin Han (kyujinpy)

๋ชจ๋ธ ๋ชฉ์ 

Recently, the SOLAR-10.7B model has shown strong performance on the LLM leaderboard by introducing the Depth-Up-Scaling methodology (pictured above). In addition, the seungduk/KoSOLAR-10.7B-v0.1 model built by Yanolja has had a large impact on the Ko-LLM leaderboard, and it is expected to change the direction of the leaderboard going forward.

This made me curious. The Depth-Up-Scaling (DUS) methodology published by Upstage merges two copies of the mistral-7B model (a passthrough merge).
Surprisingly, upstage/SOLAR-10.7B-v1.0, which applies DUS, scored higher on the leaderboard than the original mistral-7B model. (See the table below.)
So I really wanted to know whether applying DUS to other models, without restriction, would produce the same result. 🙃 Through this experiment, I want to reach a conclusion about that question. 😋😋

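| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|-----------------------------|------:|------:|------:|------:|------:|------:|------:|
| seungduk/KoSOLAR-10.7B-v0.1 | 66.04 | 62.03 | 84.54 | 65.56 | 45.03 | 83.58 | 55.50 |
| upstage/SOLAR-10.7B-v1.0    | 66.04 | 61.95 | 84.60 | 65.48 | 45.04 | 83.66 | 55.50 |
| mistralai/Mistral-7B-v0.1   | 60.97 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 37.83 |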

Follow up as En-link.

Method
Using Mergekit.

Merge config
In the original SOLAR-10.7B paper, the 32 mistral-7B layers were split into 24 layers and 8 layers, and two 24-layer copies were merged to build a total of 48 layers.
Since that ratio is uses:waste = 3:1, I split the 48 layers of seungduk/KoSOLAR-10.7B-v0.1 in the same ratio into 36 layers and 12 layers, and merged two 36-layer copies to build a total of 72 layers.
The detailed merge config is as follows.

slices:
  - sources:
    - model: seungduk/KoSOLAR-10.7B-v0.1
      layer_range: [0, 36]
  - sources:
    - model: seungduk/KoSOLAR-10.7B-v0.1
      layer_range: [12, 48]
merge_method: passthrough
dtype: float16
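
The two `layer_range` values follow directly from the 3:1 ratio described above. As a minimal sketch of that arithmetic (the helper name `dus_layer_ranges` is hypothetical and is not part of Mergekit), the slices can be derived from the layer count of the base model:

```python
def dus_layer_ranges(num_layers: int, keep_ratio: float = 0.75):
    """Return the two overlapping layer ranges for a DUS-style passthrough merge.

    Hypothetical helper for illustration only: it reproduces the arithmetic
    described in this card and does not call Mergekit.
    """
    kept = round(num_layers * keep_ratio)  # layers used from each copy
    dropped = num_layers - kept            # layers discarded from each copy
    first = [0, kept]                      # first copy drops its last `dropped` layers
    second = [dropped, num_layers]         # second copy drops its first `dropped` layers
    return first, second, 2 * kept         # 2 * kept = depth of the merged model


# mistral-7B (32 layers) -> SOLAR-10.7B (48 layers)
print(dus_layer_ranges(32))
# KoSOLAR-10.7B (48 layers) -> this model (72 layers)
print(dus_layer_ranges(48))
```

For 48 layers this gives `[0, 36]` and `[12, 48]`, the same ranges as in the config above.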

Share everything. That is my belief.

Model Benchmark

Open Ko-LLM leaderboard & lm-evaluation-harness(zero-shot)

gpt2 (pretrained=PracticeLLM/Twice-KoSOLAR-16.1B-test), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|      Task      |Version| Metric |Value |   |Stderr|
|----------------|------:|--------|-----:|---|-----:|
|kobest_boolq    |      0|acc     |0.7201|±  |0.0120|
|                |       |macro_f1|0.7073|±  |0.0124|
|kobest_copa     |      0|acc     |0.6510|±  |0.0151|
|                |       |macro_f1|0.6506|±  |0.0151|
|kobest_hellaswag|      0|acc     |0.4520|±  |0.0223|
|                |       |acc_norm|0.5820|±  |0.0221|
|                |       |macro_f1|0.4475|±  |0.0222|
|kobest_sentineg |      0|acc     |0.7078|±  |0.0229|
|                |       |macro_f1|0.7071|±  |0.0229|

gpt2 (pretrained=Megastudy/M-SOLAR-10.7B-v1.1-beta), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|      Task      |Version| Metric |Value |   |Stderr|
|----------------|------:|--------|-----:|---|-----:|
|kobest_boolq    |      0|acc     |0.7137|±  |0.0121|
|                |       |macro_f1|0.6878|±  |0.0128|
|kobest_copa     |      0|acc     |0.7060|±  |0.0144|
|                |       |macro_f1|0.7054|±  |0.0145|
|kobest_hellaswag|      0|acc     |0.4620|±  |0.0223|
|                |       |acc_norm|0.5360|±  |0.0223|
|                |       |macro_f1|0.4595|±  |0.0223|
|kobest_sentineg |      0|acc     |0.7431|±  |0.0220|
|                |       |macro_f1|0.7295|±  |0.0230|

gpt2 (pretrained=jjourney1125/M-SOLAR-10.7B-v1.0), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|      Task      |Version| Metric |Value |   |Stderr|
|----------------|------:|--------|-----:|---|-----:|
|kobest_boolq    |      0|acc     |0.5228|±  |0.0133|
|                |       |macro_f1|0.3788|±  |0.0097|
|kobest_copa     |      0|acc     |0.6860|±  |0.0147|
|                |       |macro_f1|0.6858|±  |0.0147|
|kobest_hellaswag|      0|acc     |0.4580|±  |0.0223|
|                |       |acc_norm|0.5380|±  |0.0223|
|                |       |macro_f1|0.4552|±  |0.0222|
|kobest_sentineg |      0|acc     |0.6474|±  |0.0240|
|                |       |macro_f1|0.6012|±  |0.0257|

gpt2 (pretrained=yanolja/KoSOLAR-10.7B-v0.1), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|      Task      |Version| Metric |Value |   |Stderr|
|----------------|------:|--------|-----:|---|-----:|
|kobest_boolq    |      0|acc     |0.8725|±  |0.0089|
|                |       |macro_f1|0.8722|±  |0.0089|
|kobest_copa     |      0|acc     |0.6850|±  |0.0147|
|                |       |macro_f1|0.6844|±  |0.0147|
|kobest_hellaswag|      0|acc     |0.4340|±  |0.0222|
|                |       |acc_norm|0.5840|±  |0.0221|
|                |       |macro_f1|0.4296|±  |0.0221|
|kobest_sentineg |      0|acc     |0.7506|±  |0.0217|
|                |       |macro_f1|0.7505|±  |0.0217|
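
To read the tables side by side, the small sketch below computes the per-task difference in zero-shot `acc` between this merged model and yanolja/KoSOLAR-10.7B-v0.1 (assumed here to be the same base model as the seungduk/KoSOLAR-10.7B-v0.1 used in the merge). The numbers are copied from the tables above; the variable and function names are only illustrative.

```python
# Zero-shot accuracy (acc) copied from the tables above.
merged = {"kobest_boolq": 0.7201, "kobest_copa": 0.6510,
          "kobest_hellaswag": 0.4520, "kobest_sentineg": 0.7078}
base = {"kobest_boolq": 0.8725, "kobest_copa": 0.6850,
        "kobest_hellaswag": 0.4340, "kobest_sentineg": 0.7506}


def acc_deltas(candidate, reference):
    """Per-task accuracy difference (candidate minus reference), rounded to 4 decimals."""
    return {task: round(candidate[task] - reference[task], 4) for task in candidate}


deltas = acc_deltas(merged, base)
for task, delta in deltas.items():
    print(f"{task:<17} {delta:+.4f}")
```

On these numbers the merged model is lower on kobest_boolq, kobest_copa, and kobest_sentineg, and slightly higher on kobest_hellaswag `acc`.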

Open EN-LLM leaderboard & lm-evaluation-harness(zero-shot)

(will update)

Implementation Code

### Twice-KoSOLAR-16.1B-test
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Repository ID as used in the benchmark runs above
repo = "PracticeLLM/Twice-KoSOLAR-16.1B-test"
model = AutoModelForCausalLM.from_pretrained(
        repo,
        return_dict=True,
        torch_dtype=torch.float16,  # same dtype as in the merge config
        device_map='auto'           # requires the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained(repo)

References (Model Card)

yanolja/KoSOLAR-10.7B-v0.1

This model is a Korean vocabulary-extended version of upstage/SOLAR-10.7B-v1.0, trained on various Korean web-crawled datasets that are publicly available on HuggingFace. The hypothesis was that while maintaining the original performance of the base model, we could add more tokens to the base model's vocabulary by training the embeddings for the new tokens only. The evaluation results seem to indicate that both English and Korean performances were preserved.

Model Description

Most parameters of upstage/SOLAR-10.7B-v1.0 were frozen except for the embed_tokens layer and the lm_head layer. Embeddings for the existing tokens in those layers were frozen during training. The embeddings for the new tokens have been tuned.
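
The effect of tuning only the new-token embeddings can be illustrated with a framework-free toy update. This is a minimal sketch, not the training code used for KoSOLAR: plain SGD, the learning rate, and the name `masked_sgd_step` are all assumptions made for illustration.

```python
def masked_sgd_step(embeddings, grads, old_vocab_size, lr=0.1):
    """Apply one SGD step only to the rows of newly added tokens.

    Rows with index < old_vocab_size (the original vocabulary) are left unchanged.
    """
    updated = []
    for idx, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if idx < old_vocab_size:
            updated.append(list(row))  # frozen: existing token
        else:
            updated.append([w - lr * g for w, g in zip(row, grad_row)])  # tuned: new token
    return updated


# Toy example: 2 existing tokens + 1 new token, embedding dimension 2.
emb = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
grad = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
new_emb = masked_sgd_step(emb, grad, old_vocab_size=2)
print(new_emb)
```

Only the last row changes, which mirrors the idea that the base model's behavior on existing tokens is preserved while the added vocabulary is learned.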

Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!

Introduction

We introduce SOLAR-10.7B, an advanced large language model (LLM) with 10.7 billion parameters, demonstrating superior performance in various natural language processing (NLP) tasks. It's compact, yet remarkably powerful, and demonstrates unparalleled state-of-the-art performance in models with parameters under 30B.

We present a methodology for scaling LLMs called depth up-scaling (DUS), which encompasses architectural modifications and continued pretraining. In other words, we integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.

SOLAR-10.7B has remarkable performance. It outperforms models with up to 30B parameters, even surpassing the recent Mixtral 8X7B model. For detailed information, please refer to the experimental table. Solar 10.7B is an ideal choice for fine-tuning. SOLAR-10.7B offers robustness and adaptability for your fine-tuning needs. Our simple instruction fine-tuning using the SOLAR-10.7B pre-trained model yields significant performance improvements (SOLAR-10.7B-Instruct-v1.0).

For full details of this model please read our paper.
