haijunlv committed
Commit 4e13d8e · verified
1 Parent(s): 5e4eefa
Files changed (1)
  1. README.md +30 -5
README.md CHANGED
@@ -86,7 +86,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
  model_dir = "internlm/internlm3-8b-instruct"
  tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
  # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
- # model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
  # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@@ -108,6 +108,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
@@ -153,6 +155,10 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
  #### vLLM inference

  We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
@@ -280,6 +286,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
@@ -308,6 +316,10 @@ response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
  print(response)
  ```
  #### vLLM inference

  We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
@@ -345,7 +357,7 @@ print(outputs)
  ## Open Source License

- The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <[email protected]>.

  ## Citation
@@ -369,7 +381,7 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac
  InternLM3,即书生·浦语大模型第3代,开源了80亿参数,面向通用使用与高阶推理的指令模型(InternLM3-8B-Instruct)。模型具备以下特点:

  - **更低的代价取得更高的性能**:
- 在推理、知识类任务上取得同量级最优性能,超过Llama3.1-8B和Qwen2.5-7B. 值得关注的是InternLM3只用了4万亿词元进行训练,对比同级别模型训练成本节省75%以上。
  - **深度思考能力**:
  InternLM3支持通过长思维链求解复杂推理任务的深度思考模式,同时还兼顾了用户体验更流畅的通用回复模式。
@@ -423,7 +435,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
  model_dir = "internlm/internlm3-8b-instruct"
  tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
  # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
- # model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
  # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@@ -445,6 +457,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
@@ -491,7 +505,12 @@ curl http://localhost:23333/v1/chat/completions \
  ##### vLLM 推理
  我们还在推动PR(https://github.com/vllm-project/vllm/pull/12037) 合入vllm,现在请使用以下PR链接手动安装

  ```python
@@ -616,6 +635,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
@@ -644,6 +665,10 @@ response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
  print(response)
  ```
  ##### vLLM 推理

  我们还在推动PR(https://github.com/vllm-project/vllm/pull/12037) 合入vllm,现在请使用以下PR链接手动安装
@@ -687,7 +712,7 @@ print(outputs)
  ## 开源许可证

- 本仓库的代码依照 Apache-2.0 协议开源。模型权重对学术研究完全开放,也可申请免费的商业使用授权([申请表](https://wj.qq.com/s2/12725412/f7c1/))。其他问题与合作请联系 <[email protected]>。

  ## 引用
 
  model_dir = "internlm/internlm3-8b-instruct"
  tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
  # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
+ model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
  # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
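The comment in this hunk warns that a float32 load may run out of memory. A back-of-the-envelope sketch of the weight footprint per dtype follows. It assumes exactly 8 billion parameters (a round figure read off the "8b" in the model name, not a measured count) and counts weights only, so activations, KV cache, and quantization overhead are ignored and real usage will be higher.

```python
# Rough weight-only memory estimate. PARAMS is an assumed round number
# based on the "8b" in the model name, not the exact parameter count.
PARAMS = 8_000_000_000

# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_gib(dtype: str, params: int = PARAMS) -> float:
    """Return the approximate size of the weights in GiB for a dtype."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_gib(dtype):5.1f} GiB")
```

Halving the bytes per parameter halves the estimate, which is the reason the snippet now loads the model with `torch_dtype=torch.float16` by default.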
 
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
+ prompt = tokenizer.batch_decode(tokenized_chat)[0]
+ print(prompt)
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
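The two added lines decode `tokenized_chat`, so the prompt is printed before the response. For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, which is why the list comprehension slices each output at `len(input_ids)`. A toy sketch of that slicing with plain Python lists follows; the token ids and the `fake_generate` helper are made up for illustration and stand in for the real tokenizer and model.

```python
# Toy stand-ins: plain lists of ints instead of tensors, and a fake
# generate() that echoes each prompt and appends the new token ids.
def fake_generate(batch, new_tokens):
    return [prompt + new_tokens for prompt in batch]


tokenized_chat = [[101, 7, 8, 9]]  # hypothetical prompt ids
generated_ids = fake_generate(tokenized_chat, [42, 43, 102])

# Same slicing as in the README: drop the echoed prompt from each output.
response_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]

print(tokenized_chat[0])  # prompt ids
print(response_ids[0])    # newly generated ids only
```

Decoding the two lists separately, as the commit does, makes it easy to check which chat template was applied to the prompt.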
 
+ #### Ollama inference
+
+ TODO
+
  #### vLLM inference

  We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
 
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
+ prompt = tokenizer.batch_decode(tokenized_chat)[0]
+ print(prompt)
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
 
  print(response)
  ```
+ #### Ollama inference
+
+ TODO
+
  #### vLLM inference

  We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
 
  ## Open Source License

+ Code and model weights are licensed under Apache-2.0.

  ## Citation
 
  InternLM3,即书生·浦语大模型第3代,开源了80亿参数,面向通用使用与高阶推理的指令模型(InternLM3-8B-Instruct)。模型具备以下特点:

  - **更低的代价取得更高的性能**:
+ 在推理、知识类任务上取得同量级最优性能,超过Llama3.1-8B和Qwen2.5-7B。值得关注的是InternLM3只用了4万亿词元进行训练,对比同级别模型训练成本节省75%以上。
  - **深度思考能力**:
  InternLM3支持通过长思维链求解复杂推理任务的深度思考模式,同时还兼顾了用户体验更流畅的通用回复模式。
 
  model_dir = "internlm/internlm3-8b-instruct"
  tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
  # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
+ model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
  # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
 
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
+ prompt = tokenizer.batch_decode(tokenized_chat)[0]
+ print(prompt)
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
 
+ ##### Ollama 推理
+
+ TODO
+
  ##### vLLM 推理
+
  我们还在推动PR(https://github.com/vllm-project/vllm/pull/12037) 合入vllm,现在请使用以下PR链接手动安装

  ```python
 
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
  ]
+ prompt = tokenizer.batch_decode(tokenized_chat)[0]
+ print(prompt)
  response = tokenizer.batch_decode(generated_ids)[0]
  print(response)
  ```
 
  print(response)
  ```
+ ##### Ollama 推理
+
+ TODO
+
  ##### vLLM 推理

  我们还在推动PR(https://github.com/vllm-project/vllm/pull/12037) 合入vllm,现在请使用以下PR链接手动安装
 
  ## 开源许可证

+ 本仓库的代码和权重依照 Apache-2.0 协议开源。

  ## 引用