BlueLM

🖥 github • 📜 LICENSE • 🎯 vivo Developers • 🗨 WeChat

Introduction

BlueLM is a large-scale pre-trained language model independently developed by the vivo AI Global Research Institute (vivo AI Lab). This release includes a 7B Base model and a 7B Chat model, and we have also open-sourced 32K long-context versions of both.

  • High-quality Data: BlueLM is trained on a high-quality corpus of 2.6 trillion tokens. The training corpus consists mainly of Chinese and English data, with a small amount of Japanese and Korean data.
  • Stronger Performance: BlueLM-7B-Chat achieves leading results on the C-Eval and CMMLU benchmarks and is highly competitive with open-source models of the same size.
  • Longer Context: BlueLM-7B-Base-32K and BlueLM-7B-Chat-32K extend the context length from 2K to 32K. They support longer-context understanding while keeping basic capabilities comparable.
  • Model License: BlueLM weights are open to developers for academic research and commercial use.

The released versions and their Hugging Face download links are listed in the table below:

Benchmark Results

We evaluated BlueLM-7B-Chat-32K on the LongBench benchmark. The results are shown in the table below:

| Model | Average | Summary | Single-Doc QA | Multi-Doc QA | Code | Few-shot | Synthetic |
|---|---|---|---|---|---|---|---|
| BlueLM-7B-Chat-32K | 41.2 | 18.8 | 35.6 | 36.2 | 54.2 | 56.9 | 45.5 |
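
As a quick sanity check, the Average column is consistent with an unweighted mean of the six category scores. The sketch below assumes that is how the average was formed (the source does not state the aggregation), and the variable names are illustrative only.

```python
from statistics import mean

# Category scores for BlueLM-7B-Chat-32K, copied from the LongBench row above.
category_scores = {
    "Summary": 18.8,
    "Single-Doc QA": 35.6,
    "Multi-Doc QA": 36.2,
    "Code": 54.2,
    "Few-shot": 56.9,
    "Synthetic": 45.5,
}

# Assumption: the reported Average is the unweighted mean over the categories.
average = round(mean(list(category_scores.values())), 1)
print(average)
```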

Inference and Deployment

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> # Load the tokenizer and the AWQ-quantized chat model onto the first GPU.
>>> tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K-AWQ", trust_remote_code=True, use_fast=False)
>>> model = AutoModelForCausalLM.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K-AWQ", device_map="cuda:0", torch_dtype=torch.float16, trust_remote_code=True, low_cpu_mem_usage=True, use_cache=False)
>>> model = model.eval()
>>> # Prompt: "Write a review of Liu Cixin's novel The Three-Body Problem, about 1000 characters."
>>> inputs = tokenizer("[|Human|]:写一篇关于刘慈欣《三体》小说的读后感,1000字左右[|AI|]:", return_tensors="pt")
>>> inputs = inputs.to("cuda:0")
>>> # Generate up to 2048 new tokens with a mild repetition penalty, then decode.
>>> pred = model.generate(**inputs, max_new_tokens=2048, repetition_penalty=1.1)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
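
The prompt in the example above wraps the user message in `[|Human|]:` and `[|AI|]:` markers. A minimal sketch of helpers for building such a single-turn prompt and checking a token budget might look like the following. `build_prompt` and `fits_context` are hypothetical names, not part of the BlueLM code, and the 32,768-token window is an assumption about what "32K" means here.

```python
CONTEXT_TOKENS = 32 * 1024  # assumption: "32K" is taken to mean 32,768 tokens


def build_prompt(query: str) -> str:
    """Wrap a single user message in the markers used by the example above."""
    return f"[|Human|]:{query}[|AI|]:"


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 context_tokens: int = CONTEXT_TOKENS) -> bool:
    """Return True if the prompt plus the generation budget fits the window."""
    return prompt_tokens + max_new_tokens <= context_tokens


prompt = build_prompt("Summarize the following report: ...")
print(prompt)
# The token count would normally come from len(tokenizer(prompt)["input_ids"]);
# 30_000 is a made-up value for illustration.
print(fits_context(prompt_tokens=30_000, max_new_tokens=2048))
```

Checking the budget before calling `generate` is one way to avoid silently exceeding the context window on long inputs; whether the model degrades or errors past that point is not stated in the source.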

For more instructions, please refer to our GitHub repository.

License

To make this project more open and flexible, and to serve more developers and users, the open-source license for this project's large models was updated effective December 25, 2024. It changed from the original Community License for BlueLM Model (the vivo_BlueLM model license agreement) to the OpenAtom Model License.

Under the new open-source license, users can use, modify, and distribute this project's large models with fewer restrictions. Please make sure you read and understand the new license. We welcome any feedback on this change. You can contact us by email ([email protected]).

Thank you for your support of this project!
