---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---
# BLOOM-CLP German (6.4B parameters)
This is a monolingual German language model. It was trained with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, starting from [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1).
You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/).
<span style="color:blue">UPDATE: We recently released an instruction-tuned version of this model: [malteos/bloom-6b4-clp-german-oasst-v0.1](https://huggingface.co/malteos/bloom-6b4-clp-german-oasst-v0.1)</span>.
## How to use
You can use this model directly with a text-generation pipeline. Because generation involves random sampling, the example sets a seed for reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)
[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
{'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
{'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},]
```
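The example above uses an English prompt. As a hedged sketch, the same model can also be called through the lower-level `transformers` classes with a German prompt, which matches the model's training language. The helper names `build_generation_kwargs` and `generate_german`, the German prompt, and the sampling settings are illustrative assumptions and not part of the original model card. No output is shown because it depends on the library version and hardware.

```python
MODEL_ID = "malteos/bloom-6b4-clp-german"


def build_generation_kwargs(max_new_tokens: int = 30, num_return_sequences: int = 3) -> dict:
    """Collect sampling settings (hypothetical defaults, chosen for illustration)."""
    return {
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": num_return_sequences,
        "do_sample": True,  # sampling is required to get several distinct sequences
    }


def generate_german(prompt: str, **overrides) -> list:
    """Load the model and return the decoded continuations for a prompt."""
    # Imported lazily so the settings helper can be used without the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    set_seed(42)  # same seed as in the pipeline example
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # If the weights are loaded in float32, 6.4B parameters need about 26 GB.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    settings = {**build_generation_kwargs(), **overrides}
    outputs = model.generate(**inputs, **settings)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `generate_german("Hallo, ich bin ein Sprachmodell,")` downloads the weights on first use and returns three sampled continuations. Keyword arguments such as `max_new_tokens=50` override the defaults.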
## Training dataset
- Approx. 50B German tokens
- Web-crawled content from the German subset of [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
- Both Web-crawled datasets are deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
- German court decisions from [Open Legal Data](http://openlegaldata.io/)
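The OSCAR filtering step listed above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code. It assumes each document is one JSON line whose `metadata.annotation` field holds a list of quality tags (or `null`), as in the published OSCAR 22.01 layout. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Tags named in the dataset description above.
EXCLUDED_ANNOTATIONS = {"header", "footer", "noisy", "adult"}


def keep_document(record: dict) -> bool:
    """Return True if the record carries none of the excluded tags."""
    # Assumption: annotations live under metadata.annotation and may be null.
    annotations = (record.get("metadata") or {}).get("annotation") or []
    return EXCLUDED_ANNOTATIONS.isdisjoint(annotations)


def filter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records from JSON lines that pass the annotation filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_document(record):
            yield record


sample = [
    '{"content": "Guter Text.", "metadata": {"annotation": null}}',
    '{"content": "Menu | Login", "metadata": {"annotation": ["header", "tiny"]}}',
    '{"content": "Kurzer Text.", "metadata": {"annotation": ["tiny"]}}',
]
kept = [record["content"] for record in filter_jsonl(sample)]
print(kept)  # ['Guter Text.', 'Kurzer Text.']
```

Documents tagged only with other annotations, such as `tiny` in the sample, pass this filter because the description excludes just the four tags listed.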
## Code
- [BigScience's Megatron-Deepspeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)
## Hardware
- 32xA100-40GB GPUs
- 12.5 days of training
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard)
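For a rough sense of the compute budget, the figures above can be converted to GPU-hours. The calculation assumes that all 32 GPUs were in use for the full 12.5 days of wall-clock time, which the model card does not state explicitly.

```python
NUM_GPUS = 32
TRAINING_DAYS = 12.5
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * HOURS_PER_DAY


print(gpu_hours(NUM_GPUS, TRAINING_DAYS))  # 9600.0
```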
## Evaluation
Validation perplexity (PPL) compared to from-scratch training (lower is better):
<img alt="Tokens vs PPL" src="https://github.com/malteos/clp-transfer/raw/main/german-6b-ppl.png">
Additional evaluations can be found in [our paper](https://arxiv.org/abs/2301.09626).
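As a reminder of what the metric measures, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses this standard definition. The exact tokenization and averaging used for the plot are not described here, so treat the function as an illustration rather than the evaluation code.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has a perplexity of
# about 4: it is as uncertain as a uniform choice among four options.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

A lower value means the model assigns higher probability to the validation text.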
## How to cite
If you use our code or models, please cite [our paper](https://arxiv.org/abs/2301.09626):
```bibtex
@misc{Ostendorff2023clp,
doi = {10.48550/ARXIV.2301.09626},
author = {Ostendorff, Malte and Rehm, Georg},
title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
publisher = {arXiv},
year = {2023}
}
```
## License
[BigScience BLOOM RAIL 1.0](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)