---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---
Pleias-360m-Preview is an early preview of a 360 million parameter base model trained by Pleias on Common Corpus.
Like all the base and specialized models from Pleias, Pleias-360m-Preview has been trained only on open data that is either out of copyright (public domain) or released under a permissive license.
## Description
Pleias-360m-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.
It includes the following features, which apply to any responsibly trained variant:
- Only trained on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
- Extensive multilingual support for main European languages.
- A new tokenizer designed for enhanced document processing tasks and better multilingual support.
- Extremely low level of toxicity and problematic content.
Pleias-360m-Preview has demonstrated unusually strong multilingual generation abilities for its size range. Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.
Given its small size, Pleias-360m-Preview can run on CPU without any lossy compression. We provide a first GGUF variant as part of our release.
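For CPU-only inference with the GGUF variant, a sketch along the following lines should work with llama-cpp-python; the file name `pleias-360m-preview.gguf` and the decoding settings are placeholders to adapt to the actual file shipped with the release.

```python
# Minimal CPU inference sketch with llama-cpp-python (assumed GGUF file name).
from llama_cpp import Llama

llm = Llama(model_path="pleias-360m-preview.gguf", n_ctx=2048)

out = llm(
    "Saturn is fallen, am I too to fall?\n",
    max_tokens=128,
    temperature=0.0,     # greedy decoding, as recommended for this base model
    repeat_penalty=1.1,  # slight repetition penalty
)
print(out["choices"][0]["text"])
```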
## Recommended use
As a base model, Pleias-360m-Preview only supports continuation prompts.
Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results we recommend using a low or zero temperature with a slight repetition penalty (1.1-1.2).
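As a quick illustration of continuation prompting, here is a minimal sketch with the transformers library; the Hub id `PleIAs/Pleias-360m-Preview` and the exact decoding settings are assumptions to adapt to the released checkpoint.

```python
# Minimal continuation-prompt sketch (assumed Hub id and decoding settings).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-360m-Preview"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Saturn is fallen, am I too to fall?\n"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,          # greedy decoding (temperature effectively zero)
    repetition_penalty=1.2,   # within the recommended 1.1-1.2 range
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```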
Pleias-360m-Preview has been successfully adapted for continuous pretraining and full-fine-tuning on document processing tasks such as RAG, translation or OCR correction. Given the small size of the model we do not recommend fine-tuning methods based on LORA.
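A minimal full fine-tuning sketch (no LoRA) with the transformers Trainer could look as follows; the dataset, hyperparameters, sequence length and Hub id are illustrative assumptions, not the settings used by Pleias.

```python
# Minimal full fine-tuning sketch; dataset and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "PleIAs/Pleias-360m-Preview"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any text dataset with a "text" column works; wikitext is only a stand-in.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pleias-360m-ft",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=3e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```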
## Examples
With only 360 million parameters, Pleias-360m-Preview is able to generate literary texts in multiple languages:
English poetry:
Saturn is fallen, am I too to fall?
Am I to leave this haven of my rest,
This cradle of my glory, this soft clime,
This calm luxuriance of blissful light,
And the sweetness that was not in me?
I have been a slave and an enemy.
The world has made me its prey ; it's wrong;
It's cruelty! It's madness! — but you know
That all things are good for one man or another :
You see how he can be happy without them:
He cannot be miserable unless they come
To his heart with their own poisonous breaths.
But if there were no other way left us,
We should never find our happiness again.
If we could only live on earth alone,
With nothing else about us than ourselves,
Our very being would become more real.
There must be some other place where we may
Live happily as before, and enjoy life
As well as now. But what shall we do here?
What will happen when we die? What will hap-
[pen then?
When death comes, let him who knows the end
French classical play in the 17th-century writing style (including the long s):
Helas! pour mon malheur ie l’ay trop écouté.
Ie n’ay point du ſilence affecté le myſtère.
Ie croyois ſans péril pouuoir eſtre ſincere.
Et ſans armer mes yeux d’vn moment de rigueur,
Je ne voy pas que ce foit vn crime à la nature:
Mais je vous prie, qu'on me donne quelque recompenſe :
Pourquoy donc auec tant de gloire &c en vne heure?
Cela fait voir bien clairement que ma deſpenſe
Eſt plus grande que celle des autres hommes ;
Car il eſt certain que les Princes font moins heureux;
Que ceux qui ont eu leur vie par leurs mains confacrée.
Il y a deux fortes de perſonnes dans le monde :
Les uns fe nomment gens de lettres ou de proueſleurz,
Dont on dit que c'eſt eux-mesme qui nous donnent tousjours
Leur efprit & leur cœur aux affaires publiques ;
Ainſi ils s'appellent gens de cour, comme ſi cela
N'eſtoit autre choſe que ces grands Seigneurs qui gouuernenc
L'Etat,&qui ſe conſeruent toujours avec honneur.
Ils ſont fort peu de choſes ſans ambition ;
## Training
Pleias-360m-Preview was fully pretrained at Jean Zay on 64 H100s for 46 hours with Nanotron, the pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.
The training schedule included 518,000 steps (batch size 1,024) on a filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
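The reported token count is consistent with these settings if one assumes a 2,048-token sequence length; the sequence length itself is not stated in this card, so it is an assumption in the short sanity check below.

```python
# Sanity check of the reported token count, assuming a 2,048-token sequence length.
steps = 518_000
batch_size = 1_024
sequence_length = 2_048  # assumption, not stated in this card

print(steps * batch_size * sequence_length)  # 1086324736000 tokens
```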
## Update
Pleias-360m-Preview is currently released as an early preview.
The model will undergo several more rounds of post-training to enhance reasoning capacities and fine-tunability, in anticipation of a generalist instruct version.