a-mannion committed on
Commit 84facb3 · verified · 1 Parent(s): 4434c3a

Create README.md

Files changed (1)
  1. README.md +134 -0
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ license: mit
+ language:
+ - fr
+ library_name: transformers
+ tags:
+ - linformer
+ - legal
+ - RoBERTa
+ - pytorch
+ ---
+
+ # Jargon-legal
+
+ [Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the LinFormer attention mechanism with the RoBERTa model architecture.
+
+ Jargon is available in several versions with different context sizes and types of pre-training corpora.
+
+ | **Model** | **Initialised from...** | **Training Data** |
+ |-----------|:-----------------------:|:-----------------:|
+ | [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch | 8.5GB Web Corpus |
+ | [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base | 5.4GB Medical Corpus |
+ | [jargon-general-legal](https://huggingface.co/PantagrueLLM/jargon-general-legal) | jargon-general-base | 18GB Legal Corpus |
+ | [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base | Medical+Legal Corpora |
+ | [jargon-legal](https://huggingface.co/PantagrueLLM/jargon-legal) (this model) | scratch | 18GB Legal Corpus |
+ | [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch | 18GB Legal Corpus |
+ | [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch | 5.4GB Medical Corpus |
+ | [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch | 5.4GB Medical Corpus |
+ | [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |
+ | [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |
+
+ ## Evaluation
+
+ The Jargon models were evaluated on a range of specialized downstream tasks.
+
+ #### Legal Domain Benchmark
+
+ Results are averaged over five runs with different random seeds.
+
+ | | [ECtHR-FR](https://huggingface.co/datasets/audibeal/fr-echr) | [OACS](https://www.jeuxdemots.org/OACS/oacs.php) | [SJP](https://aclanthology.org/2021.nllp-1.3/) |
+ |-------------------------|:-----------------------:|:-----------------------:|:-----------------------:|
+ | **Task Type**           | Document Classification | Document Classification | Document Classification |
+ | **Metric**              | Macro-F1                | Macro-F1                | Macro-F1                |
+ | jargon-general-base     | 42.9                    | 50.8                    | 55.1                    |
+ | jargon-multidomain-base | 44.5                    | 55.6                    | 58.1                    |
+ | jargon-general-legal    | 43.1                    | 49.9                    | 54.5                    |
+ | jargon-legal            | 44.6                    | 51.6                    | 56.7                    |
+ | jargon-legal-4096       | 45.9                    | 54.1                    | 68.2                    |
+
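For readers unfamiliar with the metric, the sketch below shows one plausible way to compute Macro-F1 for a single run and then average it over several runs. It is an illustration only: the labels and predictions are toy data, not the benchmark data, and the helper name `macro_f1` is hypothetical. The README does not describe the exact evaluation script, so the per-run averaging shown here is an assumption.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


# Toy gold labels and the predictions of two imaginary runs (different seeds).
y_true = ["a", "a", "b", "b", "c", "c"]
runs = [
    ["a", "a", "b", "b", "c", "c"],  # every document classified correctly
    ["a", "b", "b", "b", "c", "c"],  # one "a" document mislabelled as "b"
]

per_run = [macro_f1(y_true, y_pred) for y_pred in runs]
average = mean(per_run)  # the kind of figure a results table would report
print(f"per-run Macro-F1: {per_run}, average: {average:.4f}")
```

Because every class contributes equally to the mean, Macro-F1 penalises poor performance on rare classes more than accuracy or micro-averaged F1 would.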
+ For more details, see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
+
+ ## Using Jargon models with HuggingFace transformers
+
+ You can get started with this model using the code snippet below:
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
+
+ # trust_remote_code=True allows loading the custom model code hosted in the model repository
+ tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-legal", trust_remote_code=True)
+ model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-legal", trust_remote_code=True)
+
+ # Build a fill-mask pipeline and predict the token hidden behind <mask>
+ jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+ output = jargon_maskfiller("Il est allé au <mask> hier")
+ print(output)
+ ```
+
+ You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task.
+
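As a rough sketch of how the choice of class might be wired up, the helper below maps a task name to one of the Auto classes named above and loads the model with it. The task keys, the `TASK_TO_AUTO_CLASS` dictionary and the `load_jargon` function are hypothetical names introduced for illustration; only the Auto class names, the model ID and `trust_remote_code=True` come from the README. `transformers` is imported lazily so that nothing is downloaded until the function is called.

```python
# Hypothetical task names mapped to the transformers Auto classes listed in the README.
TASK_TO_AUTO_CLASS = {
    "fill-mask": "AutoModelForMaskedLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
    "feature-extraction": "AutoModel",
}


def load_jargon(task, model_id="PantagrueLLM/jargon-legal", **model_kwargs):
    """Load the tokenizer and a task-specific model (illustrative helper)."""
    if task not in TASK_TO_AUTO_CLASS:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(TASK_TO_AUTO_CLASS)}"
        )
    import transformers  # lazy import: only needed when a model is actually loaded

    auto_cls = getattr(transformers, TASK_TO_AUTO_CLASS[task])
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True
    )
    # Extra keyword arguments (e.g. num_labels) are forwarded to from_pretrained.
    model = auto_cls.from_pretrained(model_id, trust_remote_code=True, **model_kwargs)
    return tokenizer, model
```

For a document classification task such as those in the benchmark above, a call might look like `tokenizer, model = load_jargon("sequence-classification", num_labels=2)`, where the value of `num_labels` depends on the dataset. The classification head is newly initialised in that case, so the model still has to be fine-tuned before its predictions are meaningful.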
+ - **Language(s):** French
+ - **License:** MIT
+ - **Developed by:** Vincent Segonne
+ - **Funded by**
+   - GENCI-IDRIS (Grant 2022 A0131013801)
+   - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
+   - MIAI@Grenoble Alpes ANR-19-P3IA-0003
+   - PROPICTO ANR-20-CE93-0005
+   - Lawbot ANR-20-CE38-0013
+   - Swiss National Science Foundation (grant PROPICTO N°197864)
+ - **Authors**
+   - Vincent Segonne
+   - Aidan Mannion
+   - Laura Cristina Alonzo Canul
+   - Alexandre Audibert
+   - Xingyu Liu
+   - Cécile Macaire
+   - Adrien Pupier
+   - Yongxin Zhou
+   - Mathilde Aguiar
+   - Felix Herron
+   - Magali Norré
+   - Massih-Reza Amini
+   - Pierrette Bouillon
+   - Iris Eshkol-Taravella
+   - Emmanuelle Esperança-Rodier
+   - Thomas François
+   - Lorraine Goeuriot
+   - Jérôme Goulian
+   - Mathieu Lafourcade
+   - Benjamin Lecouteux
+   - François Portet
+   - Fabien Ringeval
+   - Vincent Vandeghinste
+   - Maximin Coavoux
+   - Marco Dinarelli
+   - Didier Schwab
+
+ ## Citation
+
+ If you use this model for your own research work, please cite as follows:
+
+ ```bibtex
+ @inproceedings{segonne:hal-04535557,
+   TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
+   AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
+   URL = {https://hal.science/hal-04535557},
+   BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
+   ADDRESS = {Turin, Italy},
+   YEAR = {2024},
+   MONTH = May,
+   KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
+   PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
+   HAL_ID = {hal-04535557},
+   HAL_VERSION = {v1},
+ }
+ ```
+