Commit 0c0c559 by dhigurashi · 1 Parent(s): 839f81e

Upload folder using huggingface_hub

LICENSE.txt ADDED
@@ -0,0 +1,412 @@
+ Copyright 2023- Preferred Networks, Inc. All rights reserved.
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ ---
+
+ This software contains modified code from the Hugging Face transformers library, which is released under the Apache v2.0 license.
+
+ ---
+ Copyright 2018- The Hugging Face team. All rights reserved.
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,89 @@
+ # PLaMo-13B
+
+ ## Model Description
+ PLaMo-13B is a Llama 13B model pre-trained on English and Japanese open datasets, developed by Preferred Networks, Inc.
+ PLaMo-13B is released under the Apache v2.0 license.
+
+ ## Usage
+
+ ### Use a pipeline as a high-level helper
+ ```
+ import transformers
+
+ # trust_remote_code=True is needed because config.json maps the Auto classes to modeling_plamo in the repository.
+ pipeline = transformers.pipeline("text-generation", model="pfnet/plamo-13b", trust_remote_code=True)
+ print(pipeline("The future of artificial intelligence technology is ", max_new_tokens=32))
+ ```
+
+ ### Load model directly
+ ```
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("pfnet/plamo-13b", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("pfnet/plamo-13b", trust_remote_code=True)
+
+ # Japanese prompt, roughly "Artificial intelligence technology from now on will"
+ text = "これからの人工知能技術は"
+ input_ids = tokenizer(text, return_tensors="pt").input_ids
+
+ # Sample up to 32 new tokens; [0] selects the single sequence in the batch.
+ generated_tokens = model.generate(
+     inputs=input_ids,
+     max_new_tokens=32,
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     temperature=1.0,
+ )[0]
+ print(tokenizer.decode(generated_tokens))
+ ```
+
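The sampling arguments passed to `generate` above (`top_k=50`, `top_p=0.95`, `temperature=1.0`) can be illustrated with a small standalone sketch. This is a simplified pure-Python illustration of the general top-k / nucleus filtering idea, not the transformers implementation; the function name `filter_top_k_top_p` is hypothetical.

```python
import math


def filter_top_k_top_p(logits, top_k=50, top_p=0.95, temperature=1.0):
    """Return {token_id: probability} for the tokens that survive filtering.

    Simplified sketch (an assumption, not the library's code path).
    """
    # Temperature rescales the logits before the softmax (1.0 is a no-op).
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors, subtracting the max for numerical stability.
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1 again.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


candidates = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(sorted(candidates))
```

With `do_sample=True`, the next token would then be drawn at random from the surviving candidates, which is why repeated runs of the README example can print different continuations.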
+ ## Model Details
+
+ - Model size: 13B
+ - Trained tokens: 1.5T tokens (English: 1.32T tokens, Japanese: 0.18T tokens)
+ - Developed by: Preferred Networks, Inc.
+ - Model type: Causal decoder-only
+ - Language(s): English, Japanese
+ - License: Apache v2.0
+
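As a quick consistency check on the token counts listed under Model Details, the short sketch below sums the per-language figures and derives each language's share. The variable names are illustrative only.

```python
# Token counts in trillions, copied from the "Trained tokens" bullet.
trained_tokens = {"English": 1.32, "Japanese": 0.18}

total = sum(trained_tokens.values())
shares = {lang: count / total for lang, count in trained_tokens.items()}

for lang, share in shares.items():
    print(f"{lang}: {share:.0%} of {total:.2f}T tokens")
```

The two figures add up to the stated 1.5T, with English making up roughly 88% of the training tokens and Japanese roughly 12%.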
+ ## Training Dataset
+
+ ### English
+
+ - C4 - English
+ - Project Gutenberg
+ - RedPajama - Arxiv
+ - RedPajama - CommonCrawl - English
+ - RedPajama - Github
+ - RedPajama - StackExchange
+ - RedPajama - Wikipedia
+
+ ### Japanese
+
+ - mC4 - Japanese
+ - Wikipedia - Japanese
+
+ ## Tokenizer
+ PLaMo-13B uses a SentencePiece tokenizer trained on a subset of the datasets used for model pre-training.
+
+ ## Bias, Risks, and Limitations
+ PLaMo-13B is a new technology that carries risks with use. Testing conducted to date has been in English and Japanese, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, PLaMo-13B's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any application of PLaMo-13B, developers should perform safety testing and tuning tailored to their specific use of the model.
+
+ ## How to cite
+ ```
+ @online{PLaMo2023Introducing,
+   author = {Preferred Networks, Inc},
+   title = {PLaMo-13B},
+   year = {2023},
+   url = {https://huggingface.co/pfnet/plamo-13b},
+   urldate = {2023-09-28}
+ }
+ ```
+
+ ## Citations
+ ```
+ @article{touvron2023llama,
+   title={LLaMA: Open and Efficient Foundation Language Models},
+   author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
+   journal={arXiv preprint arXiv:2302.13971},
+   year={2023}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "architectures": [
+     "PlamoForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "modeling_plamo.PlamoConfig",
+     "AutoModelForCausalLM": "modeling_plamo.PlamoForCausalLM"
+   },
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 5120,
+   "initializer_range": 0.02,
+   "intermediate_size": 16640,
+   "max_position_embeddings": 8192,
+   "model_type": "plamo",
+   "n_shared_head": 8,
+   "num_attention_heads": 40,
+   "num_hidden_layers": 40,
+   "num_key_value_heads": 40,
+   "pad_token_id": 0,
+   "rms_norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "PlamoTokenizer",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.32.0",
+   "use_cache": false,
+   "vocab_size": 50432
+ }
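The values in `config.json` can be cross-checked against the `total_size` of 26199091200 bytes reported in `model.safetensors.index.json` in this same commit. The sketch below estimates the parameter count under several assumptions that the diff does not confirm: the K and V projections are taken to be `hidden_size x (hidden_size // n_shared_head)`, there are no bias terms (the weight map lists only `.weight` tensors), each layer has a single `norm.weight`, and there is one final norm vector. The helper name `count_parameters` is hypothetical.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 5120,
    "intermediate_size": 16640,
    "n_shared_head": 8,
    "num_hidden_layers": 40,
    "vocab_size": 50432,
}


def count_parameters(cfg):
    """Estimate the parameter count from the config (sketch under assumptions)."""
    h = cfg["hidden_size"]
    # ASSUMPTION: K/V projections are narrowed by a factor of n_shared_head.
    kv_width = h // cfg["n_shared_head"]
    attention = 2 * h * h + 2 * h * kv_width   # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * h * cfg["intermediate_size"]     # gate_proj, up_proj, down_proj
    layer_norm = h                             # one norm.weight per layer
    per_layer = attention + mlp + layer_norm
    embeddings = 2 * cfg["vocab_size"] * h     # embed_tokens + untied lm_head
    final_norm = h                             # ASSUMPTION: a single final norm
    return cfg["num_hidden_layers"] * per_layer + embeddings + final_norm


params = count_parameters(config)
size_bytes = params * 2  # torch_dtype is bfloat16, i.e. 2 bytes per parameter
print(f"{params:,} parameters, {size_bytes:,} bytes")
```

Under these assumptions the estimate comes to about 13.1B parameters and reproduces the index's `total_size` exactly, which is consistent with (though not proof of) the assumed tensor shapes.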
generation_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.32.0",
+   "use_cache": false
+ }
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8bfaa9d4d0d02d99ca51ce22aed016442843deda718a17ad33302d9c6840c945
+ size 9953775928
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:61545a05ef8d313bca85939e719ad6b97d4259a70b7a0020d94d6f58ed7e3140
+ size 9896104952
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63630dca28554349c1d264f7105f73d083081302c808f4d5f408b6a272cc5062
+ size 6349249520
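The three `.safetensors` entries above are Git LFS pointer files: short text stubs that record the spec version, a SHA-256 object id, and the byte size of the real file. The sketch below parses them and compares the summed shard sizes with the `total_size` from `model.safetensors.index.json`. The helper `parse_lfs_pointer` is a hypothetical name for illustration, not part of any library.

```python
# Pointer file contents copied from the three shard entries in this commit.
POINTER_FILES = {
    "model-00001-of-00003.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:8bfaa9d4d0d02d99ca51ce22aed016442843deda718a17ad33302d9c6840c945\n"
        "size 9953775928\n"
    ),
    "model-00002-of-00003.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:61545a05ef8d313bca85939e719ad6b97d4259a70b7a0020d94d6f58ed7e3140\n"
        "size 9896104952\n"
    ),
    "model-00003-of-00003.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:63630dca28554349c1d264f7105f73d083081302c808f4d5f408b6a272cc5062\n"
        "size 6349249520\n"
    ),
}

# "total_size" from model.safetensors.index.json in this commit.
INDEX_TOTAL_SIZE = 26199091200


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointers = {name: parse_lfs_pointer(text) for name, text in POINTER_FILES.items()}
shard_bytes = sum(pointer["size"] for pointer in pointers.values())
overhead = shard_bytes - INDEX_TOTAL_SIZE
print(f"shards: {shard_bytes:,} bytes, index total: {INDEX_TOTAL_SIZE:,}, difference: {overhead:,}")
```

The shard files are slightly larger in total than the index's `total_size`. A plausible explanation, stated here as an assumption, is that `total_size` counts only tensor data while each shard also carries its own safetensors header.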
model.safetensors.index.json ADDED
@@ -0,0 +1,330 @@
+ {
+   "metadata": {
+     "total_size": 26199091200
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00003-of-00003.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.norm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.norm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.norm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.norm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.norm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.13.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.13.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.layers.13.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
51
+ "model.layers.layers.13.norm.weight": "model-00001-of-00003.safetensors",
52
+ "model.layers.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
53
+ "model.layers.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
54
+ "model.layers.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
55
+ "model.layers.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
56
+ "model.layers.layers.14.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
57
+ "model.layers.layers.14.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
58
+ "model.layers.layers.14.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
59
+ "model.layers.layers.14.norm.weight": "model-00001-of-00003.safetensors",
60
+ "model.layers.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
61
+ "model.layers.layers.14.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
62
+ "model.layers.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
63
+ "model.layers.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
64
+ "model.layers.layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
65
+ "model.layers.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
66
+ "model.layers.layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
67
+ "model.layers.layers.15.norm.weight": "model-00002-of-00003.safetensors",
68
+ "model.layers.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
69
+ "model.layers.layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
70
+ "model.layers.layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
71
+ "model.layers.layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
72
+ "model.layers.layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
73
+ "model.layers.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
74
+ "model.layers.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
75
+ "model.layers.layers.16.norm.weight": "model-00002-of-00003.safetensors",
76
+ "model.layers.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
77
+ "model.layers.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
78
+ "model.layers.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
79
+ "model.layers.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
80
+ "model.layers.layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
81
+ "model.layers.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
82
+ "model.layers.layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
83
+ "model.layers.layers.17.norm.weight": "model-00002-of-00003.safetensors",
84
+ "model.layers.layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
85
+ "model.layers.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
86
+ "model.layers.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
87
+ "model.layers.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
88
+ "model.layers.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
89
+ "model.layers.layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
90
+ "model.layers.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
91
+ "model.layers.layers.18.norm.weight": "model-00002-of-00003.safetensors",
92
+ "model.layers.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
93
+ "model.layers.layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
94
+ "model.layers.layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
95
+ "model.layers.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
96
+ "model.layers.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
97
+ "model.layers.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
98
+ "model.layers.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
99
+ "model.layers.layers.19.norm.weight": "model-00002-of-00003.safetensors",
100
+ "model.layers.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
101
+ "model.layers.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
102
+ "model.layers.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
103
+ "model.layers.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
104
+ "model.layers.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
105
+ "model.layers.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
106
+ "model.layers.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
107
+ "model.layers.layers.2.norm.weight": "model-00001-of-00003.safetensors",
108
+ "model.layers.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
109
+ "model.layers.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
110
+ "model.layers.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
111
+ "model.layers.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
112
+ "model.layers.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
113
+ "model.layers.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
114
+ "model.layers.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
115
+ "model.layers.layers.20.norm.weight": "model-00002-of-00003.safetensors",
116
+ "model.layers.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
117
+ "model.layers.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
118
+ "model.layers.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
119
+ "model.layers.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
120
+ "model.layers.layers.21.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
121
+ "model.layers.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
122
+ "model.layers.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
123
+ "model.layers.layers.21.norm.weight": "model-00002-of-00003.safetensors",
124
+ "model.layers.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
125
+ "model.layers.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
126
+ "model.layers.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
127
+ "model.layers.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
128
+ "model.layers.layers.22.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
129
+ "model.layers.layers.22.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
130
+ "model.layers.layers.22.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
131
+ "model.layers.layers.22.norm.weight": "model-00002-of-00003.safetensors",
132
+ "model.layers.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
133
+ "model.layers.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
134
+ "model.layers.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
135
+ "model.layers.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
136
+ "model.layers.layers.23.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
137
+ "model.layers.layers.23.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
138
+ "model.layers.layers.23.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
139
+ "model.layers.layers.23.norm.weight": "model-00002-of-00003.safetensors",
140
+ "model.layers.layers.23.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
141
+ "model.layers.layers.23.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
142
+ "model.layers.layers.23.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
143
+ "model.layers.layers.23.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
144
+ "model.layers.layers.24.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
145
+ "model.layers.layers.24.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
146
+ "model.layers.layers.24.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
147
+ "model.layers.layers.24.norm.weight": "model-00002-of-00003.safetensors",
148
+ "model.layers.layers.24.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
149
+ "model.layers.layers.24.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
150
+ "model.layers.layers.24.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
151
+ "model.layers.layers.24.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
152
+ "model.layers.layers.25.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
153
+ "model.layers.layers.25.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
154
+ "model.layers.layers.25.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
155
+ "model.layers.layers.25.norm.weight": "model-00002-of-00003.safetensors",
156
+ "model.layers.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
157
+ "model.layers.layers.25.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
158
+ "model.layers.layers.25.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
159
+ "model.layers.layers.25.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
160
+ "model.layers.layers.26.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
161
+ "model.layers.layers.26.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
162
+ "model.layers.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
163
+ "model.layers.layers.26.norm.weight": "model-00002-of-00003.safetensors",
164
+ "model.layers.layers.26.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
165
+ "model.layers.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
166
+ "model.layers.layers.26.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
167
+ "model.layers.layers.26.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
168
+ "model.layers.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
169
+ "model.layers.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
170
+ "model.layers.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
171
+ "model.layers.layers.27.norm.weight": "model-00002-of-00003.safetensors",
172
+ "model.layers.layers.27.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
173
+ "model.layers.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
174
+ "model.layers.layers.27.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
175
+ "model.layers.layers.27.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
176
+ "model.layers.layers.28.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
177
+ "model.layers.layers.28.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
178
+ "model.layers.layers.28.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
179
+ "model.layers.layers.28.norm.weight": "model-00002-of-00003.safetensors",
180
+ "model.layers.layers.28.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
181
+ "model.layers.layers.28.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
182
+ "model.layers.layers.28.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
183
+ "model.layers.layers.28.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
184
+ "model.layers.layers.29.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
185
+ "model.layers.layers.29.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
186
+ "model.layers.layers.29.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
187
+ "model.layers.layers.29.norm.weight": "model-00002-of-00003.safetensors",
188
+ "model.layers.layers.29.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
189
+ "model.layers.layers.29.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
190
+ "model.layers.layers.29.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
191
+ "model.layers.layers.29.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
192
+ "model.layers.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
193
+ "model.layers.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
194
+ "model.layers.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
195
+ "model.layers.layers.3.norm.weight": "model-00001-of-00003.safetensors",
196
+ "model.layers.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
197
+ "model.layers.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
198
+ "model.layers.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
199
+ "model.layers.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
200
+ "model.layers.layers.30.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
201
+ "model.layers.layers.30.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
202
+ "model.layers.layers.30.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
203
+ "model.layers.layers.30.norm.weight": "model-00003-of-00003.safetensors",
204
+ "model.layers.layers.30.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
205
+ "model.layers.layers.30.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
206
+ "model.layers.layers.30.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
207
+ "model.layers.layers.30.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
208
+ "model.layers.layers.31.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
209
+ "model.layers.layers.31.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
210
+ "model.layers.layers.31.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
211
+ "model.layers.layers.31.norm.weight": "model-00003-of-00003.safetensors",
212
+ "model.layers.layers.31.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
213
+ "model.layers.layers.31.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
214
+ "model.layers.layers.31.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
215
+ "model.layers.layers.31.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
216
+ "model.layers.layers.32.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
217
+ "model.layers.layers.32.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
218
+ "model.layers.layers.32.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
219
+ "model.layers.layers.32.norm.weight": "model-00003-of-00003.safetensors",
220
+ "model.layers.layers.32.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
221
+ "model.layers.layers.32.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
222
+ "model.layers.layers.32.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
223
+ "model.layers.layers.32.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
224
+ "model.layers.layers.33.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
225
+ "model.layers.layers.33.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
226
+ "model.layers.layers.33.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
227
+ "model.layers.layers.33.norm.weight": "model-00003-of-00003.safetensors",
228
+ "model.layers.layers.33.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
229
+ "model.layers.layers.33.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
230
+ "model.layers.layers.33.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
231
+ "model.layers.layers.33.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
232
+ "model.layers.layers.34.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
233
+ "model.layers.layers.34.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
234
+ "model.layers.layers.34.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
235
+ "model.layers.layers.34.norm.weight": "model-00003-of-00003.safetensors",
236
+ "model.layers.layers.34.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
237
+ "model.layers.layers.34.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
238
+ "model.layers.layers.34.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
239
+ "model.layers.layers.34.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
240
+ "model.layers.layers.35.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
241
+ "model.layers.layers.35.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
242
+ "model.layers.layers.35.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
243
+ "model.layers.layers.35.norm.weight": "model-00003-of-00003.safetensors",
244
+ "model.layers.layers.35.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
245
+ "model.layers.layers.35.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
246
+ "model.layers.layers.35.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
247
+ "model.layers.layers.35.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
248
+ "model.layers.layers.36.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
249
+ "model.layers.layers.36.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
250
+ "model.layers.layers.36.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
251
+ "model.layers.layers.36.norm.weight": "model-00003-of-00003.safetensors",
252
+ "model.layers.layers.36.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
253
+ "model.layers.layers.36.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
254
+ "model.layers.layers.36.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
255
+ "model.layers.layers.36.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
256
+ "model.layers.layers.37.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
257
+ "model.layers.layers.37.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
258
+ "model.layers.layers.37.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
259
+ "model.layers.layers.37.norm.weight": "model-00003-of-00003.safetensors",
260
+ "model.layers.layers.37.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
261
+ "model.layers.layers.37.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
262
+ "model.layers.layers.37.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
263
+ "model.layers.layers.37.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
264
+ "model.layers.layers.38.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
265
+ "model.layers.layers.38.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
266
+ "model.layers.layers.38.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
267
+ "model.layers.layers.38.norm.weight": "model-00003-of-00003.safetensors",
268
+ "model.layers.layers.38.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
269
+ "model.layers.layers.38.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
270
+ "model.layers.layers.38.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
271
+ "model.layers.layers.38.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
272
+ "model.layers.layers.39.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
273
+ "model.layers.layers.39.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
274
+ "model.layers.layers.39.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
275
+ "model.layers.layers.39.norm.weight": "model-00003-of-00003.safetensors",
276
+ "model.layers.layers.39.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
277
+ "model.layers.layers.39.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
278
+ "model.layers.layers.39.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
279
+ "model.layers.layers.39.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
280
+ "model.layers.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
281
+ "model.layers.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
282
+ "model.layers.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
283
+ "model.layers.layers.4.norm.weight": "model-00001-of-00003.safetensors",
284
+ "model.layers.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
285
+ "model.layers.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
286
+ "model.layers.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
287
+ "model.layers.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
288
+ "model.layers.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
289
+ "model.layers.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
290
+ "model.layers.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
291
+ "model.layers.layers.5.norm.weight": "model-00001-of-00003.safetensors",
292
+ "model.layers.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
293
+ "model.layers.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
294
+ "model.layers.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
295
+ "model.layers.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
296
+ "model.layers.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
297
+ "model.layers.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
298
+ "model.layers.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
299
+ "model.layers.layers.6.norm.weight": "model-00001-of-00003.safetensors",
300
+ "model.layers.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
301
+ "model.layers.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
302
+ "model.layers.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
303
+ "model.layers.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
304
+ "model.layers.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
305
+ "model.layers.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
306
+ "model.layers.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
307
+ "model.layers.layers.7.norm.weight": "model-00001-of-00003.safetensors",
308
+ "model.layers.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
309
+ "model.layers.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
310
+ "model.layers.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
311
+ "model.layers.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
312
+ "model.layers.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
313
+ "model.layers.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
314
+ "model.layers.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
315
+ "model.layers.layers.8.norm.weight": "model-00001-of-00003.safetensors",
316
+ "model.layers.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
317
+ "model.layers.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
318
+ "model.layers.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
319
+ "model.layers.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
320
+ "model.layers.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
321
+ "model.layers.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
322
+ "model.layers.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
323
+ "model.layers.layers.9.norm.weight": "model-00001-of-00003.safetensors",
324
+ "model.layers.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
325
+ "model.layers.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
326
+ "model.layers.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
327
+ "model.layers.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
328
+ "model.norm.weight": "model-00003-of-00003.safetensors"
329
+ }
330
+ }
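The index above maps each tensor name to the shard file that stores it, so a loader only needs to open the shard that holds a given tensor. The following is a minimal sketch of that lookup, not the loading code used by any particular library. `INDEX_JSON` is a hand-copied excerpt of five entries from the index, and `tensors_by_shard` is a hypothetical helper name introduced here for illustration.

```python
import json
from collections import defaultdict

# Excerpt of the sharded-checkpoint index shown above (five entries only).
INDEX_JSON = """
{
  "metadata": {"total_size": 26199091200},
  "weight_map": {
    "lm_head.weight": "model-00003-of-00003.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
    "model.layers.layers.30.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
    "model.layers.layers.30.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
    "model.norm.weight": "model-00003-of-00003.safetensors"
  }
}
"""


def tensors_by_shard(index: dict) -> dict:
    """Group tensor names by the shard file that stores them (hypothetical helper)."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    # Sort names so the result does not depend on key order in the JSON.
    return {shard: sorted(names) for shard, names in shards.items()}


index = json.loads(INDEX_JSON)
grouped = tensors_by_shard(index)

# Print how many of the excerpted tensors live in each shard.
for shard in sorted(grouped):
    print(shard, len(grouped[shard]))
```

Note that a layer can straddle a shard boundary: in the index above, layer 30 has its `mlp.down_proj` and `norm` weights in the third shard while its other weights sit in the second, so grouping by layer number alone would not be enough.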
modeling_plamo.py ADDED
@@ -0,0 +1,705 @@
+ from typing import Any, Dict, List, NamedTuple, Optional, Tuple, Union
+
+ import numpy as np
+ import torch
+ from torch import nn
+ from torch.nn import functional as F
+ from transformers import PretrainedConfig, PreTrainedModel
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+
+
+ class DecoderInput(NamedTuple):
+     hidden_states: torch.Tensor
+     position_ids: torch.Tensor
+     attention_mask: Optional[torch.Tensor] = None
+     past_key_values: Optional[List[torch.FloatTensor]] = None
+     output_hidden_states: Optional[bool] = False
+     output_attentions: Optional[bool] = False
+     use_cache: Optional[bool] = False
+     gradient_checkpointing: bool = False
+
+
+ class DecoderOutput(NamedTuple):
+     hidden_states: torch.Tensor
+     all_hidden_states: Optional[Tuple[torch.Tensor, ...]]
+     all_self_attns: Optional[Tuple[torch.Tensor, ...]]
+     next_decoder_cache: Optional[Tuple[torch.Tensor, ...]]
+
+
+ class PlamoConfig(PretrainedConfig):  # type: ignore
+     model_type: str = "plamo"
+
+     def __init__(
+         self,
+         vocab_size: int = 32000,
+         hidden_size: int = 4096,
+         intermediate_size: int = 13312,
+         num_hidden_layers: int = 32,
+         num_attention_heads: int = 32,
+         num_key_value_heads: Optional[int] = None,
+         max_position_embeddings: int = 2048,
+         initializer_range: float = 0.02,
+         rms_norm_eps: float = 1e-6,
+         use_cache: bool = True,
+         tokenizer_class: str = "PlamoTokenizer",
+         pad_token_id: Optional[int] = None,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         n_shared_head: int = 8,
+         tie_word_embeddings: bool = False,
+         **kwargs: Any,
+     ) -> None:
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+
+         self.n_shared_head = n_shared_head
+
+         super().__init__(
+             tokenizer_class=tokenizer_class,
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
+ def _make_causal_mask(
+     input_ids_shape: Tuple[int, int], dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+ ) -> torch.Tensor:
+     """
+     Make causal mask used for bi-directional self-attention.
+     """
+     bsz, tgt_len = input_ids_shape
+     mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
+     mask_cond = torch.arange(mask.size(-1), device=device)
+     mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+     mask = mask.to(dtype)
+
+     if past_key_values_length > 0:
+         mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
+     return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
+
+
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None) -> torch.Tensor:
+     """
+     Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+     """
+     bsz, src_len = mask.size()
+     tgt_len = tgt_len if tgt_len is not None else src_len
+
+     expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
+     inverted_mask = 1.0 - expanded_mask
+
+     return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)  # type: ignore
+
+
113
+ class RotaryEmbedding(torch.nn.Module):
114
+ def __init__(
115
+ self, dim: int, max_position_embeddings: int = 2048, base: int = 10000, device: Optional[torch.device] = None
116
+ ) -> None:
117
+ super().__init__()
118
+
119
+ self.dim = dim
120
+ self.max_position_embeddings = max_position_embeddings
121
+ self.base = base
122
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
123
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
124
+
125
+ # Build here to make `torch.jit.trace` work.
126
+ self._set_cos_sin_cache(
127
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
128
+ )
129
+
130
+ def _set_cos_sin_cache(self, seq_len: int, device: Any, dtype: Any) -> None:
131
+ self.max_seq_len_cached = seq_len
132
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype) # type: ignore
133
+
134
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
135
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
136
+ emb = torch.cat((freqs, freqs), dim=-1)
137
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
138
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
139
+
140
+ def forward(self, x: torch.Tensor, seq_len: int) -> Tuple[torch.Tensor, torch.Tensor]:
141
+ # x: [bs, num_attention_heads, seq_len, head_size]
142
+ if seq_len > self.max_seq_len_cached:
143
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
144
+
145
+ return (
146
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
147
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
148
+ )
149
+
150
+
151
+ def _rotate_half(x: torch.Tensor) -> torch.Tensor:
152
+ """Rotates half the hidden dims of the input."""
153
+ x1 = x[..., : x.shape[-1] // 2]
154
+ x2 = x[..., x.shape[-1] // 2 :]
155
+ return torch.cat((-x2, x1), dim=-1)
156
+
157
+
158
+ def _rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
159
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
160
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
161
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
162
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
163
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
164
+ x_embed = (x * cos) + (_rotate_half(x) * sin)
165
+ return x_embed
166
+
167
+
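To make the "rotate half" formulation above concrete, here is a small pure-Python sketch for a single head vector. It assumes an even head dimension and mirrors `_rotate_half` and the `torch.cat((freqs, freqs), dim=-1)` layout; `apply_rotary` and `dot` are hypothetical helper names, not functions from this repository.

```python
import math
from typing import List


def rotate_half(x: List[float]) -> List[float]:
    """Pure-Python analogue of _rotate_half: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate one head vector by its position, pairing element i with element i + dim/2."""
    dim = len(x)
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * f for f in inv_freq]
    emb = angles + angles  # same layout as torch.cat((freqs, freqs), dim=-1)
    cos = [math.cos(a) for a in emb]
    sin = [math.sin(a) for a in emb]
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]
# Both pairs of positions are 2 apart, so the attention scores should agree.
score_a = dot(apply_rotary(q, 3), apply_rotary(k, 1))
score_b = dot(apply_rotary(q, 7), apply_rotary(k, 5))
print(abs(score_a - score_b) < 1e-9)
```

The check at the end illustrates why the encoding is useful: the query-key score depends only on the relative offset between the two positions, not on their absolute values.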
168
+ class RMSNorm(nn.Module):
169
+ def __init__(self, hidden_size: int, eps: float = 1e-6) -> None:
170
+ super().__init__()
171
+ self.weight = nn.Parameter(torch.ones(hidden_size))
172
+ self.variance_epsilon = eps
173
+
174
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
175
+ input_dtype = hidden_states.dtype
176
+ hidden_states = hidden_states.to(torch.float32)
177
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
178
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
179
+ return self.weight * hidden_states.to(input_dtype)
180
+
181
+
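As a rough illustration of what `RMSNorm.forward` computes, the sketch below normalizes a single vector in pure Python. The function name `rms_norm` is an assumption made for this example; the real module works on tensors and upcasts to `float32` first.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then apply a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)  # hidden_states.pow(2).mean(-1)
    scale = 1.0 / math.sqrt(mean_sq + eps)    # torch.rsqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)
```

Unlike LayerNorm, no mean is subtracted and no bias is added, so with unit weights the output has a mean square of approximately 1.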
182
+ class Attention(torch.nn.Module):
183
+ def __init__(self, config: PlamoConfig) -> None:
184
+ super().__init__()
185
+ self.config = config
186
+ self.hidden_size = config.hidden_size
187
+ head_dim = self.hidden_size // config.num_attention_heads
188
+ self.max_position_embeddings = config.max_position_embeddings
189
+
190
+ self.q_num_heads = config.num_attention_heads
191
+ self.qk_dim = self.v_dim = head_dim
192
+ self.k_num_heads = self.v_num_heads = int(np.ceil(self.q_num_heads / config.n_shared_head))
193
+
194
+ self.q_proj = nn.Linear(self.hidden_size, self.q_num_heads * self.qk_dim, bias=False)
195
+ self.k_proj = nn.Linear(self.hidden_size, self.k_num_heads * self.qk_dim, bias=False)
196
+ self.v_proj = nn.Linear(self.hidden_size, self.v_num_heads * self.v_dim, bias=False)
197
+ self.o_proj = nn.Linear(self.q_num_heads * self.v_dim, self.hidden_size, bias=False)
198
+ self.rotary_emb = RotaryEmbedding(self.qk_dim, max_position_embeddings=self.max_position_embeddings)
199
+
200
+ def forward(
201
+ self,
202
+ hidden_states: torch.Tensor,
203
+ attention_mask: Optional[torch.Tensor] = None,
204
+ position_ids: Optional[torch.Tensor] = None,
205
+ past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
206
+ output_attentions: bool = False,
207
+ use_cache: bool = False,
208
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
209
+ bsz, q_len, _ = hidden_states.size()
210
+
211
+ query_states = self.q_proj(hidden_states).view(bsz, q_len, self.q_num_heads, self.qk_dim).transpose(1, 2)
212
+ key_states = self.k_proj(hidden_states).view(bsz, q_len, self.k_num_heads, self.qk_dim).transpose(1, 2)
213
+ value_states = self.v_proj(hidden_states).view(bsz, q_len, self.v_num_heads, self.v_dim).transpose(1, 2)
214
+
215
+ def _expand_kv(t: torch.Tensor, repeat: int, target: int) -> torch.Tensor:
216
+ return t.repeat(1, repeat, 1, 1)[:, :target]
217
+
218
+ # expand shared kv
219
+ assert self.k_num_heads == self.v_num_heads
220
+ key_states = _expand_kv(key_states, self.config.n_shared_head, self.q_num_heads)
221
+ value_states = _expand_kv(value_states, self.config.n_shared_head, self.q_num_heads)
222
+
223
+ kv_seq_len = key_states.shape[-2]
224
+ if past_key_value is not None:
225
+ kv_seq_len += past_key_value[0].shape[-2]
226
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
227
+ assert position_ids is not None
228
+ query_states = _rotary_pos_emb(query_states, cos, sin, position_ids)
229
+ key_states = _rotary_pos_emb(key_states, cos, sin, position_ids)
230
+ # [bsz, nh, t, hd]
231
+
232
+ if past_key_value is not None:
233
+ # reuse k, v, self_attention
234
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
235
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
236
+
237
+ past_key_value = (key_states, value_states) if use_cache else None
238
+
239
+ attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask=attention_mask)
240
+ attn_output = attn_output.transpose(1, 2)
241
+
242
+ attn_output = attn_output.reshape(bsz, q_len, self.q_num_heads * self.v_dim)
243
+ attn_output = self.o_proj(attn_output)
244
+
245
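+ # F.scaled_dot_product_attention does not expose attention weights, so always return None instead of leaving attn_weights undefined
246
+ attn_weights = None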
247
+
248
+ return attn_output, attn_weights, past_key_value
249
+
250
+
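The shared key/value expansion in `Attention.forward` tiles the KV heads and then truncates, which determines which KV head each query head reads. The sketch below models heads as list entries to show that mapping; `expand_kv_heads` is a hypothetical stand-in for the nested `_expand_kv` helper, and the head counts are illustrative values rather than ones taken from the model's config.

```python
import math
from typing import List


def expand_kv_heads(kv_heads: List[str], repeat: int, target: int) -> List[str]:
    """List analogue of t.repeat(1, repeat, 1, 1)[:, :target] along the head axis."""
    return (kv_heads * repeat)[:target]  # tile the whole block of heads, then truncate


q_num_heads = 8    # illustrative value
n_shared_head = 4  # illustrative value
k_num_heads = int(math.ceil(q_num_heads / n_shared_head))

kv = [f"kv{i}" for i in range(k_num_heads)]
expanded = expand_kv_heads(kv, n_shared_head, q_num_heads)
print(expanded)
```

Because `Tensor.repeat` tiles rather than interleaves, query head `i` ends up paired with KV head `i % k_num_heads`. Implementations that use `repeat_interleave` group consecutive query heads instead, so weights are not interchangeable between the two layouts without reordering.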
251
+ class MLP(nn.Module):
252
+ def __init__(self, config: PlamoConfig) -> None:
253
+ super().__init__()
254
+ self.config = config
255
+ self.hidden_size = config.hidden_size
256
+ self.intermediate_size = config.intermediate_size
257
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
258
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
259
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
260
+ self.act_fn = torch.nn.functional.silu
261
+
262
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
263
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) # type: ignore
264
+
265
+
266
+ class PlamoDecoderLayer(torch.nn.Module):
267
+ def __init__(self, config: PlamoConfig) -> None:
268
+ super().__init__()
269
+ self.config = config
270
+ self.hidden_size = config.hidden_size
271
+ self.self_attn = Attention(config)
272
+ self.mlp = MLP(config)
273
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
274
+
275
+ def forward(
276
+ self,
277
+ hidden_states: torch.Tensor,
278
+ attention_mask: Optional[torch.Tensor] = None,
279
+ position_ids: Optional[torch.LongTensor] = None,
280
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
281
+ output_attentions: Optional[bool] = False,
282
+ use_cache: Optional[bool] = False,
283
+ ) -> Tuple[Any, ...]:
284
+ # from LlamaDecoder
285
+ residual = hidden_states
286
+
287
+ hidden_states = self.norm(hidden_states)
288
+
289
+ # Self Attention
290
+ hidden_states_sa, self_attn_weights, present_key_value = self.self_attn(
291
+ hidden_states=hidden_states,
292
+ attention_mask=attention_mask,
293
+ position_ids=position_ids,
294
+ past_key_value=past_key_value,
295
+ output_attentions=output_attentions,
296
+ use_cache=use_cache,
297
+ )
298
+
299
+ # Fully Connected
300
+ hidden_states_mlp = self.mlp(hidden_states)
301
+
302
+ # Residual
303
+ hidden_states = residual + hidden_states_sa + hidden_states_mlp
304
+
305
+ outputs: Any = (hidden_states,)
306
+
307
+ if output_attentions:
308
+ outputs += (self_attn_weights,)
309
+
310
+ if use_cache:
311
+ outputs += (present_key_value,)
312
+
313
+ return outputs # type: ignore
314
+
315
+
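As written above, `PlamoDecoderLayer.forward` feeds the same normalized input to both the attention and MLP branches and adds both results to the residual in one step. The following toy sketch shows that data flow with plain lists; `parallel_block` and the three lambda stand-ins are assumptions for illustration only.

```python
from typing import Callable, List

Vector = List[float]


def parallel_block(
    x: Vector,
    norm: Callable[[Vector], Vector],
    attn: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
) -> Vector:
    """Compute x + attn(norm(x)) + mlp(norm(x)), with a single shared normalization."""
    h = norm(x)
    a = attn(h)  # attention branch reads the normalized input
    m = mlp(h)   # MLP branch reads the same normalized input, not the attention output
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]


# Toy stand-ins so the arithmetic is easy to follow by hand.
result = parallel_block(
    [1.0, 2.0],
    norm=lambda v: [2.0 * t for t in v],
    attn=lambda v: [t + 1.0 for t in v],
    mlp=lambda v: [3.0 * t for t in v],
)
print(result)
```

This differs from a sequential block, where the MLP would consume the output of the attention residual after a second normalization.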
316
+ class PlamoDecoder(torch.nn.Module):
317
+ def __init__(self, config: PlamoConfig) -> None:
318
+ super().__init__()
319
+ self.layers = torch.nn.ModuleList([PlamoDecoderLayer(config) for _ in range(config.num_hidden_layers)])
320
+
321
+ def forward(self, x: DecoderInput) -> DecoderOutput:
322
+ all_hidden_states: Optional[Tuple[torch.Tensor, ...]] = () if x.output_hidden_states else None
323
+ all_self_attns: Optional[Tuple[torch.Tensor, ...]] = () if x.output_attentions else None
324
+ next_decoder_cache: Optional[Tuple[torch.Tensor, ...]] = () if x.use_cache else None
325
+ hidden_states = x.hidden_states
326
+
327
+ for idx, decoder_layer in enumerate(self.layers):
328
+ if x.output_hidden_states:
329
+ assert all_hidden_states is not None
330
+ all_hidden_states += (hidden_states,)
331
+
332
+ past_key_value = x.past_key_values[idx] if x.past_key_values is not None else None
333
+
334
+ if self.training and x.gradient_checkpointing:
335
+
336
+ def create_custom_forward(module): # type: ignore
337
+ def custom_forward(*inputs): # type: ignore
338
+ # None for past_key_value
339
+ return module(*inputs, x.output_attentions, None)
340
+
341
+ return custom_forward
342
+
343
+ layer_outputs = torch.utils.checkpoint.checkpoint(
344
+ create_custom_forward(decoder_layer), # type: ignore
345
+ hidden_states,
346
+ x.attention_mask,
347
+ x.position_ids,
348
+ None,
349
+ )
350
+ else:
351
+ layer_outputs = decoder_layer(
352
+ hidden_states,
353
+ attention_mask=x.attention_mask,
354
+ position_ids=x.position_ids,
355
+ past_key_value=past_key_value,
356
+ output_attentions=x.output_attentions,
357
+ use_cache=x.use_cache,
358
+ )
359
+
360
+ hidden_states = layer_outputs[0]
361
+
362
+ if x.use_cache:
363
+ cache = layer_outputs[2 if x.output_attentions else 1]
364
+ assert cache is not None
365
+ assert next_decoder_cache is not None
366
+ next_decoder_cache += (cache,)
367
+
368
+ if x.output_attentions:
369
+ assert layer_outputs[1] is not None
370
+ assert all_self_attns is not None
371
+ all_self_attns += (layer_outputs[1],)
372
+ return DecoderOutput(hidden_states, all_hidden_states, all_self_attns, next_decoder_cache)
373
+
374
+
375
+ class PlamoPreTrainedModel(PreTrainedModel): # type: ignore
376
+ config_class = PlamoConfig
377
+ _no_split_modules: List[str]
378
+ base_model_prefix = "model"
379
+ supports_gradient_checkpointing = True
380
+ _no_split_modules = ["PlamoDecoderLayer"]
381
+ _skip_keys_device_placement = "past_key_values"
382
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
383
+
384
+ def _init_weights(self, module: torch.nn.Module) -> None:
385
+ std = self.config.initializer_range
386
+ if isinstance(module, nn.Linear):
387
+ module.weight.data.normal_(mean=0.0, std=std)
388
+ if module.bias is not None:
389
+ module.bias.data.zero_()
390
+ elif isinstance(module, nn.Embedding):
391
+ module.weight.data.normal_(mean=0.0, std=std)
392
+ if module.padding_idx is not None:
393
+ module.weight.data[module.padding_idx].zero_()
394
+
395
+ def _set_gradient_checkpointing(self, module: torch.nn.Module, value: bool = False) -> None:
396
+ module.gradient_checkpointing = value # type: ignore
397
+
398
+
399
+ class PlamoModel(PlamoPreTrainedModel):
400
+ def __init__(self, config: PlamoConfig):
401
+ super().__init__(config)
402
+ self.padding_idx = config.pad_token_id
403
+ self.vocab_size = config.vocab_size
404
+
405
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
406
+ self.layers = PlamoDecoder(config) # type: ignore
407
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
408
+
409
+ self.gradient_checkpointing = False
410
+ # Initialize weights and apply final processing
411
+ self.post_init()
412
+
413
+ def get_input_embeddings(self) -> torch.nn.Embedding:
414
+ return self.embed_tokens
415
+
416
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
417
+ self.embed_tokens = value
418
+
419
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
420
+ def _prepare_decoder_attention_mask(
421
+ self,
422
+ attention_mask: torch.Tensor,
423
+ input_shape: Tuple[int, int],
424
+ inputs_embeds: Optional[torch.FloatTensor],
425
+ past_key_values_length: int,
426
+ ) -> Optional[torch.Tensor]:
427
+ # create causal mask
428
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
429
+ combined_attention_mask: Optional[torch.Tensor] = None
430
+ if input_shape[-1] > 1:
431
+ assert inputs_embeds is not None
432
+ combined_attention_mask = _make_causal_mask(
433
+ input_shape,
434
+ inputs_embeds.dtype,
435
+ device=inputs_embeds.device,
436
+ past_key_values_length=past_key_values_length,
437
+ )
438
+
439
+ if attention_mask is not None:
440
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
441
+ assert inputs_embeds is not None
442
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
443
+ inputs_embeds.device
444
+ )
445
+ combined_attention_mask = (
446
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
447
+ )
448
+
449
+ return combined_attention_mask
450
+
451
+ def forward(
452
+ self,
453
+ input_ids: Optional[torch.LongTensor] = None,
454
+ attention_mask: Optional[torch.Tensor] = None,
455
+ position_ids: Optional[torch.Tensor] = None,
456
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
457
+ inputs_embeds: Optional[torch.FloatTensor] = None,
458
+ use_cache: Optional[bool] = None,
459
+ output_attentions: Optional[bool] = None,
460
+ output_hidden_states: Optional[bool] = None,
461
+ return_dict: Optional[bool] = None,
462
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
463
+ assert input_ids is not None
464
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
465
+ output_hidden_states = (
466
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
467
+ )
468
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
469
+
470
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
471
+
472
+ # retrieve input_ids and inputs_embeds
473
+ if input_ids is not None and inputs_embeds is not None:
474
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
475
+ elif input_ids is not None:
476
+ batch_size, seq_length = input_ids.shape
477
+ else:
478
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
479
+
480
+ seq_length_with_past = seq_length
481
+ past_key_values_length = 0
482
+
483
+ if past_key_values is not None:
484
+ past_key_values_length = past_key_values[0][0].shape[2]
485
+ seq_length_with_past = seq_length_with_past + past_key_values_length
486
+
487
+ if position_ids is None:
488
+ device = input_ids.device
489
+ position_ids = torch.arange(
490
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
491
+ )
492
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
493
+ else:
494
+ position_ids = position_ids.view(-1, seq_length).long()
495
+
496
+ if inputs_embeds is None:
497
+ inputs_embeds = self.embed_tokens(input_ids)
498
+ # embed positions
499
+ if attention_mask is None:
500
+ attention_mask = torch.ones(
501
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
502
+ )
503
+ attention_mask = self._prepare_decoder_attention_mask(
504
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
505
+ )
506
+
507
+ hidden_states = inputs_embeds
508
+
509
+ if self.gradient_checkpointing and self.training:
510
+ if use_cache:
511
+ use_cache = False
512
+
513
+ # decoder layers
514
+ out = self.layers(
515
+ DecoderInput(
516
+ hidden_states,
517
+ position_ids,
518
+ attention_mask,
519
+ past_key_values,
520
+ output_hidden_states,
521
+ output_attentions,
522
+ use_cache,
523
+ self.gradient_checkpointing,
524
+ )
525
+ )
526
+ assert isinstance(out, DecoderOutput)
527
+ hidden_states = out.hidden_states
528
+ all_hidden_states = out.all_hidden_states
529
+ all_self_attns = out.all_self_attns
530
+ next_decoder_cache = out.next_decoder_cache
531
+
532
+ hidden_states = self.norm(hidden_states)
533
+
534
+ # add hidden states from the last decoder layer
535
+ if output_hidden_states:
536
+ assert all_hidden_states is not None
537
+ all_hidden_states += (hidden_states,)
538
+
539
+ next_cache = next_decoder_cache if use_cache else None
540
+ if not return_dict:
541
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
542
+ return BaseModelOutputWithPast(
543
+ last_hidden_state=hidden_states,
544
+ past_key_values=next_cache,
545
+ hidden_states=all_hidden_states,
546
+ attentions=all_self_attns,
547
+ )
548
+
549
+
550
+ class PlamoForCausalLM(PlamoPreTrainedModel):
551
+ def __init__(self, config: PretrainedConfig) -> None:
552
+ super().__init__(config)
553
+ self.model = PlamoModel(config)
554
+
555
+ self.lm_head: torch.nn.Module = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
556
+
557
+ # Initialize weights and apply final processing
558
+ self.post_init()
559
+
560
+ def get_input_embeddings(self) -> torch.nn.Embedding:
561
+ return self.model.embed_tokens
562
+
563
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
564
+ self.model.embed_tokens = value
565
+
566
+ def get_output_embeddings(self) -> torch.nn.Module:
567
+ return self.lm_head
568
+
569
+ def set_output_embeddings(self, new_embeddings: torch.nn.Module) -> None:
570
+ self.lm_head = new_embeddings
571
+
572
+ def set_decoder(self, decoder: PlamoModel) -> None:
573
+ self.model = decoder
574
+
575
+ def get_decoder(self) -> PlamoModel:
576
+ return self.model
577
+
578
+ def forward( # type: ignore
579
+ self,
580
+ input_ids: Optional[torch.LongTensor] = None,
581
+ attention_mask: Optional[torch.Tensor] = None,
582
+ position_ids: Optional[torch.Tensor] = None,
583
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
584
+ inputs_embeds: Optional[torch.FloatTensor] = None,
585
+ labels: Optional[torch.LongTensor] = None,
586
+ use_cache: Optional[bool] = None,
587
+ output_attentions: Optional[bool] = None,
588
+ output_hidden_states: Optional[bool] = None,
589
+ return_dict: Optional[bool] = None,
590
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
591
+ r"""
592
+ Args:
593
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
594
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
595
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
596
+ (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
597
+
598
+ Returns:
599
+
600
+ Example:
601
+
602
+ ```python
603
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
604
+
605
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
606
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
607
+
608
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
609
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
610
+
611
+ >>> # Generate
612
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
613
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
614
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
615
+ ```"""
616
+ assert input_ids is not None
617
+
618
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
619
+ output_hidden_states = (
620
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
621
+ )
622
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
623
+
624
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
625
+ outputs = self.model(
626
+ input_ids=input_ids,
627
+ attention_mask=attention_mask,
628
+ position_ids=position_ids,
629
+ past_key_values=past_key_values,
630
+ inputs_embeds=inputs_embeds,
631
+ use_cache=use_cache,
632
+ output_attentions=output_attentions,
633
+ output_hidden_states=output_hidden_states,
634
+ return_dict=return_dict,
635
+ )
636
+
637
+ hidden_states = outputs[0]
638
+ logits = self.lm_head(hidden_states)
639
+
640
+ loss = None
641
+ if labels is not None:
642
+ # Shift so that tokens < n predict n
643
+ shift_logits = logits[..., :-1, :].contiguous()
644
+ shift_labels = labels[..., 1:].contiguous()
645
+ # Flatten the tokens
646
+ loss_fct = nn.CrossEntropyLoss()
647
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
648
+ shift_labels = shift_labels.view(-1)
649
+ # Enable model parallelism
650
+ shift_labels = shift_labels.to(shift_logits.device)
651
+ loss = loss_fct(shift_logits, shift_labels)
652
+
653
+ if not return_dict:
654
+ output = (logits,) + outputs[1:]
655
+ return (loss,) + output if loss is not None else output
656
+
657
+ return CausalLMOutputWithPast(
658
+ loss=loss,
659
+ logits=logits,
660
+ past_key_values=outputs.past_key_values,
661
+ hidden_states=outputs.hidden_states,
662
+ attentions=outputs.attentions,
663
+ )
664
+
665
+ def prepare_inputs_for_generation(
666
+ self,
667
+ input_ids: torch.Tensor,
668
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
669
+ attention_mask: Optional[torch.Tensor] = None,
670
+ inputs_embeds: Optional[torch.Tensor] = None,
671
+ **kwargs: Any,
672
+ ) -> Dict[str, Any]:
673
+ if past_key_values:
674
+ input_ids = input_ids[:, -1:]
675
+
676
+ position_ids = kwargs.get("position_ids", None)
677
+ if attention_mask is not None and position_ids is None:
678
+ # create position_ids on the fly for batch generation
679
+ position_ids = attention_mask.long().cumsum(-1) - 1
680
+ position_ids.masked_fill_(attention_mask == 0, 1)
681
+ if past_key_values:
682
+ position_ids = position_ids[:, -1].unsqueeze(-1)
683
+
684
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
685
+ if inputs_embeds is not None and past_key_values is None:
686
+ model_inputs: Dict[str, Any] = {"inputs_embeds": inputs_embeds}
687
+ else:
688
+ model_inputs = {"input_ids": input_ids}
689
+
690
+ model_inputs.update(
691
+ {
692
+ "position_ids": position_ids,
693
+ "past_key_values": past_key_values,
694
+ "use_cache": kwargs.get("use_cache"),
695
+ "attention_mask": attention_mask,
696
+ }
697
+ )
698
+ return model_inputs
699
+
700
+ @staticmethod
701
+ def _reorder_cache(past_key_values: List[torch.FloatTensor], beam_idx: torch.LongTensor) -> Tuple[Any, ...]:
702
+ reordered_past: Tuple[Any, ...] = ()
703
+ for layer_past in past_key_values:
704
+ reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
705
+ return reordered_past
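The on-the-fly `position_ids` computation in `prepare_inputs_for_generation` is easiest to follow on a left-padded batch. This is a pure-Python sketch of the same arithmetic (`cumsum(-1) - 1`, then filling padded slots with 1); `position_ids_from_mask` is a hypothetical helper name and the masks are made-up inputs.

```python
from typing import List


def position_ids_from_mask(attention_mask: List[List[int]]) -> List[List[int]]:
    """Mirror attention_mask.long().cumsum(-1) - 1 followed by masked_fill_(mask == 0, 1)."""
    batch = []
    for row in attention_mask:
        running = 0
        ids = []
        for m in row:
            running += m  # cumulative count of real tokens so far
            ids.append(running - 1 if m != 0 else 1)  # padded slots get the dummy value 1
        batch.append(ids)
    return batch


mask = [
    [0, 0, 1, 1, 1],  # two left-padding tokens, then three real tokens
    [1, 1, 1, 1, 1],  # no padding
]
position_ids = position_ids_from_mask(mask)
print(position_ids)

# With a populated cache, only the last column is kept for the single new token.
last_step = [[row[-1]] for row in position_ids]
print(last_step)
```

Real tokens are numbered from 0 regardless of how much padding precedes them, so rotary embeddings see the same positions for a prompt whether or not it is padded in a batch.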
pytorch_model-00001-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9f70a7773f17511aa6dd08f5d2175b4d03dfa6319dd169bba41fe55577666b31
3
+ size 9953812631
pytorch_model-00002-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1da04d8f13b66bd506926b75ed95f2359c1b365c1147e776eb0bd0278e9938fe
3
+ size 9896143321
pytorch_model-00003-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:600546291f28e1d6fd77442909ebf30f353c5ad2bbfee887b70ac0c0361aed29
3
+ size 6349271849
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "cls_token": "<cls>",
4
+ "eos_token": "</s>",
5
+ "mask_token": "<mask>",
6
+ "pad_token": "<pad>",
7
+ "sep_token": "<sep>",
8
+ "unk_token": "<unk>"
9
+ }
tokenization_plamo.py ADDED
@@ -0,0 +1,159 @@
1
+ import os
2
+ from shutil import copyfile
3
+ from typing import Any, Dict, List, Optional, Tuple
4
+
5
+ import sentencepiece as spm
6
+ import transformers
7
+ from transformers.tokenization_utils import PreTrainedTokenizer
8
+ from transformers.utils import logging
9
+
10
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
11
+ logger = logging.get_logger(__name__)
12
+
13
+
14
+ class PlamoTokenizer(PreTrainedTokenizer): # type: ignore
15
+ vocab_files_names = VOCAB_FILES_NAMES
16
+ model_input_names = ["input_ids", "attention_mask"]
17
+
18
+ def __init__(
19
+ self,
20
+ vocab_file: str,
21
+ unk_token: str = "<unk>",
22
+ bos_token: str = "<s>",
23
+ eos_token: str = "</s>",
24
+ pad_token: str = "<pad>",
25
+ cls_token: str = "<cls>",
26
+ sep_token: str = "<sep>",
27
+ mask_token: str = "<mask>",
28
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
29
+ clean_up_tokenization_spaces: bool = False,
30
+ **kwargs: Any,
31
+ ) -> None:
32
+ if "add_bos_token" not in kwargs:
33
+ kwargs["add_bos_token"] = False
34
+ if "add_eos_token" not in kwargs:
35
+ kwargs["add_eos_token"] = False
36
+
37
+ super().__init__(
38
+ vocab_file=vocab_file,
39
+ unk_token=unk_token,
40
+ bos_token=bos_token,
41
+ eos_token=eos_token,
42
+ pad_token=pad_token,
43
+ cls_token=cls_token,
44
+ sep_token=sep_token,
45
+ mask_token=mask_token,
46
+ sp_model_kwargs=sp_model_kwargs,
47
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
48
+ **kwargs,
49
+ )
50
+
51
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
52
+ self.vocab_file = vocab_file
53
+ self.add_bos_token = kwargs["add_bos_token"]
54
+ self.add_eos_token = kwargs["add_eos_token"]
55
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
56
+ self.sp_model.Load(vocab_file)
57
+
58
+ self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
59
+
60
+ # the functions below are copied from hf transformers LlamaTokenizer's implementation to fix the behaviour of the tokenizer
61
+ # https://github.com/huggingface/transformers/blob/v4.30.2/src/transformers/models/llama/tokenization_llama.py
62
+
63
+ def __getstate__(self) -> dict[str, Any]:
64
+ state = self.__dict__.copy()
65
+ state["sp_model"] = None
66
+ state["sp_model_proto"] = self.sp_model.serialized_model_proto()
67
+ return state
68
+
69
+ def __setstate__(self, d: dict[str, Any]) -> None:
70
+ self.__dict__ = d
71
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
72
+ self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
73
+
74
+ @property
75
+ def vocab_size(self) -> Any:
76
+ """Returns vocab size"""
77
+ return self.sp_model.get_piece_size()
78
+
79
+ def get_vocab(self) -> dict[str, int]:
80
+ """Returns vocab as a dict"""
81
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
82
+ vocab.update(self.added_tokens_encoder)
83
+ return vocab
84
+
85
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
86
+ """Converts a sequence of tokens (strings) into a single string."""
87
+ current_sub_tokens: List[str] = []
88
+ out_string = ""
89
+ prev_is_special = False
90
+ for i, token in enumerate(tokens):
91
+ # make sure that special tokens are not decoded using sentencepiece model
92
+ if token in self.all_special_tokens:
93
+ if not prev_is_special and i != 0:
94
+ out_string += " "
95
+ out_string += self.sp_model.decode(current_sub_tokens) + token
96
+ prev_is_special = True
97
+ current_sub_tokens = []
98
+ else:
99
+ current_sub_tokens.append(token)
100
+ prev_is_special = False
101
+ out_string += self.sp_model.decode(current_sub_tokens)
102
+ return out_string
103
+
104
+ def _tokenize(self, text: str) -> Any:
105
+ """Returns a tokenized string."""
106
+ return self.sp_model.encode(text, out_type=str)
107
+
108
+ def _convert_token_to_id(self, token: str) -> Any:
109
+ """Converts a token (str) to an id using the vocab."""
110
+ return self.sp_model.piece_to_id(token)
111
+
112
+ def _convert_id_to_token(self, index: int) -> Any:
113
+ """Converts an index (integer) to a token (str) using the vocab."""
114
+ token = self.sp_model.IdToPiece(index)
115
+ return token
116
+
117
+ def build_inputs_with_special_tokens(
118
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
119
+ ) -> List[int]:
120
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
121
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
122
+
123
+ output = bos_token_id + token_ids_0 + eos_token_id
124
+
125
+ if token_ids_1 is not None:
126
+ output = output + bos_token_id + token_ids_1 + eos_token_id
127
+
128
+ return output
129
+
130
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
131
+ """
132
+ Save the vocabulary and special tokens file to a directory.
133
+
134
+ Args:
135
+ save_directory (`str`):
136
+ The directory in which to save the vocabulary.
137
+
138
+ Returns:
139
+ `Tuple(str)`: Paths to the files saved.
140
+ """
141
+ if not os.path.isdir(save_directory):
142
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
143
+ return ("",)
144
+ out_vocab_file = os.path.join(
145
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
146
+ )
147
+
148
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
149
+ copyfile(self.vocab_file, out_vocab_file)
150
+ elif not os.path.isfile(self.vocab_file):
151
+ with open(out_vocab_file, "wb") as fi:
152
+ content_spiece_model = self.sp_model.serialized_model_proto()
153
+ fi.write(content_spiece_model)
154
+
155
+ return (out_vocab_file,)
156
+
157
+
158
+ class PlamoConfig(transformers.LlamaConfig): # type: ignore
159
+ model_type = "plamo"
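The effect of the `add_bos_token` and `add_eos_token` flags in `build_inputs_with_special_tokens` can be sketched without loading the SentencePiece model. The function below reimplements only that list logic; `build_inputs`, the ids `1` and `2`, and the sample token ids are all hypothetical values chosen for the example, not ids read from `tokenizer.model`.

```python
from typing import List, Optional


def build_inputs(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    *,
    bos_id: int = 1,        # hypothetical id
    eos_id: int = 2,        # hypothetical id
    add_bos: bool = False,  # matches the default set in PlamoTokenizer.__init__
    add_eos: bool = False,  # matches the default set in PlamoTokenizer.__init__
) -> List[int]:
    """Wrap each segment in optional BOS/EOS ids, as the tokenizer method does."""
    bos = [bos_id] if add_bos else []
    eos = [eos_id] if add_eos else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        output = output + bos + token_ids_1 + eos
    return output


plain = build_inputs([10, 11])
wrapped = build_inputs([10, 11], [20], add_bos=True, add_eos=True)
print(plain, wrapped)
```

With both flags left at `False`, as in the `tokenizer_config.json` of this commit, the token ids pass through unchanged, so callers who need a BOS token have to enable the flag or prepend it themselves.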
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59fe756bef3dc5d4813bd2eb9aeb7c39138cbd71e665bc85e6a4c10e766465da
3
+ size 1122464
tokenizer_config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "auto_map": {
5
+ "AutoTokenizer": [
6
+ "tokenization_plamo.PlamoTokenizer",
7
+ null
8
+ ]
9
+ },
10
+ "bos_token": "<s>",
11
+ "clean_up_tokenization_spaces": false,
12
+ "cls_token": "<cls>",
13
+ "eos_token": "</s>",
14
+ "mask_token": "<mask>",
15
+ "model_max_length": 1000000000000000019884624838656,
16
+ "pad_token": "<pad>",
17
+ "sep_token": "<sep>",
18
+ "sp_model_kwargs": null,
19
+ "tokenizer_class": "PlamoTokenizer",
20
+ "unk_token": "<unk>"
21
+ }