shailja committed on
Commit
a3489e4
·
1 Parent(s): 5900223

Create README.md

  1. README.md +127 -0
README.md ADDED
@@ -0,0 +1,127 @@
1
+ ---
2
+ pipeline_tag: text-generation
3
+ inference: true
4
+ widget:
5
+ - text: module display_hello_word
6
+ example_title: Hello world
7
+ group: Verilog
8
+ license: bigcode-openrail-m
9
+ datasets:
10
+ - shailja/Verilog_GitHub
11
+ library_name: transformers
12
+ tags:
13
+ - code
14
+ model-index:
15
+ - name: VeriGen
16
+ results:
17
+ - task:
18
+ type: text-generation
19
+ dataset:
20
+ type: openai_humaneval
21
+ name: VeriEval (Prompted)
22
+ metrics:
23
+ - name: pass@1
24
+ type: pass@1
25
+ value:
26
+ verified: false
27
+ extra_gated_prompt: >-
28
+ ## Model License Agreement
29
+
30
+ Please read the BigCode [OpenRAIL-M
31
+ license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
32
+ agreement before accepting it.
33
+
34
+ extra_gated_fields:
35
+ I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
36
+ ---
37
+
38
+
39
+ # VeriGen
40
+
41
+
42
+ ## Table of Contents
43
+
44
+ 1. [Model Summary](#model-summary)
45
+ 2. [Use](#use)
46
+ 3. [Limitations](#limitations)
47
+ 4. [Training](#training)
48
+ 5. [License](#license)
49
+ 6. [Citation](#citation)
50
+
51
+ ## Model Summary
52
+
53
+ VeriGen is a 6B-parameter model, a fine-tuned version of [CodeGen-multi-6B](https://github.com/salesforce/codegen) trained on a [Verilog code dataset](https://huggingface.co/datasets/shailja/Verilog_GitHub).
54
+
55
+ - **Repository:** [shailja-thakur/VGen](https://github.com/shailja-thakur/VGen)
56
+ - **Baseline LLM:** [Salesforce/CodeGen](https://github.com/salesforce/CodeGen)
57
+ - **Paper:** [Benchmarking Large Language Models for Automated Verilog RTL Code Generation](https://arxiv.org/abs/2212.11140)
58
+ - **Point of Contact:** [contact@shailja](mailto:[email protected])
59
+ - **Languages:** Verilog (Hardware Description Language)
60
+
61
+
62
+ ## Use
63
+
64
+ ### Intended use
65
+
66
+ The model was trained on Verilog from GitHub and textbooks. As such, it is _not_ an instruction model, and commands like "Write a module that implements a 2-to-1 Mux." do not work well. However, adding a partial module header such as "module mux" after the descriptive text in the prompt turns it into a capable Verilog teaching assistant.
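The prompting pattern described above can be sketched as a small helper. This is a minimal illustration, not part of the model's API: the function name `build_prompt` and the choice to pass the description as `//` line comments are assumptions made for the example.

```python
def build_prompt(description: str, module_name: str) -> str:
    """Combine a plain-text description with a partial module header.

    Hypothetical helper: each line of the description becomes a Verilog
    line comment, and the prompt ends with an unfinished "module <name>"
    line for the model to complete.
    """
    comment_lines = [f"// {line.strip()}" for line in description.strip().splitlines()]
    return "\n".join(comment_lines) + f"\nmodule {module_name}"


# Build a prompt for a 2-to-1 multiplexer instead of issuing a bare instruction.
prompt = build_prompt("2-to-1 multiplexer", "mux")
print(prompt)
```

The resulting string can be passed as `prompt` in the generation example that follows.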
67
+
68
+ **Feel free to share your generations in the Community tab!**
69
+
70
+ ### Generation
71
+ ```python
72
+ # pip install -q transformers
73
+ import torch
74
+ from transformers import AutoTokenizer, AutoModelForCausalLM
75
+ # Prompt
76
+ prompt = "//module half adder "
77
+ device = "cuda"  # assumes a CUDA-capable GPU is available
78
+ # Load model and tokenizer
79
+ model_name = "shailja/fine-tuned-codegen-6B-Verilog"
80
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
81
+ model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
82
+
83
+ # Sample
84
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
85
+ sample = model.generate(input_ids, max_length=128, do_sample=True, temperature=0.5, top_p=0.9)
86
+
87
+ print(tokenizer.decode(sample[0], truncate_before_pattern=[r"endmodule"]) + "endmodule")
88
+ ```
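The final line of the example above keeps only the text up to the first `endmodule` and then appends the keyword again. A minimal sketch of the same truncation applied to an already decoded string, in case it is more convenient to post-process plain text; the helper name `truncate_at_endmodule` is hypothetical:

```python
def truncate_at_endmodule(text: str) -> str:
    """Return text up to and including the first 'endmodule'.

    If the marker is absent (for example, generation stopped at the
    length limit), the text is returned unchanged.
    """
    marker = "endmodule"
    end = text.find(marker)
    if end == -1:
        return text
    return text[: end + len(marker)]


generated = "module half_adder(input a, b, output s, c);\nendmodule\nmodule extra;"
print(truncate_at_endmodule(generated))
```

Unlike the original `print` line, this sketch does not append `endmodule` when the model never produced it, which makes incomplete generations easier to spot.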
89
+
90
+
91
+ ### Attribution & Other Requirements
92
+
93
+ The pretraining dataset of the model was not filtered for permissive licenses only. As a result, the model can generate source code verbatim from the dataset. The license of that code might require attribution and/or impose other specific requirements that must be respected.
94
+
95
+ ## Limitations
96
+
97
+ The model has been trained on Verilog source code from open sources. The predominant natural language in the source code is English, although other languages are also present. As such, the model is capable of generating Verilog snippets given some context, but the generated code is not guaranteed to work as intended. It can be inefficient and may contain bugs or exploits. See [the paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) for an in-depth discussion of the model limitations.
98
+
99
+ ## Training
100
+
101
+ ### Model
102
+
103
+ - **Architecture:** GPT-2 model with multi-query attention
104
+ - **Pretraining steps:** 150k
105
+ - **Pretraining tokens:** ~72B
106
+ - **Precision:** fp16
107
+
108
+ ### Hardware
109
+
110
+ - **GPUs:** 4 NVIDIA A100
111
+ - **Training time:** 10 days
112
+
113
+
114
+ ## License
115
+ The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
116
+ ## Citation
117
+ ```
118
+ @misc{https://doi.org/10.48550/arxiv.2212.11140,
119
+ doi = {10.48550/ARXIV.2212.11140},
120
+ url = {https://arxiv.org/abs/2212.11140},
121
+ author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
122
+ title = {Benchmarking Large Language Models for Automated Verilog RTL Code Generation},
123
+ publisher = {arXiv},
124
+ year = {2022},
125
+ copyright = {arXiv.org perpetual, non-exclusive license}
126
+ }
127
+ ```