File size: 3,385 Bytes
b265248
 
 
 
 
 
 
 
58c1b6c
d1b2ad8
2835c8a
 
 
 
58c1b6c
 
2835c8a
d1b2ad8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7dfe10e
d1b2ad8
7dfe10e
d1b2ad8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7dfe10e
d1b2ad8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
frameworks:
- Pytorch
license: other
tasks:
- text-embedding
---


## CodeFuse-CGE-Large
<p align="center">
    <img src="https://modelscope.cn/api/v1/models/codefuse-ai/CodeFuse-QWen-14B/repo?Revision=master&FilePath=LOGO.jpg&View=true" width="800"/>
<p>

**Homepage**: 🏡 https://github.com/codefuse-ai/CodeFuse-CGE (**Please give us your support with a Star🌟 + Fork🚀 + Watch👀**)

## Model Description
CodeFuse-CGE-Large is the Large version of the CodeFuse-CGE family which is fine-tuned based on CodeQwen1.5-7B. CodeFuse-CGE-Large is distinguish on text2code task for it's powerful ability of capturing the semantic relationship between code and text.  
  
This model has the following notable features:  
● Instruction-tuning is enabled for both query and code snippet sides.  
● The model obtains sentence-level and code-level representations through a layer of cross-attention computation module.  
● The model has a smaller dimensional size without significant degradation in performance.

Model Configuration  
Model Size: 7B  
Embedding Dimension: 1024  
Hidden Layers: 32  
Max Input Tokens: 1024  


Requirements  
```
flash_attn==2.4.2
torch==2.1.0
accelerate==0.28.0
transformers==4.39.2 
vllm=0.5.3
```

## How to Use
### transformers
```
from transformers import AutoTokenizer, AutoModel
import torch

model_name_or_path = "codefuse-ai/CodeFuse-CGE-Large"
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, truncation_side='right', padding_side='right')

if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'
model.to(device)

prefix_dict =  {'python':{'query':'Retrieve the Python code that solves the following query:', 'passage':'Python code:'},
                'java':{'query':'Retrieve the Java code that solves the following query:', 'passage':'Java code:'},
                'go':{'query':'Retrieve the Go code that solves the following query:', 'passage':'Go code:'},
                'c++':{'query':'Retrieve the C++ code that solves the following query:', 'passage':'C++ code:'},
                'javascript':{'query':'Retrieve the Javascript code that solves the following query:', 'passage':'Javascript code:'},
                'php':{'query':'Retrieve the PHP code that solves the following query:', 'passage':'PHP code:'},
                'ruby':{'query':'Retrieve the Ruby code that solves the following query:', 'passage':'Ruby code:'},
                'default':{'query':'Retrieve the code that solves the following query:', 'passage':'Code:'}
                }

text = ["Writes a Boolean to the stream.",
        "def writeBoolean(self, n): t = TYPE_BOOL_TRUE if n is False: t = TYPE_BOOL_FALSE self.stream.write(t)"]
text[0] += prefix_dict['python']['query']
text[1] += prefix_dict['python']['passage']
embed = model.encode(tokenizer, text)
score = embed[0] @ embed[1].T
print("score", score)

```

## Benchmark the Performance
We use MRR metric to evaluate the ability on text2code retrieval tasks: AdvTest, CosQA, CSN


![result](./result.png)

## Acknowledgement
Thanks to the authors of open-sourced datasets, including CSN, Adv, CoSQA.

## License
Since CodeFuse-CGE-Large is fine-tuned based on CodeQwen1.5-7B model, our usage license follows the same terms as that of CodeQwen1.5-7B model.