Commit d672216 (verified), committed by codesage
1 Parent(s): 2acf928

Update README.md

Files changed (1)
  1. README.md +16 -6
README.md CHANGED
@@ -9,6 +9,10 @@ language:
9
 
10
  ## CodeSage-Large
11
 
 
 
 
 
12
  ### Model description
13
  CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
14
 
@@ -21,25 +25,31 @@ This checkpoint is trained on the Stack data (https://huggingface.co/datasets/bi
21
  ### Training procedure
22
  This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
23
 
24
- ### How to use
25
- This checkpoint consists of an encoder (1.3B model), which can be used to extract code embeddings of 2048 dimension. It can be easily loaded using the AutoModel functionality and employs the Starcoder tokenizer (https://arxiv.org/pdf/2305.06161.pdf).
26
 
 
 
27
  ```
28
  from transformers import AutoModel, AutoTokenizer
29
 
30
  checkpoint = "codesage/codesage-large"
31
  device = "cuda" # for GPU usage or "cpu" for CPU usage
32
 
33
- # Note: CodeSage requires adding eos token at the end of
34
- # each tokenized sequence to ensure good performance
35
  tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
36
 
37
  model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
38
 
39
  inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
40
  embedding = model(inputs)[0]
41
- print(f'Dimension of the embedding: {embedding[0].size()}')
42
- # Dimension of the embedding: torch.Size([14, 2048])
 
 
 
 
43
  ```
44
 
45
  ### BibTeX entry and citation info
 
9
 
10
  ## CodeSage-Large
11
 
12
+ ### Updates
13
+ * [12/2024] <span style="color:blue">We are excited to announce the release of the CodeSage V2 model family with largely improved performance and flexible embedding dimensions!</span> Please check out our [models](https://huggingface.co/codesage) and [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details.
14
+ * [11/2024] You can now access CodeSage models through SentenceTransformer.
15
+
16
  ### Model description
17
  CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
18
 
 
25
  ### Training procedure
26
  This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
27
 
28
+ ### How to Use
29
+ This checkpoint consists of an encoder (1.3B model), which can be used to extract code embeddings of 1024 dimension.
30
 
31
+ 1. Accessing CodeSage via HuggingFace: it can be easily loaded using the AutoModel functionality and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
32
+
33
  ```
34
  from transformers import AutoModel, AutoTokenizer
35
 
36
  checkpoint = "codesage/codesage-large"
37
  device = "cuda" # for GPU usage or "cpu" for CPU usage
38
 
39
+ # Note: CodeSage requires adding eos token at the end of each tokenized sequence
40
+
41
  tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
42
 
43
  model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
44
 
45
  inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
46
  embedding = model(inputs)[0]
47
+ ```
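The snippet above yields one vector per token (`model(inputs)[0]`), whereas many retrieval setups want a single vector per snippet. One common way to get there is mean pooling; the following is a minimal sketch of that idea in plain Python. The helper `mean_pool` and its `attention_mask` argument are hypothetical names introduced for illustration, and mean pooling here is an assumption, not something this README prescribes.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average per-token vectors into one vector, skipping masked positions.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: optional list of 0/1 flags; 0 marks padding to ignore.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    # Keep only the vectors whose mask flag is non-zero.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    # Component-wise average over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional "token embeddings" standing in for real model output.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

For a single unpadded sequence held as a tensor, as in the snippet above, the equivalent in PyTorch would be `embedding.mean(dim=1)`.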
48
+
49
+ 2. Accessing CodeSage via SentenceTransformer
50
+ ```
51
+ from sentence_transformers import SentenceTransformer
52
+ model = SentenceTransformer("codesage/codesage-large", trust_remote_code=True)
53
  ```
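Once a `SentenceTransformer` model is loaded, a typical next step is to embed a query and some code snippets and rank the snippets by similarity. The sketch below shows one way this could look. The helpers `cosine_similarity`, `rank_by_similarity`, and `embed_with_codesage` are hypothetical names introduced here; cosine similarity is an assumed scoring choice, and the README itself does not specify one. `embed_with_codesage` downloads the checkpoint when called, so it is defined but not invoked.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def embed_with_codesage(texts):
    """Embed a list of strings; assumes sentence-transformers is installed."""
    # Imported lazily so the ranking helpers work without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("codesage/codesage-large", trust_remote_code=True)
    # encode() returns a NumPy array by default; convert to nested lists.
    return model.encode(texts).tolist()


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    print(rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]))
```

With real embeddings, the first vector returned by `embed_with_codesage` could serve as the query and the rest as candidates.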
54
 
55
  ### BibTeX entry and citation info