---
license: mit
library_name: transformers
datasets:
- md-nishat-008/Mojo-Corpus
- md-nishat-008/Mojo-SFT
- md-nishat-008/Mojo-mSFT
pipeline_tag: text-generation
---

<div align="center">
<h1>πŸ”₯ Mojo-Coder πŸ”₯</h1>
<em>State-of-the-art Language Model for Mojo Programming</em>
</div>


<div align="center">
<table><tr>
<td><a href="https://arxiv.org/abs/2410.17736"><img src="https://img.shields.io/badge/arXiv-Read_Paper-blue?style=for-the-badge&logo=arxiv" /></a></td>
<td><a href="mailto:[email protected]"><img src="https://img.shields.io/badge/Email-Contact_Us-blue?style=for-the-badge&logo=gmail" /></a></td>
</tr></table>
</div>



<div align="center">
<h2>🎯 Background and Motivation</h2>
</div>

The Mojo programming language, developed by Modular, has emerged as a game-changing technology in high-performance computing and AI development. Despite its growing popularity and impressive capabilities (up to 68,000x faster than Python!), existing LLMs struggle with Mojo code generation. Mojo-Coder addresses this gap by providing specialized support for Mojo programming, built upon the robust architecture of [CodeGemma-7B-IT](https://huggingface.co/google/codegemma-7b-it/).

<div align="center">
<h2>πŸ€– Model Information</h2>
</div>

Mojo-Coder transforms natural language instructions into optimized Mojo code, supporting multiple languages (English, German, French, Spanish, and Bangla) while maintaining high-quality code generation capabilities.

<div align="center">
<h2>πŸ“ Description</h2>
</div>

The Mojo-Coder family consists of three specialized 7B-parameter models, each built on CodeGemma's architecture:
|                            | <h3><a href="https://huggingface.co/md-nishat-008/mojo-coder" style="color: #0969DA;">mojo-coder</a> πŸ”₯</h3> | <h3><a href="https://huggingface.co/md-nishat-008/mojo-coder-it" style="color: #0969DA;">mojo-coder-it</a> πŸŽ†</h3> | <h3><a href="https://huggingface.co/md-nishat-008/mojo-coder-it-m" style="color: #0969DA;">mojo-coder-it-m</a> ⭐</h3> |
|---------------------------|:---:|:---:|:---:|
| πŸ”„ Code Completion        | βœ… | βœ… | βœ… |
| πŸ’‘ NL β†’ Code Generation   |    | βœ… | βœ… |
| 🌏 Multilingual Support   |    |    | βœ… |
| πŸ“ Instruction Following  |    | βœ… | βœ… |

<div align="center">
<h2>πŸš€ Sample Usage</h2>
</div>

Choose the model that best fits your needs:
- For basic Mojo code completion: [mojo-coder](https://huggingface.co/md-nishat-008/mojo-coder)
- For English instruction-based code generation: [mojo-coder-it](https://huggingface.co/md-nishat-008/mojo-coder-it)
- For multilingual support: [mojo-coder-it-m](https://huggingface.co/md-nishat-008/mojo-coder-it-m)

Notably, our models significantly outperform current state-of-the-art models including GPT-4o and Claude-3.5-Sonnet on the HumanEval-Mojo benchmark.






<div style="color: red; text-align: center; padding: 10px; margin: 20px 0; border: 2px solid red; border-radius: 5px;">
<strong>⚠️ IMPORTANT: When using the model, you MUST explicitly mention "Mojo" in your prompts (e.g., "Write a Mojo function to...", "Create Mojo code that...") otherwise the model may not generate Mojo code!</strong>
</div>

#### For Code Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/Mojo-Coder-it")
model = AutoModelForCausalLM.from_pretrained("md-nishat-008/Mojo-Coder-it")

input_text = "Write me a Mojo function to calculate the nth fibonacci number."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
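
The snippet above runs on CPU in full precision. For GPU inference, a common variant is to load the weights in bfloat16 and let the library place them on the available device; this is a sketch using standard `transformers` options, and `device_map="auto"` additionally requires the `accelerate` package:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/Mojo-Coder-it")
# Load in bfloat16 and place on the available GPU(s); requires `accelerate`.
model = AutoModelForCausalLM.from_pretrained(
    "md-nishat-008/Mojo-Coder-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

input_text = "Write me a Mojo function to calculate the nth fibonacci number."
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```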

#### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/Mojo-Coder-it")
model = AutoModelForCausalLM.from_pretrained("md-nishat-008/Mojo-Coder-it")

chat = [{"role": "user", "content": "Write a function that calculates factorial of a number in Mojo"}]
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        inputs=inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At this point, the tokenized prompt passed to the model corresponds to the following text:

```
<bos><start_of_turn>user
Write a function that calculates factorial of a number in Mojo<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template.
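
For example, the prompt above can be assembled by hand as follows; this also defines the `prompt` variable used in the generation snippet below (note the explicit `<bos>`, since that snippet encodes with `add_special_tokens=False`):

```python
# Manually build the same prompt the chat template produces.
user_request = "Write a function that calculates factorial of a number in Mojo"
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_request}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
```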

After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
# Strip the prompt tokens and print only the newly generated completion
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

<div align="center">
<h2>βš™οΈ Inputs and Outputs</h2>
</div>

**Inputs**:
- For base model (mojo-coder): code prefix and/or suffix for Mojo code completion (see the sketch after this section)
- For instruction-tuned models (mojo-coder-it & mojo-coder-it-m): natural language prompts/instructions

<p style="color: red;"><strong>Note: In prompts, you must explicitly mention "Mojo" (e.g., "Write a Mojo function to...", "Write Mojo code to...") otherwise the models may not generate Mojo code.</strong></p>

**Outputs**:
- For all variants: Mojo code snippets and natural language responses
- Additional explanations and documentation when requested
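
The card does not spell out the exact completion-prompt format for the base model. If mojo-coder inherits CodeGemma's fill-in-the-middle special tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`), which is an assumption you should verify against the tokenizer vocabulary, a prefix/suffix completion request would look roughly like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("md-nishat-008/mojo-coder")
model = AutoModelForCausalLM.from_pretrained("md-nishat-008/mojo-coder")

# Assumption: CodeGemma-style FIM tokens; check tokenizer.get_vocab() first.
prompt = (
    "<|fim_prefix|>fn add(a: Int, b: Int) -> Int:\n    "
    "<|fim_suffix|>\n"
    "<|fim_middle|>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```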

<div align="center">
<h2>πŸ“š Model Data</h2>
</div>

### Training Dataset

Using [CodeGemma-7B-IT](https://huggingface.co/google/codegemma-7b-it/) as our base model, we further trained on:
- [Mojo-Corpus](https://huggingface.co/datasets/md-nishat-008/Mojo_Corpus): 6.5M tokens of curated Mojo code from public repositories
- [Mojo-SFT](https://huggingface.co/datasets/md-nishat-008/Mojo_SFT): 3,200 instruction-code pairs for English
- [Mojo-mSFT](https://huggingface.co/datasets/md-nishat-008/Mojo_mSFT): Multilingual instruction-code pairs in 5 languages
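
To inspect the training data, the datasets can presumably be loaded with the `datasets` library; the identifiers below come from the card metadata, while the links above use underscores (e.g. `Mojo_Corpus`), so adjust the repo id if loading fails:

```python
from datasets import load_dataset

# Dataset ids taken from the model card metadata; this is a sketch,
# not an official loading recipe.
corpus = load_dataset("md-nishat-008/Mojo-Corpus")
sft = load_dataset("md-nishat-008/Mojo-SFT")
print(corpus)
print(sft)
```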

### Training Data Processing

The following data pre-processing techniques were applied:
- Rigorous filtering pipeline (F1-F6) to ensure code quality
- Apache 2.0 license compliance
- Language detection using fastText (illustrated below)
- Duplicate removal and content validation
- Expert review for instruction-code pairs
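
As an illustration of the language-detection step, a minimal filter could use fastText's public `lid.176.bin` language-identification model; the actual filtering code and thresholds used for Mojo-Corpus are not published here, so treat this purely as a sketch:

```python
import fasttext

# Hypothetical sketch: lid.176.bin is fastText's public language-ID model,
# not an artifact of this project.
lid_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.8) -> bool:
    # fastText's predict() expects single-line input
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold
```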

<div align="center">
<h2>πŸ“Š Evaluation Information</h2>
</div>

### Evaluation Approach

We evaluate Mojo-Coder on:
- [HumanEval-Mojo](https://huggingface.co/datasets/md-nishat-008/HumanEval-Mojo): First benchmark for Mojo code generation
- Multi-language instruction following
- Code quality and execution success

### Evaluation Results

#### Code Generation Benchmarks (Pass@1)

| Model | HumanEval-Mojo |
|-------|----------------|
| GPT-4o | 25.5% |
| Claude-3.5-Sonnet | 39.8% |
| mojo-coder | 36.7% |
| mojo-coder-it-m | 61.5% |
| mojo-coder-it | 66.4% |
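
For reference, pass@1 is the probability that a single generated sample solves a problem. The card does not state how many samples were drawn per problem; the standard unbiased pass@k estimator (Chen et al., 2021) is sketched below and reduces to c/n for k = 1:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```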

<div align="center">
<h2>⚠️ Limitations and Usage</h2>
</div>

### Intended Usage
- Mojo code completion and generation
- Multi-language instruction following
- Code documentation and explanation
- Educational support for Mojo programming

### Known Limitations
- Limited to Mojo programming language
- Requires explicit mention of "Mojo" in prompts
- Performance may vary with complex algorithms
- May occasionally generate Python-like syntax
- Based on data available up to 2024

### Ethical Considerations
The model is designed for:
- Educational and development purposes
- Open-source contribution to Mojo ecosystem
- Supporting multilingual access to Mojo programming

Code should be reviewed and tested before production use, especially for performance-critical applications.



<div align="center">
<h2>πŸ“š Citation</h2>
</div>

If you find our work helpful, please consider citing our paper:

<div style="background-color: #f6f8fa; padding: 20px; border-radius: 5px; margin: 10px 0;">
<p style="margin-bottom: 10px;"><strong>MojoBench: Language Modeling and Benchmarks for Mojo</strong></p>

```bibtex
@inproceedings{Raihan2024MojoBenchLM,
   title     = {MojoBench: Language Modeling and Benchmarks for Mojo},
   author    = {Raihan, Nishat and Santos, Joanna C. S. and Zampieri, Marcos},
   year      = {2024},
   url       = {https://api.semanticscholar.org/CorpusID:273532552}
}
```