What is the final value of the training loss?

#4
by qinluo - opened

Could you share the training loss curve? Thanks a lot.

Here are the training loss curves:

Thanks very much.

I've got a similar training loss curve, but a strange HumanEval pass@1 of 0.0%.

Model architecture: multi-head attention + FlashAttention + rotary position embeddings + 2048 context length.

[attached image: training loss curve]

BigCode org

It's hard to compare without context. What data did you train your model on? Is it the same as this tiny_starcoder_py? And what's your global batch size? For us it was 2M tokens per step (256 batch size * 8192 seq_length), so if you used the same batch size with a 2048 seq length, that means you trained on 4x less data than we did with 50k steps.
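
To make that arithmetic concrete, here is a minimal sketch of the token budgets under those two settings (assuming 50k steps for both runs; the script is only illustrative):

```python
# Token-budget arithmetic for the two settings discussed above.
# Assumes both runs take 50k optimizer steps; only sequence length differs.

def total_tokens(batch_size: int, seq_length: int, steps: int) -> int:
    """Tokens seen during training = batch size * sequence length * steps."""
    return batch_size * seq_length * steps

STEPS = 50_000

# Reference run: 256 sequences x 8192 tokens ~= 2M tokens per step.
ref = total_tokens(batch_size=256, seq_length=8192, steps=STEPS)

# Same batch size but 2048-token sequences -> 4x fewer tokens overall.
short_ctx = total_tokens(batch_size=256, seq_length=2048, steps=STEPS)

print(f"reference run: {ref / 1e9:.1f}B tokens")        # ~104.9B tokens
print(f"2048-ctx run:  {short_ctx / 1e9:.1f}B tokens")   # ~26.2B tokens
print(f"ratio: {ref / short_ctx:.0f}x")                  # 4x
```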

  1. Data: same as this tiny_starcoder_py, the Python subset of starcoderdata.
  2. Global batch size: 8 * 16 = 128, so global batch tokens = 128 * 2048 (quick check below).
  3. seq_length: 2048
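
As a quick sanity check on those numbers (just restating the arithmetic from above, nothing model-specific):

```python
# Tokens per optimizer step for the config above vs. the tiny_starcoder_py run.
my_tokens_per_step = 128 * 2048    # 262,144 tokens per step
ref_tokens_per_step = 256 * 8192   # 2,097,152 tokens per step (~2M)

# With the smaller batch *and* shorter context, each step sees ~8x fewer tokens.
print(ref_tokens_per_step / my_tokens_per_step)  # 8.0
```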

I'll try more.

Thanks again.

loubnabnl changed discussion status to closed
