What is the final value of the training loss?

#4
by qinluo - opened

Could you share the training loss curve? Thanks a lot.

Here are the training loss curves:

Thanks very much.

I've got a similar training loss curve, but a strange HumanEval pass@1 of 0.0%.

Model architecture: multi-head attention + FlashAttention + rotary position embeddings + 2048 context length.

[attached image: training loss curve]

BigCode org

It's hard to compare without context. What data did you train your model on? Is it the same as this tiny_starcoder_py? And what's your global batch size? For us it was 2M tokens per step (256 batch size * 8192 seq_length), so if you used the same batch size with a 2048 seq length, that means you trained on 4x less data than we did with 50k steps.
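
To make that arithmetic concrete, here is a minimal sketch of the token budgets under those two settings (assuming 50k steps for both runs; the script is only illustrative):

```python
# Token-budget arithmetic for the two settings discussed above.
# Assumes both runs take 50k optimizer steps; only sequence length differs.

def total_tokens(batch_size: int, seq_length: int, steps: int) -> int:
    """Tokens seen during training = batch size * sequence length * steps."""
    return batch_size * seq_length * steps

STEPS = 50_000

# Reference run: 256 sequences x 8192 tokens ~= 2M tokens per step.
ref = total_tokens(batch_size=256, seq_length=8192, steps=STEPS)

# Same batch size but 2048-token sequences -> 4x fewer tokens overall.
short_ctx = total_tokens(batch_size=256, seq_length=2048, steps=STEPS)

print(f"reference run: {ref / 1e9:.1f}B tokens")        # ~104.9B tokens
print(f"2048-ctx run:  {short_ctx / 1e9:.1f}B tokens")   # ~26.2B tokens
print(f"ratio: {ref / short_ctx:.0f}x")                  # 4x
```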

  1. Data: same as this tiny_starcoder_py, the Python subset of starcoderdata.
  2. Global batch size: 8 * 16 = 128, so global batch tokens = 128 * 2048 (quick check below).
  3. seq_length: 2048
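
As a quick sanity check on those numbers (just restating the arithmetic from above, nothing model-specific):

```python
# Tokens per optimizer step for the config above vs. the tiny_starcoder_py run.
my_tokens_per_step = 128 * 2048    # 262,144 tokens per step
ref_tokens_per_step = 256 * 8192   # 2,097,152 tokens per step (~2M)

# With the smaller batch *and* shorter context, each step sees ~8x fewer tokens.
print(ref_tokens_per_step / my_tokens_per_step)  # 8.0
```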

I'll try more.

Thanks again.

loubnabnl changed discussion status to closed
