what is the final value of the training loss?
Could you share a picture of the training loss curve? Thanks a lot.
Here are the training loss curves:
It's hard to compare without context. What data did you train your model on, is it the same as this tiny_starcoder_py? And what's your global batch size? For us it was 2M tokens (256 BS * 8192 seq_length), so if you used the same batch size with a 2048 seq length, that means you trained on 4 times less data than we did over 50k steps.
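For reference, here is that token budget as a quick back-of-the-envelope sketch (just the arithmetic from the numbers above, not from the actual training config):

```python
# Token budget implied by the setup above:
# batch size 256, seq length 8192, 50k steps (all from the message).
batch_size = 256
seq_length = 8192
steps = 50_000

tokens_per_step = batch_size * seq_length   # 2,097,152 ~= 2M tokens
total_tokens = tokens_per_step * steps      # ~= 105B tokens

print(f"tokens/step: {tokens_per_step:,}")
print(f"total: {total_tokens:,} tokens over {steps:,} steps")
```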
- data: same as this tiny_starcoder_py, the Python code from starcoderdata
- global batch size: 8 * 16 = 128, so global batch tokens = 128 BS * 2048 seq_length ≈ 262k
- seq_length: 2048
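For comparison, a small sketch of the tokens seen per optimizer step in both setups (this is just the arithmetic from the numbers quoted above, not from either run's actual config):

```python
# Tokens per step: our run vs. the reference run described above.
ours_tokens_per_step = 128 * 2048     # 262,144 tokens
theirs_tokens_per_step = 256 * 8192   # 2,097,152 (~2M) tokens

ratio = theirs_tokens_per_step / ours_tokens_per_step
print(f"ours:   {ours_tokens_per_step:,} tokens/step")
print(f"theirs: {theirs_tokens_per_step:,} tokens/step")
print(f"ratio:  {ratio:.0f}x fewer tokens per step")  # 8x
```

So at the same step count, this run sees roughly 8 times fewer tokens per step than the reference setup, which would explain a higher loss at the same number of steps.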
I'll try training more.
Thanks again.