Hello!
I’m currently working on a toy project that uses a GPT-2-style model (the smallest variant, but with only 6 layers, trained from scratch) to predict the next token in programming-language source code. So my dataset is entirely source code, I’m using a custom tokenizer, and I have the following questions:
- If a sample is longer than 1024 tokens (assuming the model’s max length is 1024), are the past tokens automatically fed back to the model during training, or do I have to handle that myself? (The first sketch below shows roughly what I had in mind if it’s up to me.)
- My custom tokenizer works well (in my opinion), but I want to use the Hugging Face API to take advantage of the “fast” tokenizers. How do I go about subclassing the Tokenizer class so that my tokenizer is compatible with Hugging Face’s tokenizer API? (The second sketch below is my current guess at how this might work.)
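
For the first question, this is a minimal sketch of what I would do manually: split each tokenized file into fixed-size blocks before training. `block_size` and `chunk_ids` are just my own names (not anything from the library), and dropping the last partial block is only one of several options:

```python
# Rough sketch of how I'd chunk long samples myself.
# `block_size` and `chunk_ids` are my own names, nothing from transformers.

def chunk_ids(token_ids, block_size=1024):
    """Split one long list of token ids into fixed-size blocks.

    The last partial block is dropped here; padding or keeping it
    would be another option.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# e.g. a 2500-token source file becomes two 1024-token training samples
blocks = chunk_ids(list(range(2500)))
print(len(blocks), len(blocks[0]))  # -> 2 1024
```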
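
For the second question, my current guess (I’m not sure it’s the right approach) is to build the tokenizer with the `tokenizers` library and wrap it in `PreTrainedTokenizerFast` instead of subclassing anything. The byte-level BPE model, the training file path, and the vocab size below are placeholders standing in for my real tokenizer:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Placeholder: a byte-level BPE stands in for my real custom tokenizer here.
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=32000,  # placeholder value
    special_tokens=["<unk>", "<pad>", "<eos>"],
)
tok.train(["my_source_code.txt"], trainer=trainer)  # placeholder path

# Wrap it so it exposes the usual transformers tokenizer interface.
hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="<eos>",
)
print(hf_tok("def add(a, b): return a + b")["input_ids"])
```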
Thank you very much!!!