TL;DR: I am trying to train a GPT2 model from scratch and am not sure I am doing it right. Please go through the questions below; any help would be appreciated. Here is my current implementation.
Tokenizer:
Question1: Am I training the tokenizer the right way? Should I use all of the training text files to train the tokenizer?
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

!mkdir German
tokenizer.save_model("German")
The above code gives me: ['German/vocab.json', 'German/merges.txt']
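As a quick sanity check I added (the sample sentence is arbitrary), I encode one German sentence with the freshly trained tokenizer and inspect the result:

# Sanity check: encode an arbitrary German sentence with the trained tokenizer
output = tokenizer.encode("Guten Morgen, wie geht es dir?")
print(output.tokens)  # byte-level BPE pieces
print(output.ids)     # corresponding vocabulary ids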
Initializing the GPT2 tokenizer
Question2: Is this the right way to do it?
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(
    "./German",
    additional_special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    pad_token="<pad>",
    model_max_length=512,  # max_len is deprecated in favor of model_max_length
)
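To confirm the special tokens were actually registered, I print the pad token and its id (a minimal check I added; any sentence works for the encode call):

# Check that the pad token resolved to a real vocabulary entry
print(tokenizer.pad_token, tokenizer.convert_tokens_to_ids("<pad>"))
print(tokenizer.encode("Guten Morgen"))  # plain token ids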
Initializing the GPT2 model:
Question3: Am I initializing the GPT2 language model properly?
from transformers import GPT2Config, GPT2LMHeadModel

# Initialize a GPT2 configuration with the same vocab size the tokenizer was trained with
configuration = GPT2Config(vocab_size=52_000)
model = GPT2LMHeadModel(config=configuration)
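One safeguard I added, based on my reading of the docs: resize the embeddings to the actual tokenizer length, in case the added special tokens pushed len(tokenizer) past vocab_size (it is a no-op when the sizes already match):

# Align the embedding matrix with the tokenizer; no-op if the sizes already agree
model.resize_token_embeddings(len(tokenizer))
print(f"Number of parameters: {model.num_parameters():,}")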
Dataset:
My dataset is a single text file containing 1 million German sentences:
Question4: What kind of dataset do you suggest I use to train this model?
Here is my logic for dataset loading and training.
from transformers import TextDataset

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="./deu-de_web-public_2019_1M-sentences.txt",
    block_size=128,
)
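To see whether TextDataset produced what I expect, I decode the first block (just a check I added; each item should be block_size token ids):

# Peek at one training example: a 1-D tensor of block_size token ids
sample = dataset[0]
print(sample.shape)                       # torch.Size([128])
print(tokenizer.decode(sample.tolist()))  # should read as German text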
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # causal language modeling, not masked LM
)
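My understanding of mlm=False, which this sketch is meant to confirm: the collator copies input_ids into labels for causal language modeling, and the model shifts them internally:

# Collate two examples: with mlm=False, labels are a copy of input_ids
batch = data_collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)  # (2, 128)
print(batch["labels"].shape)     # (2, 128)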
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # per_gpu_train_batch_size is deprecated
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,  # moved here from the Trainer constructor
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
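After training I save the model and tokenizer and try a short generation (the prompt and sampling settings are arbitrary choices of mine):

# Save the final model and tokenizer next to the checkpoints
trainer.save_model("./output")
tokenizer.save_pretrained("./output")

# Try a short sample; prompt and decoding settings are arbitrary
input_ids = tokenizer.encode("Das Wetter heute ist", return_tensors="pt").to(model.device)
generated = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(generated[0], skip_special_tokens=True))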
Question5: Is this the right way to train the model? Is there a specific format the model expects the training data to be in?
Question6: How can we use multiple GPUs?
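From what I have read, Trainer uses every visible GPU automatically via torch.nn.DataParallel, with per_device_train_batch_size applying per GPU. For DistributedDataParallel, the same script (here a hypothetical train_gpt2.py containing the code above) would be launched once per GPU:

# Hypothetical script name; DDP launch with 4 GPUs on one machine
!python -m torch.distributed.launch --nproc_per_node=4 train_gpt2.py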