The logging_dir is where the TensorBoard files are stored. Since you're not specifying --logging_strategy/--logging_steps, the Trainer logs every 500 steps by default. You can visualize the data in a web browser with the following command: tensorboard --logdir content/logs
This will output some text in the terminal, which should contain a localhost address (something like http://localhost:6006/). Copy and paste that into a web browser (or Ctrl+click it) and you should see a bunch of TensorBoard plots.
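For example, if you want the loss to be logged more often than the 500-step default, something like this sketch should work (the paths are just placeholders, and model/datasets are assumed to be defined elsewhere):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="content/results",   # placeholder
    logging_dir="content/logs",     # TensorBoard event files end up here
    logging_strategy="steps",
    logging_steps=50,               # log every 50 steps instead of the default 500
    report_to="tensorboard",
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()

Then tensorboard --logdir content/logs will pick up the new events as training runs.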
Hi @mapama247, I would appreciate it if you could let me know how I can see my tensors. My system is Linux. My training code is as follows, with Results_Path="/home/nlpproject/":
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, learning_rate=5e-5, num_train_epochs=15,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch", logging_steps=5000,
    per_device_train_batch_size=2, per_device_eval_batch_size=2, warmup_steps=100, weight_decay=0.01,
    save_total_limit=1, load_best_model_at_end=True, seed=42)

Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
Hi @illstart, I hope you are fine. Sorry, I want to visualize the training and validation loss logs from the Trainer. Would you please tell me how you did that?
@mapama247, many thanks for your reply. Sorry, during training I can see the saved checkpoints, but when training is finished no checkpoints are left for testing; they all disappear from the folder. Would you please tell me how I can save the best model? My code is as follows:
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch", report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2, warmup_steps=100, weight_decay=0.01,
    save_total_limit=1, load_best_model_at_end=True, seed=42)

Trainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
Ah okay… but this is a different problem that has nothing to do with TensorBoard. The reason the checkpoints disappear is that you have the argument save_total_limit=1, which limits the number of saved checkpoints to 1. Just remove it or increase the number.
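For reference, a sketch of just the retention-related arguments (your other settings stay the same; the exact limit is up to you):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,   # keep the 3 most recent checkpoints; drop this line to keep all of them
)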
Hi @mapama247, sorry, do you know how I can save the model after each epoch, regardless of whether it is the best model or not? If I change the strategies from steps to epoch, it doesn't save any checkpoints at the end. I want to save the model after every epoch.
This is my code:
Remove save_total_limit=2, set save_strategy, evaluation_strategy and logging_strategy to "epoch", and remove save_steps=500, eval_steps=500 and logging_steps=500. This way you should get 15 folders with a different checkpoint each (one per epoch) and the best checkpoint directly in your output_dir.
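A sketch of just those arguments (keeping the rest of your setup as it was):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path,
    num_train_epochs=15,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    # no save_total_limit, save_steps, eval_steps or logging_steps,
    # so every per-epoch checkpoint is kept
)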
@mapama247, many thanks for your reply. Sorry, have you used multiple GPUs with the Trainer API? I use it, but the results are very strange compared to using 1 GPU. I didn't change anything in my single-GPU code and just let the Trainer use all GPUs, yet the results are very strange. Can you please help me with that?
To simply get the logs of the trainer object, you can use trainer.state.log_history.
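For example, assuming trainer is your Trainer instance and training has run, log_history is a list of dicts that you can inspect directly:

history = trainer.state.log_history
for entry in history:
    if "loss" in entry:           # training logs
        print(entry["step"], entry["loss"])
    elif "eval_loss" in entry:    # evaluation logs
        print(entry["step"], entry["eval_loss"])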
To get detailed logs of everything HF does under the hood though, what worked for me was to disable the Hugging Face default logger and add a custom Python logger that writes to a file or to stdout.
After you disable the default handler, remember to call logging.enable_propagation() on the Hugging Face logging module so you can attach your own logger for the "transformers" library.
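A minimal sketch of that setup (the log file name is just an example):

import logging
from transformers.utils import logging as hf_logging

# Drop the library's default handler and let "transformers" records
# propagate up to the root logger instead.
hf_logging.disable_default_handler()
hf_logging.enable_propagation()

# Root logger that writes everything, including the Trainer's messages,
# to a file and to stdout.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[logging.FileHandler("train.log"), logging.StreamHandler()],
)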
Check out the definitions for enabling and disabling the default loggers and enabling propagation here: