The logging_dir is where the TensorBoard files are stored. Since you're not specifying --logging_strategy/--logging_steps, the Trainer logs every 500 steps by default. You can visualize the data in a web browser with the following command: tensorboard --logdir content/logs
This will output some text in the terminal, which should contain a localhost address (something like http://localhost:6006/). Copy and paste that into a web browser (or Ctrl+click it) and you should see a bunch of TensorBoard plots.
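For example, if you want the loss to be logged more often than the 500-step default, something like this sketch should work (the paths are just placeholders, and model/datasets are assumed to be defined elsewhere):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="content/results",   # placeholder
    logging_dir="content/logs",     # TensorBoard event files end up here
    logging_strategy="steps",
    logging_steps=50,               # log every 50 steps instead of the default 500
    report_to="tensorboard",
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()

Then tensorboard --logdir content/logs will pick up the new events as training runs.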
Hi @mapama247, I would appreciate it if you could let me know how I can see my tensors. My system is Linux. My training code is as follows, with Results_Path="/home/nlpproject/":
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, learning_rate=5e-5, num_train_epochs=15,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch", logging_steps=5000,
    per_device_train_batch_size=2, per_device_eval_batch_size=2, warmup_steps=100, weight_decay=0.01,
    save_total_limit=1, load_best_model_at_end=True, seed=42)

Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
Hi @illstart, I hope you are fine. Sorry, I want to visualize the training and validation loss logs from the Trainer. Would you please tell me how you did that?
@mapama247, many thanks for your reply. Sorry, during training I can see the saved checkpoints, but when training is finished no checkpoints are left for testing; they all disappear from the folder. Would you please tell me how I can save the best model? My code is as follows:
import torch
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch", report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2, warmup_steps=100, weight_decay=0.01,
    save_total_limit=1, load_best_model_at_end=True, seed=42)

Trainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
Ah okay… but this is a different problem that has nothing to do with TensorBoard. The reason the checkpoints disappear is that you have the argument save_total_limit=1, which limits the number of saved checkpoints to 1. Just remove it or increase the number.
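For reference, a sketch of just the retention-related arguments (your other settings stay the same; the exact limit is up to you):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,   # keep the 3 most recent checkpoints; drop this line to keep all of them
)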
Hi @mapama247, sorry, do you know how I can save the model after each epoch, regardless of whether it is the best model or not? If I change the strategies from steps to epoch, it doesn't save any checkpoints at the end. I want to save the model after every epoch.
This is my code:
Remove save_total_limit=2, set save_strategy, evaluation_strategy and logging_strategy to "epoch", and remove save_steps=500, eval_steps=500 and logging_steps=500. This way you should get 15 folders with a different checkpoint each (one per epoch) and the best checkpoint directly in your output_dir.
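A sketch of just those arguments (keeping the rest of your setup as it was):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=Results_Path,
    num_train_epochs=15,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    # no save_total_limit, save_steps, eval_steps or logging_steps,
    # so every per-epoch checkpoint is kept
)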
@mapama247, many thanks for your reply. Sorry, have you used multiple GPUs with the Trainer API? I use it, but the results are very strange compared to using 1 GPU. I didn't change anything in my single-GPU code and just let the Trainer use all GPUs, yet the results are very strange. Can you please help me with that?
To simply get the logs of the trainer object, you can use trainer.state.log_history.
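For example, assuming trainer is your Trainer instance and training has run, log_history is a list of dicts that you can inspect directly:

history = trainer.state.log_history
for entry in history:
    if "loss" in entry:           # training logs
        print(entry["step"], entry["loss"])
    elif "eval_loss" in entry:    # evaluation logs
        print(entry["step"], entry["eval_loss"])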
To get detailed logs of everything HF does under the hood though, what worked for me was to disable the Hugging Face default logger and add a custom Python logger that writes to a file or to stdout.
After you disable the default handler, remember to call logging.enable_propagation() on the Hugging Face logging module so you can attach your own logger for the "transformers" library.
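A minimal sketch of that setup (the log file name is just an example):

import logging
from transformers.utils import logging as hf_logging

# Drop the library's default handler and let "transformers" records
# propagate up to the root logger instead.
hf_logging.disable_default_handler()
hf_logging.enable_propagation()

# Root logger that writes everything, including the Trainer's messages,
# to a file and to stdout.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[logging.FileHandler("train.log"), logging.StreamHandler()],
)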
Check out the definitions for enabling and disabling the default loggers and enabling propagation here: