How to debug NaN logits output during training

Dear community members and @John6666 ,

I have successfully found a way to train (a continuation from past questions), and I've run into a problem that I could not debug with the "print" method.

I found that some of the variables are NaN or empty. I did learn from this topic that the learning rate, or even a difference in data shape between GPUs, can affect the calculations. Where do I start debugging something like this?

I know I want to look at this bit of code:

import torch
from torch import nn
import evaluate

metric = evaluate.load("mean_iou")

def compute_metrics(eval_pred):
  with torch.no_grad():
    logits, labels = eval_pred
    logits_tensor = torch.from_numpy(logits)
    # scale the logits to the size of the label
    logits_tensor = nn.functional.interpolate(
        logits_tensor,
        size=labels.shape[-2:],
        mode="bilinear",
        align_corners=False,
    ).argmax(dim=1)

    pred_labels = logits_tensor.detach().cpu().numpy()
    # currently using _compute instead of compute
    # see this issue for more info: https://github.com/huggingface/evaluate/pull/328#issuecomment-1286866576
    metrics = metric._compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=0  
            )
  
    print("metrics:" metrics)
    # add per category metrics as individual key-value pairs
    per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
    per_category_iou = metrics.pop("per_category_iou").tolist()

    metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
    metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
    
    return metrics

print(len(id2label))
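
Would adding checks like these at the top of compute_metrics be the right way to see what's going on? (Just my own sketch of debug prints, assuming logits and labels arrive as numpy arrays like above.)

import numpy as np

def debug_eval_pred(logits, labels):
    # Shapes of what the Trainer hands to compute_metrics
    print("logits shape:", logits.shape, "labels shape:", labels.shape)
    # Check the raw logits for NaN/Inf before any post-processing
    print("NaN in logits:", np.isnan(logits).sum(), "Inf in logits:", np.isinf(logits).sum())
    # Which label ids actually occur in the references
    print("unique label values:", np.unique(labels))

# called as debug_eval_pred(logits, labels) right after unpacking eval_pred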

Probably somewhere among all these variables there's something I need to readdress (not in my dataset; out of bounds again) or convert to another data type (int, for example, or it's still in a list and unretrievable). I did try printing along the functions, but it looks like if a value isn't returned it won't show up in the output.

Wait, let me do that.

Huh, but this part is almost hidden and all a black box:


from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=feature_extractor,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

How do I know what the outputs of these are?
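
One thing I'm thinking of trying (my own sketch, not sure it's the intended way): run trainer.predict() on the eval set to look at the raw logits directly, or attach a TrainerCallback that warns whenever a logged value like the loss becomes NaN.

import math
from transformers import TrainerCallback

class NanWatchCallback(TrainerCallback):
    """Warn whenever a logged value (e.g. the training loss) turns into NaN."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        for name, value in (logs or {}).items():
            if isinstance(value, float) and math.isnan(value):
                print(f"step {state.global_step}: {name} is NaN")

# trainer = Trainer(..., callbacks=[NanWatchCallback()])
# outputs = trainer.predict(test_ds)   # outputs.predictions holds the raw logits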

New error after several tweaks:

/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:258: RuntimeWarning: invalid value encountered in scalar divide
  all_acc = total_area_intersect.sum() / total_area_label.sum()
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:260: RuntimeWarning: invalid value encountered in divide
  acc = total_area_intersect / total_area_label
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:263: RuntimeWarning: Mean of empty slice
  metrics["mean_accuracy"] = np.nanmean(acc)

Metrics before post-processing: {'mean_iou': 0.0, 'mean_accuracy': nan, 'overall_accuracy': nan, 'per_category_iou': array([0., 0.]), 'per_category_accuracy': array([nan, nan])}

Does this mean there's a problem with my masks and pictures again? Feel free to ask for more code if this is incomplete for reasoning about it.
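
If I read the warnings right, total_area_label.sum() must be zero. My guess at how that can happen with ignore_index=0 when the mask only contains 0 and 255 (a toy numpy reproduction, simplified and not the real mean_iou code):

import numpy as np

labels = np.array([[0, 0], [0, 255]])   # mask pixels are only 0 and 255
num_labels, ignore_index = 2, 0

valid = labels[labels != ignore_index]  # ignore_index=0 drops almost every pixel
# the leftover value 255 is outside range(num_labels), so it is never counted
area_label = np.histogram(valid, bins=num_labels, range=(0, num_labels - 1))[0]
print(area_label, area_label.sum())     # -> [0 0] 0, and anything / 0 becomes nan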

1 Like

First, I saw your picture. From your picture, around step 20 there is an exploding problem. In my opinion, you should normalize your dataset. Normalization is the easiest way and also a critical step.
Second, if you want to watch the training run, I recommend you use MLflow.
I hope my advice is helpful for your debugging.

2 Likes

Can I just apply torch.nn.functional.normalize to each tensor, as in this part?

logits_tensor = nn.functional.interpolate(
    logits_tensor,
    size=labels.shape[-2:],
    mode="bilinear",
    align_corners=False,
).argmax(dim=1)

Because when I did that, it asked for a dimension. Is that according to my image dimensions (RGB or B/W, depending on whether it means the image or the mask, if that's what "dimension" refers to), or something else? The documentation says it should be the dimension to reduce over. Which dimension? Am I confusing the wrong things?
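
For reference, this is how I currently understand F.normalize (a small sketch; it scales vectors to unit norm along dim, which is probably not what I need for 0/255 masks anyway):

import torch
import torch.nn.functional as F

x = torch.tensor([[3.0, 4.0], [6.0, 8.0]])
# dim is the dimension along which each vector is scaled to unit L2 norm,
# not the number of image channels
print(F.normalize(x, p=2, dim=1))  # each row becomes [0.6, 0.8]
print(F.normalize(x, p=2, dim=0))  # each column is normalized instead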

About MLflow: I'm doing this on Google Colab. Is it a GUI in which I should run my code for debugging?

Never mind, I found a way to put it in my code. Let me see.

1 Like

I read your error.
In your error there is a division, and that is where the error occurs. In my opinion, there are two possible reasons.
First, your dataset's labels or predictions have a mismatch, so your labels are zero or undefined.
Second, your model training has a problem. In other words, your parameter initialization or something else has an error, so there is no change in your model weights.

So how about checking your dataset and visualizing your training?

This is the answer to the second question.
I think your strategy is good. But the dimension shouldn't be decreased; that is just for your training speed.
And you can use MLflow like this.

import mlflow
import mlflow.sklearn

# Start an MLflow run
with mlflow.start_run():
    # Example: log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)

    # Example: log metrics (placeholder values; use your real accuracy/loss here)
    for epoch in range(10):
        accuracy = 0.5 + 0.04 * epoch  # simulated accuracy
        loss = 1.0 / (epoch + 1)       # simulated loss
        mlflow.log_metric("accuracy", accuracy, step=epoch)
        mlflow.log_metric("loss", loss, step=epoch)

    # Example: save and log the model (replace your_model with your trained model)
    mlflow.sklearn.log_model(your_model, "model")

    # Optional: log artifacts like plots or datasets
    mlflow.log_artifact("path/to/your/file")
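
If you are using the Trainer, I think there is also a simpler route (as far as I know): set report_to in TrainingArguments and the Trainer logs to MLflow by itself, as long as mlflow is installed in your Colab environment.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="segformer-debug",  # example output directory
    logging_steps=10,              # how often the training loss etc. are logged
    report_to=["mlflow"],          # the Trainer then sends its logs to MLflow
)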

1 Like

This has been a recurring problem because apparently I made my own id2labels.json; you can look at it if it adds more context here. I did try "getting all possible labels by grouping them after each inference" from this question for better context, and the program actually got confused about what to segment.

I tried making it 0 and 1, like what I have today, and actually got what is probably a false positive due to the small number of epochs in the first training here. So how do you really make an id2label from a pretrained model's id2label?
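
What I have in mind is something like the SegFormer fine-tuning example (a sketch; the class names here are just my guess for this water-gate dataset):

from transformers import SegformerForSemanticSegmentation

id2label = {0: "not-water", 1: "water"}          # hypothetical label names
label2id = {name: idx for idx, name in id2label.items()}

model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initializes the decode head for the new number of classes
)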

Thanks for the example; I'll see how to implement that in my code.

Btw, if I just let the model training run, does the model actually break, or am I just missing some stats that I could circumvent via its products, like creating new masks with the new model and counting the mIoU manually?
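
What I mean by counting the mIoU manually is roughly this (just a sketch, assuming both masks are integer arrays of the same shape):

import numpy as np

def iou_per_class(pred, gt, num_labels):
    # IoU per class from a predicted mask and a ground-truth mask
    ious = []
    for c in range(num_labels):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(intersection / union if union > 0 else float("nan"))
    return ious

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(iou_per_class(pred, gt, num_labels=2))  # the mean of these values is the mIoU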

1 Like

Did you correct your dataset? I've experienced your situation. At that time, I checked the dataset and was surprised: I thought I had imported the data, but the values were all zero. It was just a good example. I hope you correct your error!

1 Like

For now, I’ve built a dataset with the mask converted to grayscale for problem verification.

from datasets import load_dataset
import numpy as np

dsrepo1 = "John6666/segformer-b0-finetuned-ade-512-512-manggarai-watergate-2" # Grayscaled
dsrepo2 = "John6666/segformer-b0-finetuned-ade-512-512-manggarai-watergate-3" # Normalized
ds1 = load_dataset(dsrepo1, split="train[1:6]")
ds2 = load_dataset(dsrepo2, split="train[1:6]")

for i in range(5):
    print("Original:", np.array(ds1[i]["label"]))
    print("Normalized:", np.array(ds2[i]["label"]))
Original: [[  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]
 ...
 [255 255 255 ...   0   0   0]
 [255 255 255 ...   0   0   0]
 [255 255 255 ...   0   0   0]]
Normalized: [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]]

Code for normalization

from datasets import load_dataset, Dataset
from PIL import Image
import numpy as np

dataset_repo = "seand0101/segformer-b0-finetuned-ade-512-512-manggarai-watergate"
ds = load_dataset(dataset_repo, split="train")
newdsl = []
for i in range(ds.num_rows):
    # convert the 0/255 grayscale mask into a 0/1 mask
    newdsl.append({"pixel_values": ds[i]["pixel_values"],
                   "label": Image.fromarray(np.array(ds[i]["label"].convert("L")) / 255).convert("L")})
newds = Dataset.from_list(newdsl) # normalized dataset
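
A map-based version of the same thing might be tidier (just a sketch with a hypothetical helper, assuming the same pixel_values/label columns):

from PIL import Image
import numpy as np

def binarize_label(example):
    # convert the 0/255 grayscale mask to a 0/1 mask, same idea as the loop above
    arr = np.array(example["label"].convert("L")) // 255
    example["label"] = Image.fromarray(arr.astype(np.uint8))
    return example

newds = ds.map(binarize_label)  # normalized dataset without building a Python list first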

@Alanturner2 I did fix it with @John6666 and the program could run, so I thought it was no longer about empty data…

@John6666 Why is the second dataset all black, though? Is that what it's supposed to look like when normalized? As in, white is now [1], but only at the pixel level, so it's not really visible?

Btw, thanks, I'll try to fit that into the code. Sorry for the time-zone difference breaking up the discussion. Will update later.

1 Like

It's just that you can't see it, but there must be a color that is infinitely close to black without being black. :sweat_smile: Anyway, if you look at the image mask data for the HF sidewalk dataset in numpy, you can see that there seems to be a correspondence between the labels and the numbers indicating the gradation. I don't know if this is right, but it should be close.

Yeah, I thought it was all black, but then I saw your array examples, so that's what normalized looks like…

:sweat_smile: Anyway, if you look at the image mask data for the HF sidewalk dataset in numpy, you can see that there seems to be a correspondence between the labels and the numbers indicating the gradation. I don't know if this is right, but it should be close.

Do you mean that the masking they did in the sidewalk dataset also normalized the gradation, or that the data can actually retain variation even after normalization? Wait, did I get that right?

Btw, which part of this

   newdsl.append({"pixel_values": ds[i]["pixel_values"], "label": Image.fromarray(np.array(ds[i]["label"].convert("L")) / 255).convert("L")})

is akin to torch.nn.functional.normalize? The division by 255 (is 255 the highest pixel value, so normalization basically puts everything on the same scale relative to the highest pixel value)?

1 Like

Before and after.
I think this is an example of the preprocessing in the SegFormer article. If you create data equivalent to the "after" version, you can probably use it for training as described in the article.

are akin to torch.nn.functional.normalize?

Well, to put it roughly, it's the same. Because we know in advance that there are only 0 and 255, we can use this method without worrying about accuracy. :sweat_smile:
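
Concretely, the check is trivial (a two-line sketch):

import numpy as np

mask = np.array([0, 255, 255, 0])
print(np.unique(mask / 255))  # -> [0. 1.], so dividing by the maximum value is enough here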

It's weird. I made sure to fit the normalized dataset in properly, but the result is still NaN.

 newdsl.append({"pixel_values": ds[i]["pixel_values"], "label": Image.fromarray(np.array(ds[i]["label"].convert("L")) / 255).convert("L")})

Although in the previous code it's a PIL Image appended to a list for the Dataset, should we give the dataset np.array characteristics before the calculations?

1 Like

It feels like a pain to have to go through the trouble of inserting detailed print statements and the like to understand the flow of data, but it’s a shortcut.

I was going to add the printed image, but after this division it looks like I cannot see what's inside anymore by calling print(np.array(train_ds['pixel_values']['label'][0])) after the division. I was assuming train_test_split() wouldn't change the data format.

If that's the case and the data stays normalized after the division, is it time we checked the training parameters?

I successfully printed the divided dataset; it stays the same and normalized.

1 Like

Datasets are like this.

# newds["label"][1]
newds[1]["label"]
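
To be concrete (a small check; both forms work, but the row-first form fetches just one example instead of the whole column):

row = newds[1]            # one example as a dict, then take its "label" field
print(type(row["label"]))

column = newds["label"]   # the whole "label" column as a list, then take index 1
print(type(column[1]))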
1 Like

What, isn't that reversed, since 1 is the index within the label column?

1 Like

We did it, boys. The not-water accuracy is still NaN for some reason, though. I thank thee all.

The last thing I did was revert this code back to its tutorial example:

from torchvision.transforms import ColorJitter
from transformers import SegformerImageProcessor

processor = SegformerImageProcessor()
jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) 

def train_transforms(example_batch):
    images = [jitter(x) for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = processor(images, labels)
    return inputs


def val_transforms(example_batch):
    images = [x for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = processor(images, labels)
    return inputs


# Set transforms
train_ds.set_transform(train_transforms)
test_ds.set_transform(val_transforms)

@Alanturner2 @John6666 Thank you to both of you. Should we add the torch.nn.functional.normalize version for posterity? I think John's solution to normalization is a bit simple, but it works.

Btw, what I meant by the "tutorial version" of that code is that previously we (me and John) modified that code to convert everything to B/W. Do you guys think that affects the normalized variables? Maybe I should check it after the training, because I really don't know which part of the other code makes it NaN after normalization.

For future readers, the answer is to always normalize your dataset.

Oh guys, one more thing. Is there any reading about judging these scores, like when do these scores become abnormally bad or good? mIoU and other evaluation metrics rarely have any free readings, except maybe this one.

2 Likes

Thank you, @seand0101 and @John6666.
I hope for your success with your code. I will do my best to help our friends in this forum.
I was so happy when I solved the problem with our friends.

2 Likes

Thanks to you too! Hope our journeys in AI make us wise.

2 Likes