Dear community members and @John6666,
Continuing from my past questions, I have finally found a way to train, but I ran into a problem that I could not debug with the "print" method.
I found that some of the variables are NaN or empty. I did find in this topic that the learning rate, or even a difference in data shape between GPUs, can affect the calculations. Where do I start debugging something like this?
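I guess a first step could be to scan the model for NaN/Inf values after a training step; just a rough sketch, assuming `model` is the same object I pass to the Trainer further down:

```python
import torch

def report_nan(model):
    # Scan every parameter and its gradient for NaN/Inf values.
    for name, param in model.named_parameters():
        if torch.isnan(param).any() or torch.isinf(param).any():
            print(f"NaN/Inf in parameter: {name}")
        if param.grad is not None and (
            torch.isnan(param.grad).any() or torch.isinf(param.grad).any()
        ):
            print(f"NaN/Inf in gradient of: {name}")

# e.g. call report_nan(model) right after a training step, or from a callback
```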
I do know that I want to look at this bit of code:
import torch
from torch import nn
import evaluate

metric = evaluate.load("mean_iou")

def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        # scale the logits to the size of the label
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        # currently using _compute instead of compute
        # see this issue for more info: https://github.com/huggingface/evaluate/pull/328#issuecomment-1286866576
        metrics = metric._compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=0,
        )
        print("metrics:", metrics)

        # add per category metrics as individual key-value pairs
        per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
        per_category_iou = metrics.pop("per_category_iou").tolist()
        metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
        metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})

        return metrics

print(len(id2label))
Probably somewhere among all these variables there is something I need to readdress (out of bounds again, though not in my dataset) or convert to another data type (an int, for example, or it is still in a list and can't be retrieved). I did try to print values inside the functions, but it looks like if a value isn't returned, it doesn't show up in the prompt.
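If I understand it correctly, prints inside compute_metrics only show up when the Trainer actually runs an evaluation step, so this is the kind of diagnostic I would drop at the top of it; just a rough sketch, assuming numpy is imported as np:

```python
import numpy as np

def debug_eval_pred(eval_pred):
    # Call this at the top of compute_metrics to see what the Trainer passes in.
    logits, labels = eval_pred
    print("logits shape:", logits.shape, "dtype:", logits.dtype)
    print("labels shape:", labels.shape, "dtype:", labels.dtype)
    print("unique label values:", np.unique(labels))
```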
Wait, let me do that.
Huh, but this part is almost hidden and is basically a black box:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=feature_extractor,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)
How do I know what the outputs of these are?
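Would something like this be the right way to peek inside? A rough sketch that pulls one batch through the Trainer's own dataloader and runs a prediction pass, reusing `trainer` and `test_ds` from the snippet above:

```python
# Grab one batch through the exact collator/dataloader the Trainer will use.
batch = next(iter(trainer.get_train_dataloader()))
for key, value in batch.items():
    print(key, type(value), getattr(value, "shape", None), getattr(value, "dtype", None))

# Run a prediction pass over the eval set and look at the raw arrays
# before they reach compute_metrics (predictions may be a tuple for some models).
outputs = trainer.predict(test_ds)
print("predictions:", getattr(outputs.predictions, "shape", type(outputs.predictions)))
print("label_ids:", getattr(outputs.label_ids, "shape", type(outputs.label_ids)))
print("metrics:", outputs.metrics)
```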
New error after some more tweaking:
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:258: RuntimeWarning: invalid value encountered in scalar divide
all_acc = total_area_intersect.sum() / total_area_label.sum()
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:260: RuntimeWarning: invalid value encountered in divide
acc = total_area_intersect / total_area_label
/root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--mean_iou/9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0/mean_iou.py:263: RuntimeWarning: Mean of empty slice
metrics["mean_accuracy"] = np.nanmean(acc)
Metrics before post-processing: {'mean_iou': 0.0, 'mean_accuracy': nan, 'overall_accuracy': nan, 'per_category_iou': array([0., 0.]), 'per_category_accuracy': array([nan, nan])}
Does this mean there is a problem with my masks and pictures again? Feel free to ask for more of the code if this is incomplete for reasoning.
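One more thought, in case it helps with the reasoning: if I'm reading the warnings right, `total_area_label.sum()` is 0, i.e. after dropping `ignore_index=0` no reference pixel falls inside `[0, num_labels)`. That would happen if the masks only contain 0, or if the foreground is stored as 255 instead of 1. A rough check I plan to run (the "labels" key is a guess, it depends on how the dataset is built):

```python
import numpy as np

# Look at the values that actually occur in one ground-truth mask.
# If nothing ends up in [0, num_labels) after ignoring 0, the mean_iou
# denominators are all zero, which matches the warnings above.
sample = test_ds[0]
mask = np.array(sample["labels"])  # key name is a guess, adjust to your dataset
print("unique mask values:", np.unique(mask))
print("num_labels:", num_labels)
print("id2label:", id2label)
```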