Unable to load images

Hello all!

I am working on an OCR project. I saved some image-text pairs in a dataset and uploaded them to a Hugging Face repo. When I fetch the same repo on another machine, some images are reported as corrupted with PIL.UnidentifiedImageError. So far I have:

  • Checked with SQL whether the images are really corrupted. They aren’t.
  • To be on the safe side, removed the 12 rows reported as “corrupted” and re-uploaded the dataset. Now 12 different rows are reported as “corrupted”. I checked these too, and they render perfectly on the website.
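A byte-level check on the downloading machine can help tell an empty or truncated payload apart from a decoder problem. The following is only a rough sketch, not a substitute for actually decoding the image with Pillow. It assumes the images are stored as raw PNG or JPEG bytes, and the names classify_image_bytes and find_suspect_rows are made up for illustration.

```python
# Dependency-free sanity check on raw image bytes (assumes PNG or JPEG).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_TRAILER = b"IEND\xaeB`\x82"  # IEND chunk type followed by its fixed CRC
JPEG_START = b"\xff\xd8\xff"     # SOI marker plus the first byte of the next marker
JPEG_END = b"\xff\xd9"           # EOI marker


def classify_image_bytes(data: bytes) -> str:
    """Return 'ok', 'empty', 'truncated', or 'unknown' for a raw image payload."""
    if not data:
        return "empty"
    if data.startswith(PNG_SIGNATURE):
        return "ok" if data.endswith(PNG_TRAILER) else "truncated"
    if data.startswith(JPEG_START):
        # Some valid JPEGs carry trailing bytes after EOI; this check is strict.
        return "ok" if data.endswith(JPEG_END) else "truncated"
    return "unknown"


def find_suspect_rows(payloads):
    """Return (index, status) for every payload that does not look complete."""
    suspects = []
    for index, data in enumerate(payloads):
        status = classify_image_bytes(data)
        if status != "ok":
            suspects.append((index, status))
    return suspects
```

If the rows flagged here change from one upload to the next, that points at how the files are written rather than at the files themselves.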

To try to mitigate this, I updated the datasets library to the latest version (3.2.0) on both the uploading system and the compute system that downloads the data. Unfortunately, this is where I am stuck.

Is anyone else facing a similar issue?

Thanks a lot.

I’ve never heard of that error before, but the behavior seems like a bug…
It might be quicker to raise an issue on GitHub.

Also, if the image dataset is extremely large, there seems to be a trick to creating it.

Thank you for pointing me to the datasets GitHub repo. After debugging many hundreds of images in the dataset, I found that the bug is actually in the augmentation part of my code: it does not account for race conditions.
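For anyone landing here with the same symptom: the augmentation code was not shared, so the following is only a guess at one common form of this race, where a reader or uploader picks up a file while another worker is still writing it. A minimal sketch of an atomic write, assuming files are saved to a local filesystem; atomic_write_bytes is a hypothetical helper name, not part of datasets or Pillow.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write data to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem, which os.replace needs to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Each worker writes to its own temporary file, so two workers targeting the same path cannot interleave their bytes; the last rename wins with a complete file.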