I am trying to build an OCR image and have saved some image-text pairs in a dataset and uploaded them to a Hugging Face repo. While fetching the same repo on another machine, some images are reported as corrupted with PIL.UnidentifiedImageError. I have:
Checked with SQL whether the images are really corrupted. They aren't.
Just to be on the safe side, removed the 12 rows that were shown as "corrupted" and reuploaded the dataset. Now 12 different rows are shown as "corrupted". I have checked these too, and they render perfectly on the website.
To try to mitigate this, I updated datasets to the latest version (3.2.0) on both the uploading machine and the compute machine that downloads the data. Unfortunately, this is where I am stuck.
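For anyone hitting the same symptom, a small local scan can show which rows fail to decode and whether the failing rows stay the same between runs. This is only a sketch: `pil_decode` and `find_undecodable` are hypothetical helper names, and it assumes you already have the raw image bytes for each row.

```python
import io
from typing import Callable, Iterable, List, Tuple


def pil_decode(data: bytes) -> None:
    """Fully decode one image with Pillow; raises if the bytes are not a valid image."""
    from PIL import Image  # imported lazily so the scan itself has no hard dependency

    with Image.open(io.BytesIO(data)) as img:
        img.load()  # force a full decode, not just a header read


def find_undecodable(
    blobs: Iterable[bytes],
    decode: Callable[[bytes], None] = pil_decode,
) -> List[Tuple[int, str]]:
    """Return (row_index, error_message) for every blob that fails to decode."""
    failures = []
    for index, data in enumerate(blobs):
        try:
            decode(data)
        except Exception as exc:  # PIL.UnidentifiedImageError is an OSError subclass
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures
```

With datasets, one way to get the raw bytes is to cast the image column with `Image(decode=False)`, so each row holds a dict with `bytes` and `path` instead of a decoded image (the column name depends on your dataset). If the set of failing indices changes from one run to the next, the stored files are probably fine and the problem is more likely somewhere in the processing pipeline.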
Thank you for pointing me to the datasets GitHub repo. After a lot of debugging across hundreds of images in the dataset, I found that the bug is actually in the augmentation part of my code. It did not account for race conditions.
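The exact race is not described above, so the following is only an illustration of one common case, not necessarily what happened here. If several augmentation workers write an intermediate file to the same path, a reader can pick up a half-written file, which PIL then cannot identify. A minimal sketch of avoiding that, assuming this scenario (`atomic_write_bytes` is a hypothetical helper name): write to a unique temporary file and rename it into place.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write data to path so that readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Each call gets its own temp file, so concurrent writers never share a handle.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        # Rename into place; on POSIX this swap is atomic within one filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If several workers call this with the same target path, the last rename wins, but the file at that path is always one complete payload and never a mix of two.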