Handle errors when loading images (404, corrupted, etc)

nlassaux · August 10, 2023, 6:08pm

Hello, I am loading images into a Dataset by casting their URLs to datasets.Image objects.

from datasets import Dataset, Image

# DBClient, BUCKET and FOLDER are defined elsewhere in my code.
def load_dataset(db_client: DBClient) -> Dataset:
    """Loads the dataset from the given bucket."""
    paths = list(db_client.missing_image_paths())

    def url_from_path(path: str) -> str:
        return f'gs://{BUCKET}/{FOLDER}{path}'

    # Store the URLs as strings, then let Image() decode them on access.
    return Dataset.from_dict({
        'image': [url_from_path(path) for path in paths],
        'filename': paths
    }).cast_column('image', Image())

Some of these images no longer exist. The first one loads fine, so print(dataset[0]) gives:
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1920x1280 at 0x132D99190>, 'filename': 'b57ed2793e6a8ae06382c78a87863b8d.jpg'}

But if I try to load more, at some point I get an error like this: PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13592fab0>

Is there some way to specify that we want to ignore those issues and discard the images when that happens?

mariosasko · August 16, 2023, 3:35pm

Hi! You can remove invalid image files with

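import datasets
import PIL.Image

# Keep the column undecoded so each example exposes {"bytes": ..., "path": ...}
dataset = dataset.cast_column("image", datasets.Image(decode=False))

def has_valid_image(ex):
    # Assumes ex["image"]["path"] is a path that PIL can open directly
    try:
        # Opening reads the header; close right away to avoid leaking file handles
        PIL.Image.open(ex["image"]["path"]).close()
    except Exception:
        return False
    return True

# Drop unreadable images, then re-enable decoding
dataset = dataset.filter(has_valid_image)
dataset = dataset.cast_column("image", datasets.Image(decode=True))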
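
If a stricter check is wanted, here is a rough sketch of a validity test that works on the undecoded image dict. The helper name is_readable_image is made up for illustration, and it assumes the dict has the "bytes" and "path" keys and that any path is a local file. Image.verify() only catches what the format plugin can check without a full decode, so this is not a guarantee that the image will decode later.

```python
import io
from typing import Optional

import PIL.Image


def is_readable_image(image: dict) -> bool:
    """Return True if Pillow can open and verify the undecoded image dict.

    Hypothetical helper: expects {"bytes": ..., "path": ...}.
    """
    data: Optional[bytes] = image.get("bytes")
    # Prefer in-memory bytes; otherwise fall back to the (assumed local) path.
    source = io.BytesIO(data) if data is not None else image.get("path")
    if source is None:
        return False
    try:
        with PIL.Image.open(source) as img:
            # Integrity check without decoding the pixel data
            img.verify()
    except Exception:
        return False
    return True
```

It could then be used as dataset.filter(lambda ex: is_readable_image(ex["image"])) in place of has_valid_image.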

nlassaux · August 16, 2023, 3:37pm

Hi @mariosasko, and thanks. That would work, but it requires loading every image twice. I wonder if there is a way to skip the bad images on the fly while loading instead, which would halve the amount of work.

mariosasko · August 16, 2023, 4:15pm

This is the lazy approach:

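import datasets
import PIL.Image

dataset = dataset.cast_column("image", datasets.Image(decode=False))

def invalid_images_as_none(batch):
    # Decode each image on access; unreadable ones become None
    images = []
    for image in batch["image"]:
        try:
            image = PIL.Image.open(image["path"])
        except Exception:
            image = None
        images.append(image)
    # Replace the column with the decoded images
    batch["image"] = images
    return batch

# The transform runs lazily, each time examples are accessed
dataset = dataset.with_transform(invalid_images_as_none)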
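
With this approach the invalid images are still in the dataset, just as None, so whatever consumes the examples has to skip them. A minimal sketch of that step, assuming examples arrive as a list of dicts (for instance at the start of a custom collate function); drop_missing_images is a hypothetical name, not part of datasets:

```python
from typing import Any, Dict, List


def drop_missing_images(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the examples whose "image" field is not None."""
    return [ex for ex in examples if ex.get("image") is not None]


# Stand-in values: a string plays the role of a decoded PIL image here.
batch = [
    {"image": "decoded-image", "filename": "a.jpg"},
    {"image": None, "filename": "b.jpg"},
]
kept = drop_missing_images(batch)
```

Note that a batch can end up smaller than the requested batch size, or even empty, so the downstream code needs to tolerate that.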

nlassaux · August 17, 2023, 8:50am

Thank you! @mariosasko
