This is a quantized version of https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 71.1% one-shot accuracy on ImageNet.

## Usage

First, install DeepSparse with extensions for CLIP. The requirement is quoted so the shell does not treat `>=` as an output redirect:

```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```

Download some test images of a church, a dog, and elephants:

```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

The text model takes a second input that holds the token length of each prompt, so apply this input override before creating the pipeline:

```python
import numpy as np
from deepsparse.clip import CLIPTextPipeline

def custom_process_inputs(self, inputs):
    # Accept a single string as well as a list of strings
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    # Inputs that are not strings are passed through unchanged
    if not isinstance(inputs.text[0], str):
        return inputs.text
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # One length entry per prompt, each set to the sequence length minus one
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```

Then create and run a pipeline in Python:

```python
import numpy as np
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="mgoin/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx"
)

# Score every image against every candidate class
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

# Report the highest-scoring class for each image
for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```