Is finetuning supported for this model? If so, can anyone give me some pointers?
I found this post: https://discuss.huggingface.co/t/finetune-blip-on-customer-dataset-20893/28446
but it is about Salesforce/blip-vqa-base instead of this model, Salesforce/blip-image-captioning-large.
Hi @husjerry,
Fine-tuning for VQA should be done the same way as image-captioning fine-tuning; I think the only difference is how you prompt the model (but I am not sure).
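To illustrate what I mean by the prompting difference, here is a small inference sketch: for captioning the text is an optional prefix the caption continues from, while for VQA the text is the question itself. The image path and prompts are placeholders I made up, not anything from the thread:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering

image = Image.open("example.jpg").convert("RGB")  # placeholder image

# Captioning: the text (if any) is a prefix that the generated caption continues.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
inputs = cap_processor(images=image, text="a photography of", return_tensors="pt")
print(cap_processor.decode(cap_model.generate(**inputs)[0], skip_special_tokens=True))

# VQA: the text is the question you want answered about the image.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
inputs = vqa_processor(images=image, text="how many dogs are in the picture?", return_tensors="pt")
print(vqa_processor.decode(vqa_model.generate(**inputs)[0], skip_special_tokens=True))
```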
You can have a look at the instructions shared in that thread, or refer to the original VQA fine-tuning script: https://github.com/salesforce/BLIP/blob/main/train_vqa.py. You can try it with the HF model, or use their model and then convert it to the HF version with the conversion script here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/convert_blip_original_pytorch_to_hf.py
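In case it helps, below is a minimal captioning fine-tuning sketch using the HF model directly rather than the original BLIP script. The dataset class, file names, and hyperparameters are all placeholder assumptions for illustration; adapt them to your own data:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

class CaptionDataset(Dataset):
    """Hypothetical dataset of (image_path, caption) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        image = Image.open(image_path).convert("RGB")
        # The processor resizes/normalizes the image and tokenizes the caption.
        encoding = processor(images=image, text=caption, padding="max_length",
                             truncation=True, return_tensors="pt")
        return {k: v.squeeze(0) for k, v in encoding.items()}

# Placeholder data; replace with your own (image_path, caption) pairs.
pairs = [("img1.jpg", "a photo of a cat"), ("img2.jpg", "a photo of a dog")]
loader = DataLoader(CaptionDataset(pairs), batch_size=2, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # For captioning, the tokenized caption serves as both input and target;
        # for a cleaner loss you could mask pad tokens in the labels with -100.
        outputs = model(input_ids=batch["input_ids"],
                        pixel_values=batch["pixel_values"],
                        labels=batch["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

This follows the standard pattern for BlipForConditionalGeneration: passing the tokenized caption as labels makes the model return a language-modeling loss you can backpropagate.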