# ROCO-idefics3-8b

This notebook fine-tunes the [Idefics3-8B-Llama3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) model on the [Radiology Objects in Context (ROCO)](https://huggingface.co/datasets/eltorio/ROCO-radiology) dataset, a large-scale multimodal medical imaging collection.

The fine-tuning process saves model checkpoints at regular intervals. Rerun the notebook to resume fine-tuning from the last checkpoint.

## Try to mount Google Drive

In [None]:
try:
  from google.colab import drive
  drive.mount('/content/drive')

except ModuleNotFoundError:
  # Outside Colab: continue, the model will be saved in the current working directory
  print("You are not running this code in Google Colab. Google Drive is not mounted, so the model will be saved locally.")

## Fine-tuning parameters

In [None]:
dataset_id = "eltorio/ROCO-radiology"
prompt = "You are an expert radiologist certified with over 15 years of experience in diagnostic imaging, describe this image"
source_model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
destination_model_id = "eltorio/ROCO-idefics3-8B"
# if Google Drive is mounted, the model will be saved in a folder called IDEFICS3_ROCO in the root of your Google Drive
# else the model will be saved in the current working directory in a folder called IDEFICS3_ROCO
if 'drive' in globals():
    output_dir = "/content/drive/MyDrive/IDEFICS3_ROCO"
else:
    output_dir = "IDEFICS3_ROCO"
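The `'drive' in globals()` test is true whenever the import succeeded, even if the mount itself did not. A minimal sketch of checking the filesystem instead, assuming the mount point used above; `resolve_output_dir` is a hypothetical helper and not part of the notebook.

```python
import os

def resolve_output_dir(drive_root="/content/drive/MyDrive", folder="IDEFICS3_ROCO"):
    """Return a path under drive_root if that directory exists, else a relative folder.

    Hypothetical helper for illustration; the notebook itself checks
    `'drive' in globals()` instead.
    """
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, folder)
    return folder

# Falls back to the relative folder when the Drive mount point is absent
output_dir = resolve_output_dir()
print(output_dir)
```

Checking the directory rather than the imported name keeps the fallback working when the notebook runs outside Colab or when the mount was declined.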

## Log in to Hugging Face

In [None]:
!git config --global credential.helper store
%pip install huggingface_hub

In [None]:
from huggingface_hub import login
import os

# Prefer the HF_TOKEN environment variable, then fall back to Colab user data
HF_TOKEN = os.environ.get('HF_TOKEN') or ""
if HF_TOKEN:
  print("Hugging Face token found in environment variable")
else:
  try:
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN') or ""
  except ModuleNotFoundError:
    pass  # not running in Colab, so there is no user data panel

if not HF_TOKEN:
  raise ValueError("Please set your Hugging Face token in the user data panel, or pass it as an environment variable")

login(
  token=HF_TOKEN,
  add_to_git_credential=True
)
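The lookup order (environment variable first, Colab user data second) can be isolated in a small function so it is easy to check without Colab. This is only a sketch: `resolve_hf_token` and its `env` and `fallback` parameters are hypothetical names introduced for illustration.

```python
import os

def resolve_hf_token(env=None, fallback=None):
    """Return the first non-empty token from env, then from fallback().

    `env` is a mapping (defaults to os.environ); `fallback` is an optional
    zero-argument callable, e.g. a wrapper around a Colab user data lookup.
    Both names are hypothetical, chosen for this illustration.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN") or ""
    if not token and fallback is not None:
        token = fallback() or ""
    if not token:
        raise ValueError("No Hugging Face token found in the environment or the fallback source")
    return token

# Usage with an explicit mapping, so the example does not depend on the real environment
print(resolve_hf_token({"HF_TOKEN": "hf_example"}))
```

In Colab, the fallback could be something like `lambda: userdata.get('HF_TOKEN')`, assuming the secret exists and notebook access to it has been granted.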

##### Optionally clone the model repository

In [None]:
# clone Hugging Face model repository
!git clone https://huggingface.co/$destination_model_id $output_dir

### Step 1: Install libraries and dependencies.

In [7]:
%pip install -q git+https://github.com/huggingface/transformers.git
%pip install -q accelerate datasets peft
%pip install -q bitsandbytes

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### Step 2: Retrieve the dataset from Hugging Face.

In [8]:
from datasets import load_dataset

train_dataset = load_dataset(dataset_id, split="train")

Downloading data: 26 train shards (train-00000-of-00026.parquet to train-00025-of-00026.parquet, 476M to 506M each), validation-00000-of-00001.parquet (276M), test-00000-of-00001.parquet (273M)

Generating train split: 65423 examples

Generating validation split: 8175 examples

Generating test split: 8176 examples
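The split sizes reported in the output above can be turned into proportions with a few lines of arithmetic. The sketch below only uses the three counts from the log and does not load the dataset.

```python
# Split sizes as reported by the dataset generation log above
split_sizes = {"train": 65423, "validation": 8175, "test": 8176}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} examples ({frac:.1%})")
```

This corresponds to roughly an 80/10/10 split; only the train split is used for fine-tuning in this notebook.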

### Step 3: Inspect a sample to detect a wrong Pillow version.

In [17]:
train_dataset[len(train_dataset)-4]

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=1880x2311>,
 'image_id': 'RONA_00001',
 'caption': 'Right shoulder of a 50-year-old patient showing an anterior dislocated shoulder.'}

In [10]:
train_dataset[len(train_dataset)-4]['image']


### Step 4: Configure LoRA adapters

In [18]:
import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

DEVICE = "cuda:0"
USE_LORA = False   # LoRA adapters on the float16 model
USE_QLORA = True   # LoRA adapters on a 4-bit quantized model

processor = AutoProcessor.from_pretrained(
    source_model_id,
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=not USE_QLORA,  # DoRA is enabled only for plain LoRA
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics3ForConditionalGeneration.from_pretrained(
        source_model_id,
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        source_model_id,
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2", # This works for A100 or H100
    ).to(DEVICE)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


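When `target_modules` is a string, PEFT treats it as a regular expression matched against full module names. The sketch below applies the same pattern with `re.fullmatch` to a few module names to show which ones would receive adapters. The module names are hypothetical, written in the style of `named_modules()` keys; the real names should be read from the loaded model.

```python
import re

# Pattern copied from the LoraConfig above
target_modules = r'.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$'

# Hypothetical module names for illustration only
candidates = [
    "model.text_model.layers.0.self_attn.q_proj",
    "model.text_model.layers.0.mlp.down_proj",
    "model.vision_model.encoder.layers.0.self_attn.q_proj",
    "model.text_model.layers.0.input_layernorm",
]

# Keep only the names the whole pattern matches
matched = [name for name in candidates if re.fullmatch(target_modules, name)]
for name in candidates:
    print(f"{name}: {'adapted' if name in matched else 'skipped'}")
```

Under these assumed names, the projection layers of the text model are adapted, while the vision encoder and the normalization layers are left untouched.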

### Step 5: Create Data Collator for IDEFICS3 format.

In [23]:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, samples):
        texts = []
        images = []
        for sample in samples:
            image = sample["image"]
            answer = sample["caption"]
            messages = [
                {
                    "role": "system",
                    "content": [
                        {"type": "text", "text": prompt}
                    ]

                },
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
            images.append([image.convert('RGB')])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        # Labels are a copy of the input ids, with padding positions replaced by the image token id
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels

        return batch

data_collator = MyDataCollator(processor)
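The last step of the collator copies the input ids into the labels and overwrites every padding position with the image token id, presumably so that padding is not learned as ordinary text (that reading is an assumption). A minimal sketch of the same replacement on plain Python lists instead of tensors; `mask_padding` and the token ids are hypothetical.

```python
def mask_padding(input_ids, pad_token_id, mask_id):
    """Copy input_ids and replace every padding position with mask_id."""
    return [
        [mask_id if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]

# Hypothetical token ids for illustration only
PAD, IMAGE = 0, 99
batch_input_ids = [
    [5, 6, 7, PAD, PAD],   # short sequence, padded on the right
    [5, 8, 9, 10, 11],     # full-length sequence
]

labels = mask_padding(batch_input_ids, pad_token_id=PAD, mask_id=IMAGE)
print(labels)
```

The input batch is left unchanged, which mirrors the `clone()` call in the collator.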

### Step 6: Setup training parameters

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir = output_dir,
    overwrite_output_dir = False,
    auto_find_batch_size = True,
    learning_rate = 2e-4,
    fp16 = True,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 2,
    gradient_accumulation_steps = 8,
    dataloader_pin_memory = False,
    save_total_limit = 3,
    evaluation_strategy = None, # no evaluation is run, so eval_steps has no effect
    save_strategy = "steps",
    eval_steps = 100,
    save_steps = 10, # save a checkpoint every 10 steps
    resume_from_checkpoint = True,
    logging_steps = 5,
    remove_unused_columns = False,
    push_to_hub = True,
    label_names = ["labels"],
    load_best_model_at_end = False,
    report_to = "none",
    optim = "paged_adamw_8bit",
)
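The batch size, gradient accumulation and checkpoint settings above determine how many examples are seen between two checkpoints. The sketch below works through the arithmetic, assuming a single GPU and that `auto_find_batch_size` does not lower the batch size during the run.

```python
import math

# Values from the cells above; a single GPU is assumed
num_examples = 65423            # train split size
per_device_batch_size = 2
gradient_accumulation_steps = 8
save_steps = 10

# One optimizer step consumes batch size x accumulation steps examples
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
examples_per_checkpoint = save_steps * effective_batch_size

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch (approx.): {steps_per_epoch}")
print(f"examples between checkpoints: {examples_per_checkpoint}")
```

With these values a checkpoint is written after every 160 examples, which is frequent but limits the work lost when a Colab session is interrupted.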

In [25]:
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
)



### Step 7: Start (or restart) Training

In [None]:
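# Resumes from the last checkpoint in output_dir; on a first run with no checkpoint, pass resume_from_checkpoint=False
trainer.train(resume_from_checkpoint = True)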

| Step | Training Loss |
|------|---------------|
| 5 | 4.2975 |
| 10 | 0.4263 |
| 15 | 0.4014 |
| 20 | 0.3397 |
| 25 | 0.3572 |
| 30 | 0.3628 |
| 35 | 0.3096 |
| 40 | 0.3353 |
| 45 | 0.3733 |
| 50 | 0.2958 |
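To make the same cell work for both a first run and a restart, the last checkpoint can be looked up before calling `trainer.train`. The sketch below relies on the `checkpoint-<step>` folder naming used by the Trainer; `find_last_checkpoint` is a hypothetical helper written for illustration (Transformers ships a similar `get_last_checkpoint` in `transformers.trainer_utils`).

```python
import os
import re

def find_last_checkpoint(folder):
    """Return the path of the highest-numbered `checkpoint-<step>` subfolder, or None."""
    if not os.path.isdir(folder):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(folder):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    # Compare step numbers as integers so that checkpoint-20 ranks above checkpoint-9
    return os.path.join(folder, f"checkpoint-{max(steps)}")

# With the notebook's trainer this could be used as:
#   trainer.train(resume_from_checkpoint=find_last_checkpoint(output_dir))
print(find_last_checkpoint("IDEFICS3_ROCO"))
```

Passing `None` starts training from scratch, while passing a checkpoint path resumes from it, so no manual switch is needed between the first run and later ones.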


