Update README.md
README.md
CHANGED
@@ -27,8 +27,7 @@ It achieves the following results on the evaluation set:

**Outputs:**
- **Bounding Boxes:** The model outputs bounding box locations as special <loc[value]> tokens, where value is a number representing a normalized coordinate. Each detection consists of four location tokens in the order y_min, x_min, y_max, x_max, followed by the label detected in that box. To convert the values to coordinates, first divide each number by 1024, then multiply y by the image height and x by the image width. This gives the bounding box coordinates relative to the original image size.

- You can use the following script to convert the model output into PASCAL VOC format.
+ If everything goes smoothly, the model will output text similar to "<loc[value]><loc[value]><loc[value]><loc[value]> table; <loc[value]><loc[value]><loc[value]><loc[value]> table", depending on the number of tables detected in the image. You can then use the following script to convert the text output into PASCAL VOC format.

```python
import re
# ... (the rest of the conversion script is unchanged and elided in this diff)
```
@@ -91,6 +90,36 @@ with torch.inference_mode():
```python
    print(bbox_text)
```
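The body of the conversion script is elided between the two hunks above. As a rough, self-contained sketch of the coordinate math described in the **Bounding Boxes** section (divide each <loc> value by 1024, then scale y by the image height and x by the image width), something like the following could be used; the function name, regular expression, and output layout are illustrative assumptions, not the model card's own script.

```python
import re

def loc_tokens_to_pascal_voc(bbox_text, image_width, image_height):
    """Illustrative sketch: turn '<loc####>...<loc####> label' text into
    PASCAL VOC style (x_min, y_min, x_max, y_max, label) tuples."""
    boxes = []
    # Detections are separated by ';', each with four <loc...> tags and a label.
    for detection in bbox_text.split(";"):
        values = [int(v) for v in re.findall(r"<loc(\d+)>", detection)]
        if len(values) < 4:
            continue  # skip malformed detections
        # Model order is y_min, x_min, y_max, x_max, normalized to 0-1024.
        y_min, x_min, y_max, x_max = values[:4]  # extra tags beyond four are ignored
        label = detection.split(">")[-1].strip()
        boxes.append((
            round(x_min / 1024 * image_width),
            round(y_min / 1024 * image_height),
            round(x_max / 1024 * image_width),
            round(y_max / 1024 * image_height),
            label,
        ))
    return boxes

# Example: loc_tokens_to_pascal_voc(bbox_text, image_width=1280, image_height=960)
```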

**Warning:** You can also load a quantized 4-bit or 8-bit model using `bitsandbytes`. Be aware, though, that the model can then generate outputs that require further post-processing, for example five "<loc[value]>" location tags instead of four, or labels other than "table". The provided post-processing script should handle the first case.

Use the following to load the 4-bit quantized model:

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor, BitsAndBytesConfig
import torch

model_id = "ucsahin/paligemma-3b-mix-448-ft-TableDetection"

device = "cuda:0"
dtype = torch.bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=dtype
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    quantization_config=bnb_config
)

processor = PaliGemmaProcessor.from_pretrained(model_id)
```
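As a usage sketch only (not part of the model card), the quantized model and processor can be run in the same way as the full-precision checkpoint. The prompt string, image path, and the final label filter below are assumptions for illustration; the filter addresses the second case mentioned in the warning, which the provided post-processing script does not cover.

```python
from PIL import Image

# Hypothetical inputs: the image path and prompt are placeholders, not from the model card.
image = Image.open("document_page.png")
prompt = "detect table"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
bbox_text = processor.decode(output_ids[0][input_len:], skip_special_tokens=True)

# Keep only detections labeled exactly "table"; the quantized model may emit other labels.
detections = [d.strip() for d in bbox_text.split(";")
              if d.strip().split(">")[-1].strip() == "table"]
print(detections)
```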

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->