ucsahin committed · Commit 327b523 · verified · 1 Parent(s): d01757d

Update README.md

Files changed (1): README.md +31 -2
README.md CHANGED
@@ -27,8 +27,7 @@ It achieves the following results on the evaluation set:

 **Outputs:**
 - **Bounding Boxes:** The model outputs the location for the bounding box coordinates in the form of special <loc[value]> tokens, where value is a number that represents a normalized coordinate. Each detection is represented by four location coordinates in the order y_min, x_min, y_max, x_max, followed by the label that was detected in that box. To convert values to coordinates, you first need to divide the numbers by 1024, then multiply y by the image height and x by its width. This will give you the coordinates of the bounding boxes, relative to the original image size.
-
- You can use the following script to convert the model output into PASCAL VOC format.
+ If everything goes smoothly, the model will output text similar to "<loc[value]><loc[value]><loc[value]><loc[value]> table; <loc[value]><loc[value]><loc[value]><loc[value]> table", depending on the number of tables detected in the image. You can then use the following script to convert the text output into PASCAL VOC format.

 ```python
 import re
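(Illustration, not part of this commit.) The conversion script referenced above is only partially visible in the diff context, so the sketch below shows roughly what the described conversion looks like, assuming output text of the form shown in the added line. The function name `paligemma_output_to_voc` and the parsing details are placeholders rather than the model card's actual code; it also keeps only the first four `<loc[value]>` tags per detection, which matches the malformed-output case mentioned later in this diff.

```python
import re

def paligemma_output_to_voc(bbox_text, image_width, image_height):
    """Convert '<loc[value]>' detections into PASCAL VOC style boxes.

    Expects text like "<loc0123><loc0456><loc0789><loc1000> table; ..." and
    returns a list of (label, [x_min, y_min, x_max, y_max]) in pixel units.
    """
    detections = []
    for chunk in bbox_text.split(";"):
        # Four normalized coordinates per detection, in the order y_min, x_min, y_max, x_max.
        locs = [int(v) for v in re.findall(r"<loc(\d+)>", chunk)]
        label = re.sub(r"<loc\d+>", "", chunk).strip()
        if len(locs) < 4 or not label:
            continue  # skip malformed detections
        y_min, x_min, y_max, x_max = locs[:4]  # keep the first four if extra tags appear
        box = [
            round(x_min / 1024 * image_width),
            round(y_min / 1024 * image_height),
            round(x_max / 1024 * image_width),
            round(y_max / 1024 * image_height),
        ]
        detections.append((label, box))
    return detections

# Example: two detected tables in a 1280x960 image.
print(paligemma_output_to_voc(
    "<loc0102><loc0050><loc0512><loc0900> table; <loc0600><loc0100><loc0980><loc0990> table",
    image_width=1280, image_height=960,
))
```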
@@ -91,6 +90,36 @@ with torch.inference_mode():
 print(bbox_text)
 ```

+ **Warning:** You can also load a quantized 4-bit or 8-bit model using `bitsandbytes`. Be aware, though, that the quantized model can generate outputs that require further post-processing, for example five "<loc[value]>" location tags instead of four, or labels other than "table". The provided post-processing script should handle the first case.
+
+ Use the following to load the 4-bit quantized model:
+
+ ```python
+ from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor, BitsAndBytesConfig
+ import torch
+
+ model_id = "ucsahin/paligemma-3b-mix-448-ft-TableDetection"
+
+ device = "cuda:0"
+ dtype = torch.bfloat16
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=dtype
+ )
+
+ model = PaliGemmaForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=dtype,
+     device_map=device,
+     quantization_config=bnb_config
+ )
+
+ processor = PaliGemmaProcessor.from_pretrained(model_id)
+ ```
+
+
 ## Bias, Risks, and Limitations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
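(Illustration, not part of this commit.) As a usage note for the 4-bit loading snippet added above: once `model` and `processor` are loaded, inference might look roughly like the sketch below. The image path, the `detect table` prompt string, and the generation settings are placeholders; the model card's full inference example (the `with torch.inference_mode():` block referenced in the second hunk) is the authoritative version.

```python
from PIL import Image
import torch

# Placeholders: swap in your own image and check the model card for the exact prompt format.
image = Image.open("table_page.jpg").convert("RGB")
prompt = "detect table"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Keep only the newly generated tokens, then decode them into the <loc[value]> text.
input_len = inputs["input_ids"].shape[-1]
bbox_text = processor.decode(generated_ids[0][input_len:], skip_special_tokens=True)
print(bbox_text)
```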
 
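(Illustration, not part of this commit.) The warning also mentions 8-bit loading; assuming only the quantization config changes, an 8-bit variant of the same `from_pretrained` call might use:

```python
from transformers import BitsAndBytesConfig

# 8-bit alternative to the 4-bit config shown in the diff; pass it as quantization_config
# to the same PaliGemmaForConditionalGeneration.from_pretrained(...) call.
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
```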