Update README.md
README.md
CHANGED
@@ -27,8 +27,7 @@ It achieves the following results on the evaluation set:

**Outputs:**
- **Bounding Boxes:** The model outputs bounding box locations as special <loc[value]> tokens, where value is a number representing a normalized coordinate. Each detection consists of four location tokens in the order y_min, x_min, y_max, x_max, followed by the label detected in that box. To convert the values to coordinates, first divide each number by 1024, then multiply y by the image height and x by the image width. This gives the bounding box coordinates relative to the original image size.

- You can use the following script to convert the model output into PASCAL VOC format.
+ If everything goes smoothly, the model will output text similar to "<loc[value]><loc[value]><loc[value]><loc[value]> table; <loc[value]><loc[value]><loc[value]><loc[value]> table", depending on the number of tables detected in the image. You can then use the following script to convert the text output into PASCAL VOC format.

```python
import re
# ... (the rest of the conversion script is unchanged and elided in this diff)
```
@@ -91,6 +90,36 @@ with torch.inference_mode():
```python
    print(bbox_text)
```
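The body of the conversion script is elided between the two hunks above. As a rough, self-contained sketch of the coordinate math described in the **Bounding Boxes** section (divide each <loc> value by 1024, then scale y by the image height and x by the image width), something like the following could be used; the function name, regular expression, and output layout are illustrative assumptions, not the model card's own script.

```python
import re

def loc_tokens_to_pascal_voc(bbox_text, image_width, image_height):
    """Illustrative sketch: turn '<loc####>...<loc####> label' text into
    PASCAL VOC style (x_min, y_min, x_max, y_max, label) tuples."""
    boxes = []
    # Detections are separated by ';', each with four <loc...> tags and a label.
    for detection in bbox_text.split(";"):
        values = [int(v) for v in re.findall(r"<loc(\d+)>", detection)]
        if len(values) < 4:
            continue  # skip malformed detections
        # Model order is y_min, x_min, y_max, x_max, normalized to 0-1024.
        y_min, x_min, y_max, x_max = values[:4]  # extra tags beyond four are ignored
        label = detection.split(">")[-1].strip()
        boxes.append((
            round(x_min / 1024 * image_width),
            round(y_min / 1024 * image_height),
            round(x_max / 1024 * image_width),
            round(y_max / 1024 * image_height),
            label,
        ))
    return boxes

# Example: loc_tokens_to_pascal_voc(bbox_text, image_width=1280, image_height=960)
```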

**Warning:** You can also load a quantized 4-bit or 8-bit model using `bitsandbytes`. Be aware, though, that the model can then generate outputs that require further post-processing, for example five "<loc[value]>" location tags instead of four, or labels other than "table". The provided post-processing script should handle the first case.

Use the following to load the 4-bit quantized model:

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor, BitsAndBytesConfig
import torch

model_id = "ucsahin/paligemma-3b-mix-448-ft-TableDetection"

device = "cuda:0"
dtype = torch.bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=dtype
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    quantization_config=bnb_config
)

processor = PaliGemmaProcessor.from_pretrained(model_id)
```
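As a usage sketch only (not part of the model card), the quantized model and processor can be run in the same way as the full-precision checkpoint. The prompt string, image path, and the final label filter below are assumptions for illustration; the filter addresses the second case mentioned in the warning, which the provided post-processing script does not cover.

```python
from PIL import Image

# Hypothetical inputs: the image path and prompt are placeholders, not from the model card.
image = Image.open("document_page.png")
prompt = "detect table"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
bbox_text = processor.decode(output_ids[0][input_len:], skip_special_tokens=True)

# Keep only detections labeled exactly "table"; the quantized model may emit other labels.
detections = [d.strip() for d in bbox_text.split(";")
              if d.strip().split(">")[-1].strip() == "table"]
print(detections)
```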

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->