---
tags:
- ocr
- vision
---
**Note:** ORIGINAL MODEL REPO: https://github.com/Ucas-HaoranWei/GOT-OCR2.0

---

<h3><a href="">General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model</a></h3>

<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/"><img src="https://img.shields.io/badge/Project-Page-Green"></a>
<a href="https://arxiv.org/abs/2409.01704"><img src="https://img.shields.io/badge/Paper-PDF-orange"></a>
<a href="https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/assets/wechat.jpg"><img src="https://img.shields.io/badge/Wechat-blue"></a>
<a href="https://zhuanlan.zhihu.com/p/718163422"><img src="https://img.shields.io/badge/zhihu-red"></a>

[Haoran Wei*](https://scholar.google.com/citations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, [Zheng Ge](https://joker316701882.github.io/), Liang Zhao, [Jianjian Sun](https://scholar.google.com/citations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https://scholar.google.com.hk/citations?user=J0ko04IAAAAJ&hl=zh-CN&oi=ao), Chunrui Han, [Xiangyu Zhang](https://scholar.google.com/citations?user=yuB-cfoAAAAJ&hl=en)

<p align="center">
<img src="assets/got_logo.png" style="width: 200px" align=center>
</p>


## Release

- [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this [repo](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/blob/main/GOT-OCR-2.0-paper.pdf). We have also submitted it to arXiv.
- [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!


[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)

**Usage and License Notices**: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of Vary.


## Community contributions
We encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions:

[Colab of GOT](https://colab.research.google.com/drive/1nmiNciZ5ugQVp4rFbL9ZWpEPd92Y9o7p?usp=sharing) ~ contributor: [@Zizhe Wang](https://github.com/PaperPlaneDeemo)

## Contents
- [Install](#install)
- [GOT Weights](#got-weights)
- [Demo](#demo)
- [Train](#train)
- [Eval](#eval)

***
<p align="center">
<img src="assets/got_support.jpg" style="width: 800px" align=center>
</p>
<p align="center">
<a href="">Towards OCR-2.0 via a Unified End-to-end Model</a>
</p>

***


## Install
0. Our environment is CUDA 11.8 + torch 2.0.1.
1. Clone this repository and navigate to the GOT folder
```bash
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
```
2. Install Package
```Shell
conda create -n got python=3.10 -y
conda activate got
pip install -e .
```

3. Install Flash-Attention
```Shell
pip install ninja
pip install flash-attn --no-build-isolation
```
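
As an optional sanity check (our suggestion, not part of the original instructions), you can confirm that the environment matches the tested CUDA 11.8 + torch 2.0.1 setup and that flash-attn built correctly:
```Shell
# confirm torch version, CUDA build, and a visible GPU
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# confirm flash-attn imports after compilation
python3 -c "import flash_attn; print(flash_attn.__version__)"
```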
## GOT Weights
- [Google Drive](https://drive.google.com/drive/folders/1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN?usp=sharing)
- [BaiduYun](https://pan.baidu.com/s/1G4aArpCOt6I_trHv_1SE2g) code: OCR2
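
To fetch the Google Drive folder from the command line, a sketch like the following can work in recent versions of the third-party gdown tool (not part of this repo); the /GOT_weights/ target simply matches the paths used in the demo commands below:
```Shell
# hedged sketch: download the shared folder with the third-party gdown tool
pip install gdown
gdown --folder "https://drive.google.com/drive/folders/1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN" -O /GOT_weights/
```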

## Demo
1. Plain text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr
```
2. Formatted text OCR:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format
```
3. Fine-grained OCR, guided by a box or a color (a concrete box example is sketched after this list):
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --box [x1,y1,x2,y2]
```
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --color red/green/blue
```
4. Multi-crop OCR:
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /an/image/file.png
```
5. Multi-page OCR (the image path contains multiple .png files; an example layout is sketched after this list):
```Shell
python3 GOT/demo/run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /images/path/ --multi-page
```
6. Render the formatted OCR results:
```Shell
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format --render
```
**Note**:
The rendered results are written to /results/demo.html. Open demo.html in a browser to see them.
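
For concreteness, a box invocation for fine-grained OCR might look like the sketch below; the coordinates are purely illustrative pixel values (x1,y1 = top-left corner, x2,y2 = bottom-right corner):
```Shell
# illustrative only: OCR the region with top-left (100,200) and bottom-right (600,400)
python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr --box [100,200,600,400]
```

For multi-page OCR, the directory passed via --image-file is assumed to hold one .png per page; the file names below are hypothetical:
```Shell
# hypothetical per-page layout for multi-page OCR
ls /images/path/
# page_1.png  page_2.png  page_3.png
```

If you run the render demo on a remote machine, one way to view the output (our suggestion, not part of the repo) is Python's built-in HTTP server:
```Shell
# serve the rendered output over HTTP (Python 3.7+), then browse to http://<host>:8000/demo.html
python3 -m http.server 8000 --directory /results/
```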


## Train
1. This codebase only supports post-training (stage-2/stage-3) upon our GOT weights.
2. If you want to train from stage-1 as described in our paper, you need this [repo](https://github.com/Ucas-HaoranWei/Vary-tiny-600k).

```Shell
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
  --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json --model_name_or_path /GOT_weights/ \
  --use_im_start_end True \
  --bf16 True \
  --gradient_accumulation_steps 2 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --weight_decay 0. \
  --warmup_ratio 0.001 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --tf32 True \
  --model_max_length 8192 \
  --gradient_checkpointing True \
  --dataloader_num_workers 8 \
  --report_to none \
  --per_device_train_batch_size 2 \
  --num_train_epochs 1 \
  --learning_rate 2e-5 \
  --datasets pdf-ocr+scence \
  --output_dir /your/output.path
```
**Note**:
1. Change the corresponding data information in constant.py.
2. Change line 37 in conversation_dataset_qwen.py to your data_name.


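The deepspeed launcher above runs on all visible local GPUs by default; to pin the job to specific devices, the standard --include flag can be used (a sketch, with the remaining training flags elided as in the full command above):
```Shell
# sketch: restrict the launcher to GPUs 0-3; append the same training flags as above
deepspeed --include localhost:0,1,2,3 /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
  --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json --model_name_or_path /GOT_weights/ \
  --output_dir /your/output.path  # ...plus the remaining flags from the full command
```
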
## Eval
1. We use the [Fox](https://github.com/ucaslcl/Fox) and [OneChart](https://github.com/LingyvKong/OneChart) benchmarks; other benchmarks can be found in the weights download link.
2. The eval codes can be found in GOT/eval.
3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, --num-chunks can be set to 8.
```Shell
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
```
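
To derive --num-chunks from however many GPUs are actually visible, a small wrapper works (our sketch; nvidia-smi -L prints one line per GPU):
```Shell
# sketch: set --num-chunks to the number of visible GPUs
NUM_GPUS=$(nvidia-smi -L | wc -l)
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks "$NUM_GPUS" --datatype OCR
```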

## Contact
If you are interested in this work or have questions about the code or the paper, please join our [Wechat]() communication group.

## Acknowledgement
- [Vary](https://github.com/Ucas-HaoranWei/Vary/): the codebase we built upon!
- [Qwen](https://github.com/QwenLM/Qwen): the LLM base model of Vary, which is good at both English and Chinese!


## Citation
```bibtex
@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
```