Image-Text-to-Text
PyTorch
English
llava
zliubot committed · verified
Commit 6f6ee98 · 1 Parent(s): c1dd284

Upload 4 files

Files changed (4)
  1. README.md +81 -0
  2. config.json +40 -0
  3. generation_config.json +9 -0
  4. gitattributes +35 -0
README.md ADDED
@@ -0,0 +1,81 @@
---
datasets:
- SpursgoZmy/MMTab
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
metrics:
- accuracy
- bleu
- f1
pipeline_tag: image-text-to-text
---
# Table LLaVA Model Card

<!-- Provide a quick summary of what the model is/does. -->

Table LLaVA 7B is an open-source multimodal chatbot for understanding a wide range of table images and fulfilling diverse table-related requests, e.g., question answering, table cell description, and table structure understanding.

See the ACL 2024 paper for more details: [Multimodal Table Understanding](https://arxiv.org/abs/2406.08100)

## Model Details

<!-- Provide a longer summary of what this model is. -->

**Model Type:** Table LLaVA 7B strictly follows the [LLaVA-v1.5](https://arxiv.org/abs/2310.03744) model architecture and training pipeline,
with [CLIP-ViT-L-336px](https://huggingface.co/openai/clip-vit-large-patch14-336) as the visual encoder (336*336 image resolution),
[Vicuna-v1.5-7B](https://huggingface.co/lmsys/vicuna-7b-v1.5) as the base LLM, and a two-layer MLP as the vision-language connector.

It was trained with the same two-stage pipeline as LLaVA:

1. Pre-training: train the vision-language connector with image-caption data and table recognition data.
2. Instruction tuning: train the vision-language connector and the base LLM with multimodal instruction-following data covering tabular and non-tabular tasks.

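Concretely, the vision-language connector is a two-layer MLP with a GELU activation (the `mlp2x_gelu` projector type recorded in the `config.json` below). The following PyTorch sketch only illustrates its shape, using the dimensions from that config (`mm_hidden_size` = 1024, `hidden_size` = 4096); the class and variable names are illustrative and are not taken from the LLaVA code base.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Illustrative two-layer MLP connector ("mlp2x_gelu"): maps CLIP patch
    features (mm_hidden_size=1024) into the LLM embedding space (hidden_size=4096)."""

    def __init__(self, mm_hidden_size: int = 1024, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mm_hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, mm_hidden_size) from CLIP-ViT-L-336px,
        # which yields 24x24 = 576 patch tokens for a 336x336 image.
        return self.proj(image_features)


# Quick shape check with dummy CLIP features.
dummy_patches = torch.randn(1, 576, 1024)
print(MLPProjector()(dummy_patches).shape)  # torch.Size([1, 576, 4096])
```
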
**Code Base:** We use the official code of [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA) for model training and inference,
and the saved model checkpoint is uploaded to this repository. Thus, Table LLaVA can be used in the same way as the normal LLaVA v1.5 model with its original code.

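For example, following the quick-start documented in the LLaVA repository, a single table-QA query can be run roughly as below. The `eval_model` helper and its argument names come from LLaVA-v1.5; the checkpoint path, image path, and question are placeholders for illustration, not tested values.

```python
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

# Placeholder: point this at the downloaded checkpoint of this repository.
model_path = "./table-llava-v1.5-7b"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "What is the value in the last row of the 'Total' column?",  # example question
    "conv_mode": None,
    "image_file": "table_example.png",  # placeholder path to a table image
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```
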
**Model Date:** Table LLaVA 7B was trained in January 2024.

**Where to send questions or comments about the model:** https://github.com/SpursGoZmy/Table-LLaVA/issues

## Training dataset

The training data includes the original LLaVA-1.5 data and specially constructed
multimodal instruction-following data from the [MMTab dataset](https://huggingface.co/datasets/SpursgoZmy/MMTab),
a large-scale dataset covering a wide range of table images and table-related tasks.

| Training Stage | Data Description | Data Size | Hugging Face Dataset |
| :---: | :---: | :---: | :---: |
| Pre-training | 558K original LLaVA-1.5 pre-training data | 558K | [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) |
| | 150K table recognition data | 150K | [MMTab-pre_pretrain_data_llava_format_150K.json](https://huggingface.co/datasets/SpursgoZmy/MMTab) |
| Instruction Fine-tuning | 665K original LLaVA-1.5 fine-tuning data | 665K | [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) |
| | 232K multimodal instruction tuning data of 14 tabular tasks | 232K | [MMTab-instruct_sft_data_llava_format_232K.json](https://huggingface.co/datasets/SpursgoZmy/MMTab) |

We also provide the merged pre-training and instruction fine-tuning data in the MMTab dataset,
i.e., enhanced_llava_pretrain_data_708K.json and enhanced_llava_sft_data_898K.json, which were used to train Table LLaVA.

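For reference, the `*_llava_format_*.json` files listed above are assumed to follow the standard LLaVA conversation format (a list of records with an `image` path and alternating `human`/`gpt` turns, where `<image>` marks the image position). The record below is a hand-written illustration of that layout, not an entry copied from MMTab.

```python
import json

# Illustrative record in the LLaVA-style conversation format; all values are made up.
record = {
    "id": "example-0",
    "image": "images/example_table.jpg",  # placeholder relative path to a table image
    "conversations": [
        {"from": "human", "value": "<image>\nHow many rows does this table contain?"},
        {"from": "gpt", "value": "The table contains 5 rows."},
    ],
}
print(json.dumps(record, indent=2))
```
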
## Evaluation dataset

A collection of 17 held-in and 7 held-out tabular benchmarks, covering 15 table-related tasks, e.g., table question answering and table-to-text generation.
We also evaluate Table LLaVA on two non-tabular benchmarks:
[TextVQA](https://textvqa.org/) and [llava-bench-in-the-wild](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild).

## License

Table LLaVA is based on LLaVA-1.5 and thus follows its license. Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Intended use

**Primary intended uses:** The primary use of Table LLaVA is research on large multimodal models and chatbots, especially for multimodal table understanding.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Limitations

Table LLaVA takes a single table image as model input; support for multiple table images would enable more application scenarios. Although Table LLaVA demonstrates strong performance on a wide range of table-based tasks, the resolution of its input images (336*336) is relatively low, which may limit the upper bound of its capacity. Fortunately, with the emergence of MLLMs that support higher input image resolutions (e.g., Monkey (Li et al., 2023d), LLaVA-Next (Liu et al., 2024)), researchers can use MMTab to develop more powerful tabular MLLMs in future research.
config.json ADDED
@@ -0,0 +1,40 @@
{
  "_name_or_path": "table-llava-v1.5-7b",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "openai/clip-vit-large-patch14-336",
  "model_type": "llava",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2560,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.31.0",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "vocab_size": 32000
}
generation_config.json ADDED
@@ -0,0 +1,9 @@
{
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.9,
  "top_p": 0.6,
  "transformers_version": "4.31.0"
}
gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text