Tags: Visual Question Answering · Transformers · Safetensors · llava · image-text-to-text · AIGC · LLaVA · Inference Endpoints
ponytail committed (verified) · Commit 03e227f · 1 Parent(s): df3e4f9

Update README.md

Files changed (1):
  1. README.md +7 -6
README.md CHANGED
@@ -24,16 +24,17 @@ pipeline_tag: visual-question-answering
 
 Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance performance on a variety of downstream visual-language understanding tasks. However, models trained on the general domain often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to provide a unified multimodal vision-language model for human-related tasks.
 
- Specifically, (1) we first construct a large-scale and high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose to construct a multi-granularity caption for human-related images (coming soon), covering the human face, the human body, and the whole image, and use it to fine-tune a large language model. Finally, we evaluate our model on a series of downstream tasks; our Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
+ Specifically, (1) we first construct **a large-scale and high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose to construct **a multi-granularity caption for human-related images** (coming soon), covering the human face, the human body, and the whole image, and use it to fine-tune a large language model. Finally, we evaluate our model on a series of downstream tasks; our **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
 
 
 ## Result
 
 ## News and Update πŸ”₯πŸ”₯πŸ”₯
- * 2024.09.08 **πŸ€—[Human_LLaVA_8B](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!πŸ‘πŸ‘πŸ‘**
+ * Sep. 8, 2024: **πŸ€—[Human-LLaVA-8B](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!πŸ‘πŸ‘πŸ‘**
 
 
- ## How to Use
+ ## πŸ€— Transformers
+ To use Human-LLaVA for inference, you only need the few lines of code demonstrated below. However, please make sure that you are using the latest code.
 ``` python
 import requests
 from PIL import Image
@@ -67,6 +68,7 @@ print(predict)
 HumanCaption-10M (self-constructed): Coming Soon!
 
 #### Instruction Tuning Stage
+ All public datasets have been filtered, and we will consider publishing all processed text in the future.
 
 HumanCaptionHQ-300K (self-constructed): Coming Soon!
 
@@ -76,13 +78,12 @@ humanvg_high_reg (self-constructed): Coming Soon!
 
 humanvg_high_rec (self-constructed): Coming Soon!
 
- celeba_attribute (self-constructed):
+ celeba_attribute (self-constructed): [CelebA](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
 
- ShareGPT4V_caption:
+ ShareGPT4V: [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md)
 
 LLaVA-Instruct_zh:
 
- ShareGPT4V_vqa:
 
 verified_ref3rec:
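
The usage code referenced by the new `## πŸ€— Transformers` section is truncated in this diff (only the opening imports and the closing `print(predict)` are visible). As orientation only, here is a minimal sketch of how a LLaVA-family checkpoint such as `OpenFace-CQUPT/Human_LLaVA` is typically loaded and queried with the standard Transformers API; the model class, prompt format, generation settings, and example image URL are assumptions rather than the README's actual code.

``` python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: Human-LLaVA follows the standard LLaVA interface in Transformers;
# the exact model class and prompt template should be taken from the full README.
model_id = "OpenFace-CQUPT/Human_LLaVA"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image; any photo containing a person suits a human-centric query.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5-style prompt format (assumed, not confirmed by this commit).
prompt = "USER: <image>\nDescribe the person in this image. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

# Greedy decoding, then strip special tokens; mirrors the README's final `print(predict)`.
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
predict = processor.decode(output_ids[0], skip_special_tokens=True)
print(predict)
```

If the checkpoint ships custom processing code, loading the processor may additionally require `trust_remote_code=True`; the full README on the model page is the authoritative reference.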