Visual Question Answering · Transformers · Safetensors · llava · image-text-to-text · AIGC · LLaVA · Inference Endpoints
ddw2AIGROUP2CQUPT committed · verified · Commit bee6755 · 1 Parent(s): 9352092

Update README.md

Files changed (1)
  1. README.md +4 -7
README.md CHANGED
@@ -8,6 +8,10 @@ tags:
---
# Human-LLaVA-(HumanCaption-10M dataset)

+ ## DEMO
+
+ <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
+
### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of a wide range of downstream visual-language understanding tasks. However, models trained on general-domain data often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, with the aim of building a unified multimodal vision-language model for human-related tasks.
@@ -15,18 +19,11 @@ Human-related vision and language tasks are widely applied across various social
Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, and use them to fine-tune the large language model. Lastly, we evaluate our model on a series of downstream tasks; Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it achieves the best results on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.


- ## DEMO
-
- <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
-
-
## Result




-
-
## How to Use
``` python
import requests
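
The "How to Use" snippet is truncated in the hunk above, so only `import requests` is visible. As a minimal sketch of how a LLaVA-style checkpoint is typically loaded with the `transformers` API, the following may help; the repository id, prompt template, and test image URL are illustrative assumptions, not taken from this commit.

``` python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# NOTE: the repository id below is an assumption for illustration only;
# use the actual Human-LLaVA model id from the Hugging Face Hub.
model_id = "ddw2AIGROUP2CQUPT/HumanLLaVA"

# Load the checkpoint and its processor (half precision, auto device placement).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The prompt template is an assumption; follow the template shipped with the model.
prompt = "USER: <image>\nPlease describe this picture.\nASSISTANT:"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess, generate, and decode a response.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```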