ddw2AIGROUP2CQUPT
committed on
Update README.md
README.md
CHANGED

# Human-LLaVA (HumanCaption-10M dataset)

## DEMO

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>

## Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can improve performance on a wide range of downstream vision-language understanding tasks. However, models trained on the general domain often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to serve as a unified multimodal vision-language model for human-related tasks.

Specifically, (1) we first construct a large-scale, high-quality human-related image-caption dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); and (2) we construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which we then use to fine-tune the large language model. Finally, we evaluate our model on a series of downstream tasks: Human-LLaVA achieves the best overall performance among multimodal models of similar scale, and in particular the best performance on a range of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
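
To make the multi-granularity captions concrete, a single record in such a dataset could pair one image with a caption at each granularity (face, body, whole image). The sketch below is illustrative only; the field names and the released data format are assumptions, not taken from this page.

``` python
# Hypothetical multi-granularity caption record. The schema is an
# illustrative assumption, not the released HumanCaption-10M format.
record = {
    "image": "example_person.jpg",  # path or URL of the source image
    "captions": {
        "face": "A middle-aged man with short gray hair and glasses.",
        "body": "A man in a dark suit standing with his arms crossed.",
        "whole_image": "A man in a dark suit stands in front of an office building.",
    },
}
```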
## Result
## How to Use

``` python
import requests
```
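
The snippet above is cut off in this view. Below is a minimal sketch of what loading and querying the model might look like with Hugging Face transformers, assuming the checkpoint follows the standard LLaVA interface; the repository id, image URL, and prompt template are assumptions, not taken from this page.

``` python
# A minimal sketch, assuming a standard LLaVA-style checkpoint usable via
# transformers. The repository id and prompt template are assumptions.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "OpenFace-CQUPT/Human_LLaVA"  # assumed repository id

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch a test image over HTTP (hence `import requests` above).
url = "https://example.com/person.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)

# Standard LLaVA prompt format; the exact template for this model may differ.
prompt = "USER: <image>\nPlease describe this person in detail. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

If the repository ships a chat template, building the prompt with `processor.apply_chat_template` would be more robust than hard-coding the template above.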