Visual Question Answering · Transformers · Safetensors · llava · image-text-to-text · AIGC · LLaVA · Inference Endpoints
ddw2AIGROUP2CQUPT committed · verified · Commit bee6755 · 1 Parent(s): 9352092

Update README.md

Files changed (1)
  1. README.md +4 -7
README.md CHANGED
@@ -8,6 +8,10 @@ tags:
---
# Human-LLaVA-(HumanCaption-10M dataset)

+ ## DEMO
+
+ <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
+
### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of a wide range of downstream visual-language understanding tasks. However, models trained on general-domain data often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, with the aim of building a unified multimodal vision-language model for human-related tasks.
@@ -15,18 +19,11 @@ Human-related vision and language tasks are widely applied across various social
Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, and use them to fine-tune the large language model. Lastly, we evaluate our model on a series of downstream tasks; Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it achieves the best results on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.


- ## DEMO
-
- <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
-
-
## Result




-
-
## How to Use
``` python
import requests
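
The "How to Use" snippet is truncated in the hunk above, so only `import requests` is visible. As a minimal sketch of how a LLaVA-style checkpoint is typically loaded with the `transformers` API, the following may help; the repository id, prompt template, and test image URL are illustrative assumptions, not taken from this commit.

``` python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# NOTE: the repository id below is an assumption for illustration only;
# use the actual Human-LLaVA model id from the Hugging Face Hub.
model_id = "ddw2AIGROUP2CQUPT/HumanLLaVA"

# Load the checkpoint and its processor (half precision, auto device placement).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The prompt template is an assumption; follow the template shipped with the model.
prompt = "USER: <image>\nPlease describe this picture.\nASSISTANT:"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess, generate, and decode a response.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```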