---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
library_name: transformers
tags:
- AIGC
- LLaVA
datasets:
- OpenFace-CQUPT/FaceCaption-15M
metrics:
- accuracy
pipeline_tag: visual-question-answering
---

# Human-LLaVA-8B

## DEMO

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1xS27bvECvGTKntvOa1SQ.png)

### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can improve performance on a range of downstream vision-language understanding tasks. However, general-domain models often perform poorly in specialized fields. In this study, we train a domain-specific large language-vision model, Human-LLaVA, which aims to provide a unified multimodal language-vision model for human-related tasks.

Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct **multi-granularity captions for human-related images** (Coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks. **Human-LLaVA** achieved the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.

## Result

Human-LLaVA performs well in both general and human-related domains.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/X-712oVUBPXbfLcAz83fb.png)

## News and Update 🔥🔥🔥

* Oct. 23, 2024. **🤗[HumanCaption-HQ-311K](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) is released!**
* Sep. 12, 2024. **🤗[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released!**
* Sep. 8, 2024. **🤗[HumanVLM](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!**


## 🤗 Transformers

To run inference with Human-LLaVA, a few lines of code are enough, as demonstrated below. Please make sure that you are using the latest version of the code.

``` python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "OpenFace-CQUPT/Human_LLaVA"
cuda = 0  # index of the GPU to run on

# Load the model in half precision and move it to the GPU.
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(cuda)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt; <image> marks where the image is placed in the conversation.
text = "Please describe this picture"
prompt = "USER: <image>\n" + text + "\nASSISTANT:"

image_file = "./test1.jpg"
raw_image = Image.open(image_file)
# For a remote image, set image_file to its URL and use this line instead:
# raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(cuda, torch.float16)

# Greedy decoding, capped at 400 newly generated tokens.
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0], skip_special_tokens=True)
print(predict)
```
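
The string decoded in the example above typically contains the prompt as well as the generated reply, because `generate` returns the input tokens followed by the new ones. The sketch below shows one way to build the prompt and keep only the reply, assuming the `USER: ... ASSISTANT:` template used above. `build_prompt` and `extract_answer` are hypothetical helper names introduced here for illustration; they are not part of Human-LLaVA or `transformers`.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template used in the example above."""
    return "USER: <image>\n" + question.strip() + "\nASSISTANT:"


def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole string if the marker is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()
```

With these helpers, `prompt = build_prompt(text)` could replace the manual string concatenation, and `extract_answer(predict)` would return only the model's reply.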

Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B).

## Get the Dataset

#### Dataset Example

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/-gTV7ym_gmNmJqNRDzlCx.png)

#### Domain Alignment Stage

[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed): released.

#### Instruction Tuning Stage

**All public datasets have been filtered, and we will consider publishing all of the processed text in the future.**

[HumanCaption-HQ](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) (self-constructed): released.

[FaceCaptionA](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M) (self-constructed): released.

CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md

LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh

verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_shikra: https://github.com/shikras/shikra

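
The datasets above that are hosted on the Hugging Face Hub can be inspected without downloading them in full. The following is a minimal sketch using the streaming mode of the `datasets` library. `peek_dataset` is a hypothetical helper, the `"train"` split name is an assumption, and the column names of each dataset are not assumed here. Some repositories may also require logging in or accepting their terms before they can be read.

```python
from itertools import islice


def peek_dataset(repo_id, n=3, split="train", loader=None):
    """Return the first n records of a Hub dataset, reading them as a stream."""
    if loader is None:
        # Imported lazily so the helper can be defined without `datasets` installed.
        from datasets import load_dataset as loader
    # streaming=True yields records one at a time instead of downloading everything.
    stream = loader(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

For example, `peek_dataset("OpenFace-CQUPT/HumanCaption-HQ-311K")` would return up to three records as dictionaries, whose keys show which fields the dataset provides. The optional `loader` argument exists only so the helper can be exercised with a stand-in function when no network is available.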
## Citation

```
@misc{dai2024humanvlmfoundationhumanscenevisionlanguage,
      title={HumanVLM: Foundation for Human-Scene Vision-Language Model},
      author={Dawei Dai and Xu Long and Li Yutang and Zhang Yuanhui and Shuyin Xia},
      year={2024},
      eprint={2411.03034},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2411.03034},
}
```

## Contact

mailto: [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])