---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
library_name: transformers
tags:
- AIGC
- LLaVA
datasets:
- OpenFace-CQUPT/FaceCaption-15M
metrics:
- accuracy
pipeline_tag: visual-question-answering
---
# Human-LLaVA-8B
## DEMO
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1xS27bvECvGTKntvOa1SQ.png)
### Introduction
Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of various downstream tasks in vision-language understanding. However, general-domain models often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to provide a unified multimodal vision-language model for human-related tasks.
Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also construct **multi-granularity captions for human-related images** (Coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Finally, we evaluate our model on a series of downstream tasks, where **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale. In particular, it performs best on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the series of datasets presented in this work can promote research in related fields.
## Result
Human-LLaVA performs well in both general and specialized domains.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/X-712oVUBPXbfLcAz83fb.png)
## News and Updates 🔥🔥🔥
* Oct. 23, 2024. **🤗 [HumanCaption-HQ-311K](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) is released!**
* Sep. 12, 2024. **🤗 [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released!**
* Sep. 8, 2024. **🤗 [HumanVLM](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released!**
## 🤗 Transformers
To run inference with Human-LLaVA, only a few lines of code are needed, as shown below. Please make sure that you are using the latest code.
``` python
import requests  # only needed for the commented-out remote-image line below
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "OpenFace-CQUPT/Human_LLaVA"
cuda = 0  # index of the GPU to run on

# Load the model in half precision and move it to the GPU
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(cuda)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt; <image> marks where the image is inserted
text = "Please describe this picture"
prompt = "USER: <image>\n" + text + "\nASSISTANT:"

# Load a local image (or fetch one over HTTP with the commented-out line)
image_file = "./test1.jpg"
raw_image = Image.open(image_file)
# raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(cuda, torch.float16)
# Greedy decoding, generating at most 400 new tokens
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0], skip_special_tokens=True)
print(predict)
```
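The prompt in the snippet above follows a single-turn `USER: <image>\n...\nASSISTANT:` template. As a minimal sketch, the two helpers below wrap a question in that template and pull the reply out of the decoded text. The names `build_prompt` and `extract_answer` are hypothetical and not part of this repository, and the extraction assumes that the decoded string echoes the prompt, so the reply is whatever follows the last `ASSISTANT:` marker.

``` python
ASSISTANT_TAG = "ASSISTANT:"


def build_prompt(question: str) -> str:
    """Wrap a question in the single-turn template used in the snippet above."""
    return "USER: <image>\n" + question.strip() + "\n" + ASSISTANT_TAG


def extract_answer(decoded: str) -> str:
    """Return the text after the last ASSISTANT: marker.

    Assumption: the decoded output contains the prompt followed by the reply.
    If the marker is missing, the whole string is returned (stripped).
    """
    return decoded.rpartition(ASSISTANT_TAG)[2].strip()
```

With these helpers, `prompt = build_prompt(text)` can replace the manual string concatenation, and `print(extract_answer(predict))` prints only the model's reply.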
Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B).
## Get the Dataset
#### Dataset Example
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/-gTV7ym_gmNmJqNRDzlCx.png)
#### Domain Alignment Stage
* [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed): released.
#### Instruction Tuning Stage
**All public datasets have been filtered; we will consider publishing all of the processed text in the future.**

* [HumanCaption-HQ](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) (self-constructed): released.
* [FaceCaptionA](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M) (self-constructed): released.
* CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
* ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
* LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh
* verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
* verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json
* verified_shikra: https://github.com/shikras/shikra
## Citation
```
@misc{dai2024humanvlmfoundationhumanscenevisionlanguage,
title={HumanVLM: Foundation for Human-Scene Vision-Language Model},
author={Dawei Dai and Xu Long and Li Yutang and Zhang Yuanhui and Shuyin Xia},
year={2024},
eprint={2411.03034},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2411.03034},
}
```
## Contact
mailto: [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])