|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- omkarthawakar/VRC-Bench |
|
- Xkev/LLaVA-CoT-100k |
|
language: |
|
- en |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
pipeline_tag: question-answering |
|
--- |
|
|
|
|
|
## LlamaV-o1 |
|
|
|
<center><img src="logo2.png" alt="LlamaV-o1 logo" width="250"/></center> |
|
|
|
## Overview |
|
**LlamaV-o1** is an advanced multimodal large language model (LLM) designed for complex visual reasoning tasks. |
|
Built on a foundation of cutting-edge curriculum learning and optimized with techniques like Beam Search, |
|
LlamaV-o1 demonstrates exceptional performance across diverse benchmarks. |
|
It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, |
|
mathematical reasoning, social and cultural contexts, medical imaging, and document understanding. |
|
|
|
The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, |
|
LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research |
|
and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, |
|
LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios. |
|
|
|
### Key Features: |
|
- **Model Size:** 11 billion parameters. |
|
- **Architecture:** Based on the Llama (Large Language Model Architecture) family. |
|
- **Fine-Tuning:** Enhanced for instruction-following, chain-of-thought reasoning, and robust generalization across tasks. |
|
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more. |
|
--- |
|
## Model Details |
|
- **Developed By:** MBZUAI |
|
- **Model Version:** v0.1 |
|
- **Release Date:** 13th January 2025 |
|
- **Training Dataset:** Diverse multilingual corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora. |
|
- **Framework:** Pytorch |
|
--- |
|
|
|
## Intended Use |
|
**LlamaV-o1** is designed for a wide range of NLP tasks, including but not limited to: |
|
- Text Generation |
|
- Sentiment Analysis |
|
- Text Summarization |
|
- Question Answering |
|
- Chain-of-Thought Reasoning |
|
|
|
### Out-of-Scope Use |
|
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm. |
|
--- |
|
|
|
## Training Procedure |
|
- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications. |
|
- **Optimizations:** Includes inference scaling optimizations to balance performance and computational efficiency. |
|
--- |
|
## Evaluation |
|
|
|
### Benchmarks |
|
LlamaV-o1 has been evaluated on a suite of benchmark tasks: |
|
- **Reasoning:** [VCR-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench) |
|
|
|
|
|
### Limitations |
|
While the model performs well on a broad range of tasks, it may struggle with: |
|
- Highly technical, domain-specific knowledge outside the training corpus. |
|
- Generating accurate outputs for ambiguous or adversarial prompts. |
|
--- |
|
## Usage |
|
```python |
|
from transformers import MllamaForConditionalGeneration, AutoProcessor |
|
|
|
model_id = "omkarthawakar/LlamaV-o1" |
|
|
|
model = MllamaForConditionalGeneration.from_pretrained( |
|
model_id, |
|
torch_dtype=torch.bfloat16, |
|
device_map="auto", |
|
) |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
``` |
|
|
|
Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference. |
|
|
|
### Results |
|
**Table 1:** Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive against the closed-source models. |
|
|
|
| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** | |
|
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------| |
|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** | |
|
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** | |
|
--- |
|
|
|
### Training Data |
|
|
|
LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). |
|
We have formatted training sample for multi-step reasoning. |
|
|
|
### Training Procedure |
|
|
|
LlamaV-o1 model is finetuned on [llama-recipes](https://github.com/Meta-Llama/llama-recipes). |
|
Detailed Training procedure will be coming soon! |
|
|
|
### Citation |
|
Coming Soon! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|