|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- omkarthawakar/VRC-Bench |
|
- Xkev/LLaVA-CoT-100k |
|
language: |
|
- en |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
pipeline_tag: question-answering |
|
--- |
|
|
|
|
|
## LlamaV-o1 |
|
|
|
<center><img src="logo2.png" alt="LlamaV-o1 logo" width="150"/></center> |
|
|
|
## Overview |
|
**LlamaV-o1** is an advanced multimodal large language model (LLM) designed for complex visual reasoning tasks. |
|
Built on a foundation of cutting-edge curriculum learning and optimized with techniques like Beam Search, |
|
LlamaV-o1 demonstrates exceptional performance across diverse benchmarks. |
|
It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, |
|
mathematical reasoning, social and cultural contexts, medical imaging, and document understanding. |
|
|
|
The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, |
|
LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research |
|
and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, |
|
LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios. |
|
|
|
### Key Features: |
|
- **Model Size:** 11 billion parameters. |
|
- **Architecture:** Built on the Llama 3.2 Vision architecture ([meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)).
|
- **Fine-Tuning:** Enhanced for instruction-following, chain-of-thought reasoning, and robust generalization across tasks. |
|
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more. |
|
--- |
|
## Model Details |
|
- **Developed By:** MBZUAI |
|
- **Model Version:** v0.1 |
|
- **Release Date:** 13th January 2025 |
|
- **Training Dataset:** [LLaVA-CoT-100k](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k), reformatted for multi-step reasoning (see the Training Data section below).
|
- **Framework:** PyTorch
|
--- |
|
|
|
## Intended Use |
|
**LlamaV-o1** is designed for a wide range of multimodal reasoning and NLP tasks, including but not limited to:

- Visual Question Answering

- Text Generation

- Sentiment Analysis

- Text Summarization

- Question Answering

- Chain-of-Thought Reasoning
|
|
|
### Out-of-Scope Use |
|
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm. |
|
--- |
|
|
|
## Training Procedure |
|
- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications. |
|
- **Optimizations:** Includes inference-time scaling optimizations, such as Beam Search, to balance performance and computational efficiency.
|
--- |
|
## Evaluation |
|
|
|
### Benchmarks |
|
LlamaV-o1 has been evaluated on a suite of benchmark tasks: |
|
- **Reasoning:** [VRC-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench) |
|
|
|
|
|
### Limitations |
|
While the model performs well on a broad range of tasks, it may struggle with: |
|
- Highly technical, domain-specific knowledge outside the training corpus. |
|
- Generating accurate outputs for ambiguous or adversarial prompts. |
|
--- |
|
## Usage |
|
```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and let Accelerate place it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```
|
|
|
Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference. |
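
For a quick illustration, here is a minimal inference sketch that reuses the model and processor loaded above. The image path, question, and generation settings are placeholders, and beam search stands in for the inference-time scaling described earlier; the exact prompting strategy behind the reported results is in llamav-o1.py.

```python
from PIL import Image

# Placeholder inputs for this sketch; adapt to your own image and question.
image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Answer step by step: what is happening in this image?"},
    ]},
]

# Build the chat prompt, then tokenize the text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Beam search (num_beams chosen arbitrarily here) trades extra compute for
# more consistent reasoning traces, a simple form of inference-time scaling.
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(processor.decode(output[0], skip_special_tokens=True))
```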
|
|
|
### Results |
|
**Table 1:** Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive against the closed-source models. |
|
|
|
| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** | |
|
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------| |
|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** | |
|
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** | |
|
--- |
|
|
|
### Training Data |
|
|
|
LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). |
|
We have reformatted the training samples for multi-step reasoning; a sketch of this formatting step is shown below.
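
As an illustration, the sketch below shows one way such a reformatting step can look, assuming the stage-tagged responses published with LLaVA-CoT-100k (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>` blocks). The helper name and output layout are hypothetical, and the actual preprocessing used to train LlamaV-o1 may differ.

```python
import json
import re

# Stage tags used in LLaVA-CoT-100k responses; the split below is a hypothetical
# illustration, not the exact preprocessing used for LlamaV-o1.
STAGE_TAGS = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def split_into_steps(response: str) -> list:
    """Split a stage-tagged response into an ordered list of reasoning steps."""
    steps = []
    for tag in STAGE_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            steps.append({"step": tag.lower(), "text": match.group(1).strip()})
    return steps

example_response = (
    "<SUMMARY>Plan the solution.</SUMMARY>"
    "<CAPTION>Describe the relevant parts of the image.</CAPTION>"
    "<REASONING>Work through the question step by step.</REASONING>"
    "<CONCLUSION>State the final answer.</CONCLUSION>"
)
print(json.dumps(split_into_steps(example_response), indent=2))
```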
|
|
|
### Training Procedure |
|
|
|
LlamaV-o1 is fine-tuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes).

A detailed training procedure will be released soon.
|
|
|
### Citation |
|
Coming Soon! |
|
|
|