Overview

This reward model is trained to predict which of two candidate responses to the same prompt a human would prefer. It is intended to serve as the reward signal in a Reinforcement Learning from Human Feedback (RLHF) pipeline.
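
The training objective is not stated on this card. As an illustration only, the sketch below assumes the Bradley-Terry pairwise loss that is commonly used for reward models, where the loss is the negative log-sigmoid of the reward margin between the preferred and the rejected response. The function names are hypothetical and are not part of this model's code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumed objective for illustration; the card does not confirm it.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that the chosen response is preferred."""
    return math.exp(-pairwise_preference_loss(reward_chosen, reward_rejected))


if __name__ == "__main__":
    # A larger margin in favour of the chosen response gives a smaller loss.
    print(pairwise_preference_loss(2.0, 0.5))
    print(pairwise_preference_loss(0.5, 2.0))
    print(preference_probability(2.0, 0.5))
```

Under this assumption only the difference between two scores matters, so scores are best compared across responses to the same prompt and not read as absolute quality values.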

Model Architecture

  • Base Model: Llama3-8B with SFT & DPO
  • Output: Single scalar reward value
  • Parameters: 8B
  • Training Framework: DeepSpeed + TRL

Example Usage

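from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
import torch

device = 'cuda:0'
model_name = "Nagi-ovo/Llama-3-8B-RM"

# Quantize to 4-bit NF4 with bitsandbytes to reduce GPU memory use
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device,  # place the quantized weights on the target GPU
)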

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

SYSTEM_PROMPT = "You are a helpful assistant"

def format_prompt_answer(prompt, answer):
    """Format the input for reward model evaluation"""
    return f"###System: {SYSTEM_PROMPT}\n###Question: {prompt}\n###Answer: {answer}<|end_of_text|>"

def get_reward_score(prompt, answer):
    """Get reward score for a given prompt-answer pair"""
    formatted_input = format_prompt_answer(prompt, answer)
    inputs = tokenizer(formatted_input, return_tensors='pt').to(device)
    
    with torch.no_grad():
        # Pass the full encoding so the attention mask is used as well
        output = model(**inputs).logits
    
    return output.item()

prompt = "How are you?"
answer = "I'm doing great! Thank you for asking. How can I help you today?"
    
score = get_reward_score(prompt, answer)
print(f"Prompt: {prompt}")
print(f"Answer: {answer}")
print(f"Reward Score: {score}")
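
Because the model is trained on preferences between pairs of responses, a natural use is to rank several candidate answers to the same prompt. The sketch below is a minimal, hypothetical helper (`rank_answers` is not part of this repository) that accepts any scoring function with the same signature as `get_reward_score` above. A toy scorer stands in for the real model so that the snippet runs without a GPU.

```python
from typing import Callable, List, Sequence, Tuple


def rank_answers(
    prompt: str,
    answers: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score each answer and return (score, answer) pairs, best first."""
    scored = [(score_fn(prompt, answer), answer) for answer in answers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(prompt: str, answer: str) -> float:
    """Stand-in scorer for illustration only: longer answers score higher."""
    return float(len(answer))


if __name__ == "__main__":
    candidates = [
        "Fine.",
        "I'm doing great! Thank you for asking. How can I help you today?",
    ]
    # With the real model, pass get_reward_score in place of toy_score.
    for score, answer in rank_answers("How are you?", candidates, toy_score):
        print(f"{score:.2f}  {answer}")
```

Calling `rank_answers(prompt, candidates, get_reward_score)` would apply the same ranking with the reward model's scores.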