How to test:

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
    logging,
)

model_name = "RohitSahoo/llama-2-7b-chat-hf-math-ft-V1"


# Load the quantization settings: 4-bit NF4 quantization with float16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load the fine-tuned model with the 4-bit quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    # use the gpu
    device_map={"":0}
)

# Disable the KV cache
model.config.use_cache = False

# Load the tokenizer from the model (llama2)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
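
# Optional sanity check (a minimal sketch): get_memory_footprint() is a standard
# transformers helper; with 4-bit NF4 weights the 7B model should report roughly 4 GB.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")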


text = '''Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to
produce pairs of answers (y1, y2) ∼ πSFT(y | x). These are then presented to human labelers
who express preferences for one answer, denoted as yw ≻ yl | x, where yw and yl denote the
preferred and dispreferred completion amongst (y1, y2) respectively. The preferences are assumed
to be generated by some latent reward model r∗ (y, x), which we do not have access to. There are a
number of approaches used to model preferences, the Bradley-Terry (BT) [5] model being a popular
choice (although more general Plackett-Luce ranking models [30, 21] are also compatible with the
framework if we have access to several ranked answers). The BT model stipulates that the human
preference distribution p∗ can be written as:
p∗(y1 ≻ y2 | x) = exp(r∗(x, y1)) / (exp(r∗(x, y1)) + exp(r∗(x, y2)))    (1)
Assuming access to a static dataset of comparisons D = {(x(i), yw(i), yl(i))} for i = 1, ..., N sampled from p∗, we
can parametrize a reward model rφ (x, y) and estimate the parameters via maximum likelihood.
Framing the problem as a binary classification we have the negative log-likelihood loss:
LR(rφ, D) = −E(x,yw,yl)∼D [log σ(rφ(x, yw) − rφ(x, yl))].

Please explain the math behind this paper by explaining all the variables'''
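
# For intuition on the Bradley-Terry math quoted above, a minimal numeric sketch
# (the reward values below are toy numbers, not produced by any real reward model):
# p∗(y1 ≻ y2 | x) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2),
# and the reward-modelling loss is -log sigmoid(r_w - r_l) per comparison.
r_w, r_l = torch.tensor(1.5), torch.tensor(0.3)   # toy rewards for the preferred / dispreferred answers
p_prefer = torch.sigmoid(r_w - r_l)               # Bradley-Terry preference probability p(yw > yl | x)
nll = -torch.nn.functional.logsigmoid(r_w - r_l)  # per-example negative log-likelihood LR
print(f"p(yw > yl | x) = {p_prefer.item():.3f}, loss = {nll.item():.3f}")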


import time
start = time.time()

logging.set_verbosity(logging.CRITICAL)

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1500)
result = pipe(f"<s>[INST] {text} [/INST]")
print(result[0]['generated_text'])

end = time.time()
print(end - start)
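
If the pipeline wrapper gets in the way, the same test can be run with model.generate directly. This is a minimal sketch (max_new_tokens and greedy decoding are illustrative choices, not tuned settings); the tokenizer adds the leading <s> token itself, so it is not included in the prompt string:

inputs = tokenizer(f"[INST] {text} [/INST]", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))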