If the response length exceeds 4096, is a sliding window used, or is it simply truncated?
#6, opened by ShelterW
step_reward = make_step_rewards(logits, token_masks)
product_step_reward = 1.0
for reward in step_reward:
    product_step_reward *= reward
According to the paper, the score of each candidate response is the product of the scores of its individual steps. If each step score is a probability in [0, 1], the product can only shrink as steps are added. How, then, should responses with fewer steps be weighed against responses with more steps for the same question?
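To make the concern concrete, here is a minimal sketch comparing the plain product with a length-normalized variant. The geometric mean is only my assumption about one possible normalization, not something the paper prescribes, and the reward values are made up for illustration.

```python
import math

def product_score(step_rewards):
    """Product of per-step rewards, as in the loop above."""
    return math.prod(step_rewards)

def geometric_mean_score(step_rewards):
    """Length-normalized alternative (an assumption, not from the paper)."""
    if not step_rewards:
        raise ValueError("need at least one step reward")
    # Work in log space; clamp to avoid log(0) when a step scores exactly 0.
    log_sum = sum(math.log(max(r, 1e-12)) for r in step_rewards)
    return math.exp(log_sum / len(step_rewards))

# Hypothetical rewards: same per-step quality, different step counts.
short_response = [0.9, 0.9]
long_response = [0.9] * 6

print(product_score(short_response), product_score(long_response))
print(geometric_mean_score(short_response), geometric_mean_score(long_response))
```

With the product, the six-step response scores lower than the two-step one even though every step is rated the same; with the geometric mean the two come out equal.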
How can PRM@8 be combined with QwQ to demonstrate the best performance?
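For reference, this is how I currently understand PRM@8: sample eight responses from the policy model (QwQ here) and keep the one the PRM scores highest. The sketch below assumes that reading; `best_of_n` is a hypothetical helper, and the step rewards are assumed to have been computed beforehand, one list of floats per candidate.

```python
import math
from typing import Sequence, Tuple

def best_of_n(
    candidates: Sequence[str],
    step_rewards: Sequence[Sequence[float]],
) -> Tuple[int, str, float]:
    """Return (index, text, score) of the candidate with the highest
    product of step rewards. Hypothetical helper, not a library API."""
    if not candidates or len(candidates) != len(step_rewards):
        raise ValueError("need one non-empty reward list per candidate")
    # Note: math.prod of an empty list is 1.0, so empty reward lists
    # should be filtered out upstream.
    scores = [math.prod(rewards) for rewards in step_rewards]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, candidates[best], scores[best]

# Three made-up candidates for brevity; PRM@8 would use eight.
candidates = ["response A", "response B", "response C"]
rewards = [[0.9, 0.4, 0.8], [0.95, 0.9], [0.7, 0.7, 0.7, 0.7]]

index, text, score = best_of_n(candidates, rewards)
print(index, text, score)
```

If this reading is right, the open question is whether anything beyond plain selection (for example a different aggregation of step scores) is needed to reproduce the reported numbers.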