If the response length exceeds 4096, is a sliding window used, or is it simply truncated?

by ShelterW - opened 1 day ago

step_reward = make_step_rewards(logits, token_masks)

product_step_reward = 1.0
for reward in step_reward:
    product_step_reward *= reward

According to the paper, the score of each candidate response is the product of the scores of its individual steps. Given that, how should responses with fewer steps be compared against responses with more steps for the same question?
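To make the concern concrete, here is a minimal sketch comparing the product with two length-insensitive alternatives. It assumes each step reward is a probability in [0, 1]; the function names and the example reward values are made up for illustration. The geometric mean and the minimum are common aggregation choices, not something I have confirmed the paper uses.

```python
import math


def product_score(step_rewards):
    """Product of step rewards (the aggregation described in the question)."""
    return math.prod(step_rewards)


def geometric_mean_score(step_rewards, eps=1e-12):
    """Length-normalized product: exp(mean(log r)). Hypothetical alternative."""
    if not step_rewards:
        raise ValueError("step_rewards must not be empty")
    # Clamp to eps so a reward of exactly 0 does not hit log(0).
    log_sum = sum(math.log(max(r, eps)) for r in step_rewards)
    return math.exp(log_sum / len(step_rewards))


def min_score(step_rewards):
    """Score a response by its weakest step. Hypothetical alternative."""
    return min(step_rewards)


# Illustrative rewards only: two responses with identical per-step quality.
short_response = [0.9, 0.9]
long_response = [0.9] * 6

for name, fn in [("product", product_score),
                 ("geometric mean", geometric_mean_score),
                 ("min", min_score)]:
    print(f"{name}: short={fn(short_response):.4f} long={fn(long_response):.4f}")
```

With the plain product, the longer response scores lower even though every step is rated the same, which is the bias I am asking about. The other two aggregations score both responses equally.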

How can PRM@8 be combined with QwQ to achieve the best performance?
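For reference, this is how I understand PRM@8: sample 8 candidate responses from the policy model (QwQ here) and keep the one with the highest PRM score. The sketch below is only my assumption of that selection step. The `best_of_n` helper, the candidate structure, and the reward values are all hypothetical placeholders, and no model is actually called.

```python
import math


def best_of_n(candidates, score_fn):
    """Return the candidate whose step rewards give the highest score.

    candidates: list of dicts with "text" and "step_rewards" keys
                (hypothetical structure, for illustration only).
    score_fn:   maps a list of step rewards to a single float.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: score_fn(c["step_rewards"]))


# Placeholder data: in practice the texts would be sampled from QwQ and
# the step rewards would come from make_step_rewards.
candidates = [
    {"text": "response A", "step_rewards": [0.9, 0.8]},
    {"text": "response B", "step_rewards": [0.95, 0.95, 0.95]},
    {"text": "response C", "step_rewards": [0.99, 0.3]},
]

best = best_of_n(candidates, math.prod)
print(best["text"])
```

The scoring function is passed in as a parameter so the product can be swapped for another aggregation without changing the selection logic.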
