PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Abstract
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during reasoning, PRMs need nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness and fail to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can serve as a robust benchmark for advancing research on PRM evaluation and development.
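To make the step-level evaluation setup concrete, here is a minimal sketch of how a PRM's per-step judgments could be scored against gold step labels. This is an illustration, not PRMBench's official protocol or metric; the `gold_labels`, `prm_scores`, and the 0.5 threshold are hypothetical values invented for the example.

```python
# Minimal sketch of step-level error detection scoring (illustrative only,
# not PRMBench's official metric). A PRM assigns a score to every
# intermediate reasoning step; we threshold those scores into
# correct/incorrect predictions and compare them against gold step labels.
from sklearn.metrics import f1_score

# Hypothetical example: one solution with five reasoning steps.
gold_labels = [1, 1, 0, 1, 0]                 # 1 = step is correct, 0 = step contains an error
prm_scores = [0.92, 0.85, 0.40, 0.77, 0.55]   # per-step scores from some PRM

threshold = 0.5                               # assumed decision threshold
pred_labels = [1 if s >= threshold else 0 for s in prm_scores]

# Step-level F1 on the "erroneous step" class (label 0 treated as positive).
f1_err = f1_score(gold_labels, pred_labels, pos_label=0)
print(f"step-level error-detection F1: {f1_err:.3f}")
```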
Community
Is your Process-Level Reward Model really good? 🤔 We're thrilled to release PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models! This new resource offers a deeper dive into PRM evaluation.
Our code, arXiv paper, data, and project page are released:
🌐 project page: https://prmbench.github.io/
💻 code: https://github.com/ssmisya/PRMBench
📄 paper: https://arxiv.org/abs/2501.03124
📊 data: https://huggingface.co/datasets/hitsmy/PRMBench_Preview
Also, our PRM Eval Toolkit codebase supports evaluating different kinds of PRMs on custom tasks, providing a universal PRM evaluation harness. You're welcome to try it out 🤗!
PRM Eval Toolkit Documentation: https://github.com/ssmisya/PRMBench/blob/main/docs/document.md
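If you just want to peek at the released data before wiring up the full toolkit, a minimal sketch using the Hugging Face `datasets` library is shown below. The dataset id comes from the data link above; the split name and field access are assumptions about the preview dataset's layout, so check the dataset card for the actual schema.

```python
# Minimal sketch: load the PRMBench preview data from the Hugging Face Hub.
# The dataset id comes from the link above; split and field names are
# assumptions -- consult the dataset card for the actual schema.
from datasets import load_dataset

dataset = load_dataset("hitsmy/PRMBench_Preview")

# Inspect what splits and columns are actually available.
print(dataset)                        # shows splits and their sizes
first_split = next(iter(dataset))     # e.g. "train", depending on the release
example = dataset[first_split][0]
print(example.keys())                 # fields of one problem, e.g. steps and labels
```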
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ProcessBench: Identifying Process Errors in Mathematical Reasoning (2024)
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment (2024)
- VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning (2024)
- Outcome-Refining Process Supervision for Code Generation (2024)
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation (2024)
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS (2024)
- Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions (2024)