---
title: VARCO Arena
emoji: 🔥
colorFrom: pink
colorTo: yellow
sdk: streamlit
sdk_version: 1.40.2
app_file: app.py
pinned: false
license: cc-by-4.0
short_description: VARCO Arena is a reference-free LLM benchmarking approach
---
# Varco Arena
Varco Arena conducts a tournament between the models under comparison for each instruction in the test set, ranking the models accurately at an affordable cost. This is more accurate and cost-effective than estimating win rates by comparing each model's output against a reference output.

For more information, the following resources may help you understand how it works.
* [Paper](https://huggingface.co/papers/2411.01281)
* [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
## Quickstart

### Running the web demo locally (Streamlit, recommended!)
```bash
git clone [THIS_REPO]
# install the requirements listed below; we recommend miniforge for managing the environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`.
### CLI use
* the CLI code is located at `varco_arena/`
* debug configurations for VSCode are at `varco_arena/.vscode`
```bash
## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# debug lines
## openai api judge debug
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge debug (check for errors without making API requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
We tested on a `python = 3.11.9` environment. Dependencies are listed in `requirements.txt`:
```
openai>=1.17.0
munch
pandas
numpy
tqdm>=4.48.0
plotly
scikit-learn
kaleido
tiktoken>=0.7.0
pyyaml
transformers
streamlit>=1.40.2
openpyxl
fire==0.6.0
git+https://github.com/shobrook/openlimit.git#egg=openlimit # do not install this from PyPI
# Linux
uvloop
# Windows
winloop
```
#### Arguments
- -i, --input : directory path containing the input jsonlines files (LLM outputs)
- -o, --output_dir : directory where the results will be written
- -e, --evaluation : judge model specification (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", \[vllm-served-model-name\])
- -k, --openai_api_key : OpenAI API key
- -u, --openai_url : URL of an OpenAI-compatible LLM server (accessed via the OpenAI SDK)
#### Advanced
- -j, --n_jobs : number of concurrent jobs passed to `asyncio.Semaphore()` (see the sketch after this list)
- -p, --evalprompt : [see the prompts directory](./varco_arena/prompts/*.yaml)
- -lr, --limit_requests : vLLM OpenAI server request limit (default: 7,680)
- -lt, --limit_tokens : vLLM OpenAI server token limit (default: 15,728,640)
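
To illustrate what `--n_jobs` controls, here is a minimal, self-contained sketch of semaphore-bounded concurrency; the function names and the dummy API call are illustrative, not the project's actual internals:

```python
import asyncio


async def judge_one(match_id: int, sem: asyncio.Semaphore) -> str:
    # Each judge request acquires the semaphore, so at most n_jobs
    # requests are in flight at any time.
    async with sem:
        await asyncio.sleep(0.1)  # stand-in for an OpenAI-compatible API call
        return f"match {match_id}: judged"


async def run_all(n_matches: int, n_jobs: int) -> list[str]:
    sem = asyncio.Semaphore(n_jobs)  # conceptually what -j / --n_jobs caps
    return await asyncio.gather(*(judge_one(i, sem) for i in range(n_matches)))


if __name__ == "__main__":
    results = asyncio.run(run_all(n_matches=16, n_jobs=4))
    print(len(results), "matches judged")
```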
#### Input Data Format
[input jsonl guides](./streamlit_app_local/guide_mds/input_jsonls_en.md)
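
As a rough illustration only (the linked guide is the authoritative spec), the snippet below writes one `.jsonl` file whose records carry the `instruction`, `source`, and `generated` fields mentioned in the FAQ; the directory, file name, one-file-per-model layout, and record contents are assumptions for this sketch:

```python
import json
from pathlib import Path

# Hypothetical records; field names follow the FAQ (instruction / source / generated).
records = [
    {
        "instruction": "Summarize the passage in one sentence.",
        "source": "The quick brown fox jumps over the lazy dog.",
        "generated": "A fox jumps over a dog.",
    },
    {
        "instruction": "Translate to French: Good morning.",
        "source": "",
        "generated": "Bonjour.",
    },
]

# Assumption for this sketch: one .jsonl file per model inside the --input directory.
out_dir = Path("./some/dirpath/to/jsonl/files")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "model_A.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```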
## Contributing & Customizing
#### Do this after git clone and installation
```bash
pip install pre-commit
pre-commit install
```
#### Before commit
```bash
bash precommit.sh # the black formatter will reformat the code
```
## FAQ
* I want to apply my custom judge prompt to run Varco Arena.
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines each prompt with a `yaml` file and a corresponding class object. Edit these as needed.
* I want tailored judge prompts for each row of the test set (e.g. rows up to the 100th use `prompt1`, rows from the 101st use `prompt2`).
  * As shown at the link above, `load_prompt` receives `promptname` + `task` as parameters to load a prompt. The function is called in [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py). See the sketch after this list.
* I want more fields in my LLM output jsonl files for tailored use, i.e. fields beyond `instruction`, `source`, `generated`.
  * It gets a bit tricky, but here is a brief guide:
    * You will likely have to edit `varco_arena/eval_utils.py`:`async_eval_w_prompt` (the part that calls `PROMPT_OBJ.complete_prompt()`),
    * and all related code will require revision.
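
For the per-row prompt question above, one possible approach is to choose a `promptname` from the row index and load it through `load_prompt` at the point where `manager.py:async_run` calls it. The helper below is only a hedged sketch: the import path, call signature, and prompt names are assumptions to be checked against the actual code.

```python
# Sketch only: the import path and call signature are assumptions, not verified against the repo.
from prompts import load_prompt  # defined in ./varco_arena/prompts/__init__.py


def promptname_for_row(row_idx: int) -> str:
    # Illustrative rule: rows up to the 100th use "prompt1", later rows use "prompt2".
    return "prompt1" if row_idx <= 100 else "prompt2"


def load_prompt_for_row(row_idx: int, task: str):
    # load_prompt(promptname, task) per the FAQ above; wire this in where
    # manager.py:async_run currently loads its prompt.
    return load_prompt(promptname_for_row(row_idx), task)
```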
## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
  - query wrapper
  - rag prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
  - overall prototyping of the system in haste
## Citation
If you found our work helpful, consider citing our paper!
```
@misc{son2024varcoarenatournamentapproach,
      title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}
```