|
from dataclasses import dataclass |
|
from enum import Enum |
|
|
|
|
|
@dataclass
class Task:
    benchmark: str  # task ID in the CzechBench evaluation harness
    metric: str  # metric reported on the leaderboard
    col_name: str  # column header shown in the leaderboard table
|
|
|
|
|
|
|
class Tasks(Enum): |
|
|
|
task0 = Task("agree_cs", "accuracy", "agree_cs") |
|
task1 = Task("anli_cs", "accuracy", "anli_cs") |
|
task2 = Task("arc_challenge_cs", "accuracy", "arc_challenge_cs") |
|
task3 = Task("arc_easy_cs", "accuracy", "arc_easy_cs") |
|
task4 = Task("belebele_cs", "accuracy", "belebele_cs") |
|
task5 = Task("ctkfacts_cs", "accuracy", "ctkfacts_cs") |
|
task6 = Task("czechnews_cs", "accuracy", "czechnews_cs") |
|
task7 = Task("fb_comments_cs", "accuracy", "fb_comments_cs") |
|
task8 = Task("gsm8k_cs", "accuracy", "gsm8k_cs") |
|
task9 = Task("klokanek_cs", "accuracy", "klokanek_cs") |
|
task10 = Task("mall_reviews_cs", "accuracy", "mall_reviews_cs") |
|
task11 = Task("mmlu_cs", "accuracy", "mmlu_cs") |
|
task12 = Task("sqad_cs", "accuracy", "sqad_cs") |
|
task13 = Task("subjectivity_cs", "accuracy", "subjectivity_cs") |
|
task14 = Task("truthfulqa_cs", "accuracy", "truthfulqa_cs") |
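
# Illustrative usage (EXAMPLE_COLS is a hypothetical name, not used elsewhere):
# iterating the enum yields each Task; col_name is the leaderboard column header.
EXAMPLE_COLS = [task.value.col_name for task in Tasks]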
|
|
|
|
|
TITLE = """<h1 align="center" id="space-title">🇨🇿 CzechBench Leaderboard</h1>""" |
|
|
|
TABLE_DESC = "The values shown in the leaderboard table are accuracy scores in percent."
|
|
|
|
|
|
|
|
|
|
INTRODUCTION_TEXT = """
|
The goal of the CzechBench project is to provide a comprehensive, practical benchmark for evaluating large language models in Czech.
|
Our [evaluation suite](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme) |
|
currently consists of 15 individual tasks, leveraging pre-existing Czech datasets together with new machine translations of popular LLM benchmarks, |
|
including ARC, GSM8K, MMLU, and TruthfulQA. This work is brought to you by CIIRC CTU and VSB Ostrava. |
|
|
|
Key Features and Benefits: |
|
- **Tailored for the Czech Language:** |
|
CzechBench includes both original Czech datasets and adapted versions of international datasets, ensuring relevant evaluation of model performance in the Czech context. |
|
- **Wide Range of Tasks:** |
|
It contains 15 different tasks that cover various aspects of language understanding and text generation, enabling a comprehensive assessment of the model's capabilities. |
|
- **Bilingual Performance Analysis:**
|
CzechBench also offers a parallel collection of 9 English tasks corresponding to the Czech versions included in the main suite. |
|
This allows for direct comparison of model performance across both languages with equivalent conditions in terms of prompt formulation and few-shot example selection. |
|
- **Universal Model Support:**
|
The universal text-to-text evaluation approach adopted in CzechBench allows for direct comparison of models with varying levels of internal access, including commercial APIs. |
|
- **Ease of Use:** |
|
The benchmark is built upon a commonly used evaluation framework with wide support for state-of-the-art models and inference acceleration tools. |
|
- **Empowering Decisions:**
|
Whether you are a business looking for the best LLM to build your application on, or a research team trying to maximize the capabilities of the models you are developing,

CzechBench will help you gain insight into the particular strengths and weaknesses of individual models and focus on the key areas for optimization.
|
|
|
Below, you can find the up-to-date leaderboard of models evaluated on CzechBench.
|
For more information on the included benchmarks and instructions on evaluating your own models, please visit the "About" section below. |
|
""" |
|
|
|
|
|
|
|
LLM_BENCHMARKS_TEXT = """
|
## Basic Information |
|
|
|
The CzechBench evaluation suite is hosted on [GitHub](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme). |
|
It is implemented on top of the popular [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework, which provides extensive model compatibility and optimal evaluation efficiency. |
|
|
|
All currently supported benchmarks are listed in the table below: |
|
|
|
| Dataset | Language | Task type | Metrics | Samples | Task ID | |
|
| ------------------------------------------------------------ | ----------------------------- | -------------------------- | -------------- | ------: | --------------- | |
|
| [AGREE](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/agree_cs) | CS (Original) | Subject-Verb Agreement | Acc | 627 | agree_cs |
|
| [ANLI](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/anli_cs) | CS (Translated) | Natural Language Inference | Acc, Macro F1 | 1200 | anli_cs | |
|
| [ARC Challenge](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/arc_cs) | CS (Translated) | Knowledge-Based QA | Acc | 1172 | arc_challenge_cs |
|
| [ARC Easy](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/arc_cs) | CS (Translated) | Knowledge-Based QA | Acc | 2376 | arc_easy_cs |
|
| [Belebele](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/belebele_cs) | CS (Professional translation) | Reading Comprehension / QA | Acc | 895 | belebele_cs | |
|
| [CTKFacts](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/ctkfacts_cs) | CS (Original) | Natural Language Inference | Acc, Macro F1 | 558 | ctkfacts_cs | |
|
| [Czech News](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/czechnews_cs) | CS (Original) | News Topic Classification | Acc, Macro F1 | 1000 | czechnews_cs | |
|
| [Facebook Comments](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/fb_comments_cs) | CS (Original) | Sentiment Analysis | Acc, Macro F1 | 1000 | fb_comments_cs | |
|
| [GSM8K](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/gsm8k_cs) | CS (Translated) | Mathematical Inference | EM Acc | 1319 | gsm8k_cs |
|
| [Klokánek](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/klokanek_cs) | CS (Original) | Math/Logical Inference | Acc | 808 | klokanek_cs | |
|
| [Mall Reviews](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/mall_reviews_cs) | CS (Original) | Sentiment Analysis | Acc, Macro F1 | 3000 | mall_reviews_cs | |
|
| [MMLU](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/mmlu_cs) | CS (Translated) | Knowledge-Based QA | Acc | 12408 | mmlu_cs | |
|
| [SQAD](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/sqad_cs) | CS (Original) | Reading Comprehension / QA | EM Acc, BoW F1 | 843 | sqad_cs | |
|
| [Subjectivity](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/subjectivity_cs) | CS (Original) | Subjectivity Analysis | Acc, Macro F1 | 2000 | subjectivity_cs | |
|
| [TruthfulQA](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench/truthfulqa_cs) | CS (Translated) | Knowledge-Based QA | Acc | 813 | truthfulqa_cs | |
|
|
|
The leaderboard table also displays aggregated scores across task categories (a short sketch of the averaging scheme follows the list):
|
- **Grammar (Avg.):** AGREE |
|
- **Knowledge (Avg.):** ARC-Challenge, ARC-Easy, MMLU, TruthfulQA |
|
- **Reasoning (Avg.):** ANLI, Belebele, CTKFacts, SQAD |
|
- **Math (Avg.):** GSM8K, Klokánek
|
- **Classification (Avg.):** Czech News, Facebook Comments, Mall Reviews, Subjectivity |
|
- **Aggregate Score:** Average over the above categories
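
The aggregation can be sketched as follows (illustrative Python, not the leaderboard's actual implementation; `scores` is a hypothetical mapping from task IDs to accuracies in percent):

```
CATEGORIES = {
    "Grammar": ["agree_cs"],
    "Knowledge": ["arc_challenge_cs", "arc_easy_cs", "mmlu_cs", "truthfulqa_cs"],
    "Reasoning": ["anli_cs", "belebele_cs", "ctkfacts_cs", "sqad_cs"],
    "Math": ["gsm8k_cs", "klokanek_cs"],
    "Classification": ["czechnews_cs", "fb_comments_cs", "mall_reviews_cs", "subjectivity_cs"],
}

def aggregate_scores(scores):
    # Average the per-task accuracies within each category...
    avgs = {name: sum(scores[t] for t in tasks) / len(tasks)
            for name, tasks in CATEGORIES.items()}
    # ...then average the five category scores into the final Aggregate Score.
    avgs["Aggregate Score"] = sum(avgs.values()) / len(avgs)
    return avgs
```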
|
|
|
## Evaluation Process |
|
|
|
### 1. Install CzechBench
|
``` |
|
git clone https://github.com/jirkoada/czechbench_eval_harness.git |
|
cd czechbench_eval_harness |
|
pip install -e ".[api]"
|
``` |
|
|
|
### 2. Run evaluation |
|
* `export MODEL=your_model_name`, where `your_model_name` is the Hugging Face path of a public model, e.g. `export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct`

* `export OUTPUT_PATH=my_output_path`, where `my_output_path` is the directory for storing evaluation reports
|
|
|
|
|
Run the following command (you can adjust parameters such as `batch_size` or `device`):
|
``` |
|
lm_eval --model hf \\ |
|
--model_args pretrained=$MODEL \\ |
|
--tasks czechbench_tasks \\ |
|
--device cuda:0 \\ |
|
--batch_size 1 \\ |
|
--write_out \\ |
|
--log_samples \\ |
|
--output_path $OUTPUT_PATH \\ |
|
--apply_chat_template
|
``` |
|
|
|
For advanced usage instructions, please inspect the [CzechBench README on GitHub](https://github.com/jirkoada/czechbench_eval_harness/tree/main/lm_eval/tasks/czechbench#readme) |
|
or the official [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) documentation. |
|
|
|
|
|
### 3. Upload results to Leaderboard |
|
Inside the `$OUTPUT_PATH` directory, you can find the file `results.json`. |
|
To submit your evaluation results to our leaderboard, please visit the "Submit here!" section above and upload your `results.json` file. |
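
Before uploading, you can sanity-check the report with a few lines of Python (a minimal sketch; the exact JSON schema, including the top-level `results` key and the metric names, may vary across harness versions):

```
import json

with open("results.json") as f:
    report = json.load(f)

# Print each evaluated task together with its reported metrics.
for task, metrics in report.get("results", {}).items():
    print(task, metrics)
```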
|
|
|
""" |
|
|
|
EVALUATION_QUEUE_TEXT = """ |
|
|
|
""" |
|
|
|
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" |
|
CITATION_BUTTON_TEXT = r"""@misc{czechbench, |
|
title = {CzechBench Leaderboard}, |
|
author = {Adam Jirkovský and David Adamczyk and Jan Hůla and Jan Šedivý}, |
|
year = {2024}, |
|
url = {https://huggingface.co/spaces/CIIRC-NLP/czechbench_leaderboard}
|
} |
|
|
|
@mastersthesis{jirkovsky-thesis,
|
author = {Jirkovský, Adam}, |
|
title = {Benchmarking Techniques for Evaluation of Large Language Models}, |
|
school = {Czech Technical University in Prague, Faculty of Electrical Engineering}, |
|
year = {2024},
|
url = {https://dspace.cvut.cz/handle/10467/115227}
|
}""" |
|
|