SILMA RAGQA V1.0: A Comprehensive Benchmark for Evaluating LLMs on RAG QA Use-Cases

Community Article · Published December 18, 2024

SILMA RAGQA is a benchmark curated by silma.ai to assess the effectiveness of Arabic and English language models on extractive question answering tasks, with a specific emphasis on RAG applications.

The benchmark includes 17 bilingual datasets in Arabic and English, spanning various domains.


What capabilities does the benchmark test?

  • General Arabic and English QA capabilities
  • Ability to handle short and long contexts
  • Ability to provide short and long answers effectively
  • Ability to answer complex numerical questions
  • Ability to answer questions based on tabular data
  • Multi-hop question answering: ability to answer one question using pieces of data from multiple paragraphs
  • Negative Rejection: ability to recognize when the answer is not present in the provided context and to say so explicitly, e.g. "the answer can't be found in the provided context", rather than fabricating one (see the sketch after this list)
  • Multi-domain: ability to answer questions based on texts from different domains, such as finance and medicine
  • Noise Robustness: ability to handle noisy and ambiguous contexts
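
To make the multi-hop and negative-rejection behaviors concrete, here is an illustrative sketch of two hypothetical benchmark entries. The field names (`question`, `context`, `answer`) and the texts are assumptions for illustration, not the dataset's actual schema.

```python
# Two illustrative QA items (hypothetical -- the field names are
# assumptions, not the benchmark's actual schema).
examples = [
    {
        # Multi-hop QA: answering requires combining facts
        # from more than one paragraph in the context.
        "question": "In which country was the university that employs Dr. Hale founded?",
        "context": "Paragraph A: Dr. Hale works at Alder University. "
                   "Paragraph B: Alder University was founded in Canada in 1911.",
        "answer": "Canada",
    },
    {
        # Negative rejection: the answer is absent from the context,
        # so the expected response is an explicit refusal.
        "question": "What was the company's revenue in 2023?",
        "context": "The annual report covers fiscal years 2020 through 2022.",
        "answer": "The answer can't be found in the provided context.",
    },
]
```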

Data Sources

| Name | Lang | Size (Sampled) | Link | Paper |
|------|------|----------------|------|-------|
| xquad_r | en | 100 | https://huggingface.co/datasets/google-research-datasets/xquad_r/viewer/en | https://arxiv.org/pdf/2004.05484 |
| xquad_r | ar | 100 | https://huggingface.co/datasets/google-research-datasets/xquad_r/viewer/ar | https://arxiv.org/pdf/2004.05484 |
| rag_instruct_benchmark_tester | en | 100 | https://huggingface.co/datasets/llmware/rag_instruct_benchmark_tester | https://medium.com/@darrenoberst/how-accurate-is-rag-8f0706281fd9 |
| covidqa | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/covidqa/test | https://arxiv.org/abs/2407.11005 |
| covidqa | ar | 50 | translated from covidqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| emanual | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/emanual/test | https://arxiv.org/abs/2407.11005 |
| emanual | ar | 50 | translated from emanual_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| msmarco | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/msmarco/test | https://arxiv.org/abs/2407.11005 |
| msmarco | ar | 50 | translated from msmarco_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| hotpotqa | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/hotpotqa/test | https://arxiv.org/abs/2407.11005 |
| expertqa | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/expertqa/test | https://arxiv.org/abs/2407.11005 |
| finqa | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/finqa/test | https://arxiv.org/abs/2407.11005 |
| finqa | ar | 50 | translated from finqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| tatqa | en | 50 | https://huggingface.co/datasets/rungalileo/ragbench/viewer/tatqa/test | https://arxiv.org/abs/2407.11005 |
| tatqa | ar | 50 | translated from tatqa_en using Google Translate | https://arxiv.org/abs/2407.11005 |
| boolq | ar | 100 | https://huggingface.co/datasets/Hennara/boolq_ar | https://arxiv.org/pdf/1905.10044 |
| sciq | ar | 100 | https://huggingface.co/datasets/Hennara/sciq_ar | https://arxiv.org/pdf/1707.06209 |
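
The per-dataset samples above are bundled into a single benchmark dataset on Hugging Face. Below is a minimal loading sketch using the `datasets` library; the dataset id is an assumption, so substitute the id shown on the benchmark page if it differs.

```python
# Minimal sketch: load the benchmark and inspect its splits and columns.
# The dataset id below is an assumption -- check the benchmark page.
from datasets import load_dataset

ds = load_dataset("silma-ai/silma-rag-qa-benchmark-v1.0")
print(ds)  # available splits, column names, and row counts
```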

SLM Evaluations


SILMA Kashif is a new model that will be released in early January 2025.

| Model Name | Benchmark Score |
|------------|-----------------|
| SILMA-9B-Instruct-v1.0 | 0.268 |
| Gemma-2-2b-it | 0.281 |
| Qwen2.5-3B-Instruct | 0.300 |
| Phi-3.5-mini-instruct | 0.301 |
| Gemma-2-9b-it | 0.304 |
| Phi-3-mini-128k-instruct | 0.306 |
| Llama-3.2-3B-Instruct | 0.318 |
| Qwen2.5-7B-Instruct | 0.321 |
| Llama-3.1-8B-Instruct | 0.328 |
| c4ai-command-r7b-12-2024 | 0.330 |
| SILMA-Kashif-2B-v0.1 | 0.357 |

How to evaluate your model?

Follow the steps listed on the benchmark page.
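
As a rough outline of what that evaluation involves, the sketch below runs a generate-then-score loop: it produces one answer per example and compares it to the gold answer with ROUGE via the `evaluate` library. The dataset id, column names, and metric choice are all assumptions for illustration; the official steps on the benchmark page are authoritative.

```python
# Sketch of a generate-then-score evaluation loop. Assumptions:
# the dataset id, the column names, and ROUGE as the metric --
# the official benchmark script may differ on all three.
import evaluate
from datasets import load_dataset

ds = load_dataset("silma-ai/silma-rag-qa-benchmark-v1.0")
rouge = evaluate.load("rouge")

def generate_answer(question: str, context: str) -> str:
    """Hypothetical placeholder: call your model here."""
    raise NotImplementedError

split = next(iter(ds.values()))  # first available split
predictions, references = [], []
for row in split:
    predictions.append(generate_answer(row["question"], row["context"]))
    references.append(row["answer"])

print(rouge.compute(predictions=predictions, references=references))
```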