Lighteval documentation

Metric List

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.7.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Metric List

Automatic metrics for multiple choice tasks

These metrics use log-likelihood of the different possible targets.

  • loglikelihood_acc: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_acc_single_token)
  • loglikelihood_acc_norm: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_acc_norm_single_token)
  • loglikelihood_acc_norm_nospace: Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
  • loglikelihood_f1: Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (loglikelihood_f1_single_token)
  • mcc: Matthew’s correlation coefficient (a measure of agreement between statistical distributions),
  • recall_at_1: Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (recall_at_1_single_token)
  • recall_at_2: Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (recall_at_2_single_token)
  • mrr: Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (mrr_single_token)
  • target_perplexity: Perplexity of the different choices available.
  • acc_golds_likelihood:: A bit different, it actually checks if the average logprob of a single target is above or below 0.5
  • multi_f1_numeric: Loglikelihood F1 score for multiple gold targets

All these metrics also exist in a “single token” version (loglikelihood_acc_single_token, loglikelihood_acc_norm_single_token, loglikelihood_f1_single_token, mcc_single_token, recall@2_single_token and mrr_single_token). When the multichoice option compares only one token (ex: “A” vs “B” vs “C” vs “D”, or “yes” vs “no”), using these metrics in the single token version will divide the time spent by the number of choices. Single token evals also include:

  • multi_f1_numeric: computes the f1 score of all possible choices and averages it.

Automatic metrics for perplexity and language modeling

These metrics use log-likelihood of prompt.

  • word_perplexity: Perplexity (log probability of the input) weighted by the number of words of the sequence.
  • byte_perplexity: Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
  • bits_per_byte: Average number of bits per byte according to model probabilities.
  • log_prob: Predicted output’s average log probability (input’s log prob for language modeling).

Automatic metrics for generative tasks

These metrics need the model to generate an output. They are therefore slower.

  • Base:
    • perfect_exact_match: Fraction of instances where the prediction matches the gold exactly.
    • exact_match: Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a strip has been applied to both).
    • quasi_exact_match: Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, …). Other variations exist, with other normalizers, such as quasi_exact_match_triviaqa, which only normalizes the predictions after applying a strip to all sentences.
    • prefix_exact_match: Fraction of instances where the beginning of the prediction matches the gold at the exception of the border whitespaces (= after a strip has been applied to both).
    • prefix_quasi_exact_match: Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, …)
    • exact_match_indicator: Exact match with some preceding context (before an indicator) removed
    • f1_score_quasi: Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
    • f1_score: Average F1 score in terms of word overlap between the model output and gold without normalisation
    • f1_score_macro: Corpus level macro F1 score
    • f1_score_macro: Corpus level micro F1 score
    • maj_at_5 and maj_at_8: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
  • Summarization:
    • rouge: Average ROUGE score (Lin, 2004)
    • rouge1: Average ROUGE score (Lin, 2004) based on 1-gram overlap.
    • rouge2: Average ROUGE score (Lin, 2004) based on 2-gram overlap.
    • rougeL: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
    • rougeLsum: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
    • rouge_t5 (BigBench): Corpus level ROUGE score for all available ROUGE metrics
    • faithfulness: Faithfulness scores based on the SummaC method of Laban et al. (2022).
    • extractiveness: Reports, based on (Grusky et al., 2018)
      • summarization_coverage: Extent to which the model-generated summaries are extractive fragments from the source document,
      • summarization_density: Extent to which the model-generated summaries are extractive summaries based on the source document,
      • summarization_compression: Extent to which the model-generated summaries are compressed relative to the source document.
    • bert_score: Reports the average BERTScore precision, recall, and f1 score (Zhang et al., 2020) between model generation and gold summary.
    • Translation
    • bleu: Corpus level BLEU score (Papineni et al., 2002) - uses the sacrebleu implementation.
    • bleu_1: Average sample BLEU score (Papineni et al., 2002) based on 1-gram overlap - uses the nltk implementation.
    • bleu_4: Average sample BLEU score (Papineni et al., 2002) based on 4-gram overlap - uses the nltk implementation.
    • chrf: Character n-gram matches f-score.
    • ter: Translation edit/error rate.
  • Copyright
    • copyright: Reports:
      • longest_common_prefix_length: average length of longest common prefix between model generation and reference,
      • edit_distance: average Levenshtein edit distance between model generation and reference,
      • edit_similarity: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
  • Math:
    • quasi_exact_match_math: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed)
    • maj_at_4_math: Majority choice evaluation, using the math normalisation for the predictions and gold
    • quasi_exact_match_gsm8k: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
    • maj_at_8_gsm8k: Majority choice evaluation, using the gsm8k normalisation for the predictions and gold

LLM-as-Judge

  • llm_judge_gpt3p5: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
  • llm_judge_llama_3_405b: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
  • llm_judge_multi_turn_gpt3p5: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API. It is used for multiturn tasks like mt-bench.
  • llm_judge_multi_turn_llama_3_405b: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API. It is used for multiturn tasks like mt-bench.
< > Update on GitHub