---
language: en
license: apache-2.0
base_model: google/flan-t5-base
pipeline_tag: text-classification
---

In Loving memory of Simon Mark Hughes...

HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open), created by [Vectara](https://vectara.com) in November 2023. The HHEM model series is designed to detect hallucinations in LLM output. The models are particularly useful when building retrieval-augmented-generation (RAG) applications, where a set of facts is summarized by an LLM and HHEM can be used to measure the extent to which this summary is factually consistent with the facts.

If you are interested in learning more about RAG or experimenting with Vectara, you can [sign up](https://console.vectara.com/signup/?utm_source=huggingface&utm_medium=space&utm_term=hhem-model&utm_content=console&utm_campaign=) for a free Vectara account.

## Hallucination Detection 101

By "hallucinated" or "factually inconsistent", we mean that a text (the hypothesis, to be judged) is not supported by another text (the evidence/premise, given). You **always need two** pieces of text to determine whether a text is hallucinated or not. When applied to RAG (retrieval augmented generation), the LLM is provided with several pieces of text (often called facts or context) retrieved from some dataset, and a hallucination would indicate that the summary (hypothesis) is not supported by those facts (evidence).

A common type of hallucination in RAG is **factual but hallucinated**.
For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated, even though it is true according to world knowledge. This happens when LLMs do not generate content based on the textual data provided to them as part of the RAG retrieval process, but rather generate content based on their pre-trained knowledge.
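
To make the premise/hypothesis framing concrete, here is a minimal sketch of how a RAG response could be turned into a pair for judging. The helper name `build_pair` is hypothetical, and joining the retrieved facts with a single space is an assumption made for illustration; neither comes from the HHEM code.

```python
from typing import List, Tuple


def build_pair(retrieved_facts: List[str], summary: str) -> Tuple[str, str]:
    """Return a (premise, hypothesis) pair for one RAG response.

    Assumption: the retrieved facts are joined with a single space to form
    the premise. This joining rule is illustrative, not prescribed by HHEM.
    """
    premise = " ".join(fact.strip() for fact in retrieved_facts)
    return premise, summary.strip()


# The premise is whatever was retrieved, even if it is wrong about the world.
facts = ["The capital of France is Berlin.", "France is in Europe."]
pair = build_pair(facts, "The capital of France is Paris.")
print(pair)
```

The summary is judged only against the premise in the pair, which is why a statement that is true in the real world can still count as hallucinated.
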

## Using HHEM-2.1-Open with `transformers`

HHEM-2.1 introduces breaking changes relative to HHEM-1.0, so code written for HHEM-1.0 will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.

**Using with `Auto` class**

HHEM-2.1-Open can be loaded easily using the `transformers` library. Just remember to set `trust_remote_code=True` to take advantage of the pre-/post-processing code we provided for your convenience. The **input** of the model is a list of pairs of (premise, hypothesis). For each pair, the model will **return** a score between 0 and 1, where 0 means that the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.

```python
from transformers import AutoModelForSequenceClassification

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    'vectara/hallucination_evaluation_model', trust_remote_code=True)

pairs = [  # Test data, List[Tuple[str, str]]
    ("The capital of France is Berlin.", "The capital of France is Paris."),  # factual but hallucinated
    ('I am in California', 'I am in United States.'),  # Consistent
    ('I am in United States', 'I am in California.'),  # Hallucinated
    ("A person on a horse jumps over a broken down airplane.", "A person is outdoors, on a horse."),
    ("A boy is jumping on skateboard in the middle of a red bridge.", "The boy skates down the sidewalk on a red bridge"),
    ("A man with blond-hair, and a brown shirt drinking out of a public water fountain.", "A blond man wearing a brown shirt is reading a book."),
    ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
]

# Use the model to predict
model.predict(pairs)  # note the predict() method. Do not do model(pairs).
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
```

**Using with `text-classification` pipeline**

Please note that when using the `text-classification` pipeline for prediction, scores for two labels will be returned for each pair.
The score for the **consistent** label is the one to focus on.

```python
from transformers import pipeline, AutoTokenizer

pairs = [  # Test data, List[Tuple[str, str]]
    ("The capital of France is Berlin.", "The capital of France is Paris."),
    ('I am in California', 'I am in United States.'),
    ('I am in United States', 'I am in California.'),
    ("A person on a horse jumps over a broken down airplane.", "A person is outdoors, on a horse."),
    ("A boy is jumping on skateboard in the middle of a red bridge.", "The boy skates down the sidewalk on a red bridge"),
    ("A man with blond-hair, and a brown shirt drinking out of a public water fountain.", "A blond man wearing a brown shirt is reading a book."),
    ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
]

# Apply prompt to pairs
prompt = " Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"
input_pairs = [prompt.format(text1=pair[0], text2=pair[1]) for pair in pairs]

# Use text-classification pipeline to predict
classifier = pipeline(
    "text-classification",
    model='vectara/hallucination_evaluation_model',
    tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
    trust_remote_code=True
)
classifier(input_pairs, return_all_scores=True)

# output

# [[{'label': 'hallucinated', 'score': 0.9889384508132935},
#   {'label': 'consistent', 'score': 0.011061512865126133}],
#  [{'label': 'hallucinated', 'score': 0.35263675451278687},
#   {'label': 'consistent', 'score': 0.6473632454872131}],
#  [{'label': 'hallucinated', 'score': 0.870982825756073},
#   {'label': 'consistent', 'score': 0.1290171593427658}],
#  [{'label': 'hallucinated', 'score': 0.1030581071972847},
#   {'label': 'consistent', 'score': 0.8969419002532959}],
#  [{'label': 'hallucinated', 'score': 0.8153750896453857},
#   {'label': 'consistent', 'score': 0.18462494015693665}],
#  [{'label': 'hallucinated', 'score': 0.9949689507484436},
#   {'label': 'consistent', 'score': 0.005031010136008263}],
#  [{'label': 'hallucinated', 'score': 0.9456764459609985},
#   {'label': 'consistent', 'score': 0.05432349815964699}]]
```

You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning for now. It is a notification inherited from the foundation model, T5-base.

Note that the order of a pair is important. For example, the 2nd and 3rd examples in the `pairs` list contain the same two sentences in opposite order, yet they are consistent and hallucinated, respectively.

## HHEM-2.1-Open vs. HHEM-1.0

The major difference between HHEM-2.1-Open and the original HHEM-1.0 is that HHEM-2.1-Open has an unlimited context length, while HHEM-1.0 is capped at 512 tokens. The longer context length allows HHEM-2.1-Open to provide more accurate hallucination detection for RAG, which often needs more than 512 tokens.

The tables below compare the two models on the [AggreFact](https://arxiv.org/pdf/2205.12854) and [RAGTruth](https://arxiv.org/abs/2401.00396) benchmarks. In particular, on AggreFact, we focus on its SOTA subset (denoted as `AggreFact-SOTA`), which contains summaries generated by Google's T5, Meta's BART, and Google's Pegasus, the three latest models in the AggreFact benchmark. The results on RAGTruth's summarization (denoted as `RAGTruth-Summ`) and QA (denoted as `RAGTruth-QA`) subsets are reported separately.
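
As a reading aid for this comparison, the four reported metrics can be computed from binary labels roughly as sketched here. The function name `binary_metrics` is hypothetical, and which class is treated as positive is an assumption: this card does not state it, so it is left as a parameter. This is not the evaluation code used to produce the reported numbers.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Compute balanced accuracy, F1, recall, and precision for binary labels.

    Assumption: `positive` marks the class of interest; the model card does
    not say which class was used as positive in its evaluation.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if (tp + fn) else 0.0          # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
        "recall": recall,
        "precision": precision,
    }


# Toy labels for illustration only, not benchmark data.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(metrics)
```

Balanced accuracy averages the per-class recall, so it is less sensitive to class imbalance than plain accuracy.
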

Table 1: Performance on AggreFact-SOTA

| model                 | Balanced Accuracy | F1     | Recall | Precision |
|:----------------------|---------:|-------:|-------:|----------:|
| HHEM-1.0              | 0.7887 | 0.9047 | 0.7081 | 0.6728 |
| HHEM-2.1-Open         | 0.7655 | 0.6677 | 0.6848 | 0.6513 |

Table 2: Performance on RAGTruth-Summ

| model                 | Balanced Accuracy | F1     | Recall | Precision |
|:----------------------|---------:|-----------:|----------:|----------:|
| HHEM-1.0              | 0.5336 | 0.1577 | 0.0931 | 0.5135 |
| HHEM-2.1-Open         | 0.6442 | 0.4883 | 0.3186 | 0.7558 |

Table 3: Performance on RAGTruth-QA

| model                 | Balanced Accuracy | F1     | Recall | Precision |
|:----------------------|---------:|-----------:|----------:|----------:|
| HHEM-1.0              | 0.5258 | 0.1940 | 0.1625 | 0.2407 |
| HHEM-2.1-Open         | 0.7428 | 0.6000 | 0.5438 | 0.6692 |

The tables above show that HHEM-2.1-Open improves significantly over HHEM-1.0 on the RAGTruth-Summ and RAGTruth-QA benchmarks, while it decreases slightly on the AggreFact-SOTA benchmark. However, when interpreting these results, please note that AggreFact-SOTA is evaluated on relatively older types of LLMs:

- LLMs in AggreFact-SOTA: T5, BART, and Pegasus;
- LLMs in RAGTruth: GPT-4-0613, GPT-3.5-turbo-0613, Llama-2-7B/13B/70B-chat, and Mistral-7B-instruct.

Therefore, we conclude that HHEM-2.1-Open is better than HHEM-1.0.

## Want something more powerful?

As you may have already sensed from the name, HHEM-2.1-Open is the open-source version of the premium HHEM-2.1. HHEM-2.1 (without the `-Open`) is offered exclusively via Vectara's RAG-as-a-service platform. The major difference between HHEM-2.1 and HHEM-2.1-Open is that HHEM-2.1 is cross-lingual across three languages: English, German, and French, while HHEM-2.1-Open is English-only. "Cross-lingual" means any combination of the three languages, e.g., documents in German, query in English, results in French.

### Why RAG in Vectara?

Vectara provides a Trusted Generative AI platform.
The platform allows organizations to rapidly create an AI assistant experience which is grounded in the data, documents, and knowledge that they have. Vectara's serverless RAG-as-a-Service also solves critical problems required for enterprise adoption: it reduces hallucination, provides explainability / provenance, enforces access control, allows for real-time updatability of the knowledge, and mitigates intellectual property / bias concerns from large language models.

To start benefiting from HHEM-2.1, you can [sign up](https://console.vectara.com/signup/?utm_source=huggingface&utm_medium=space&utm_term=hhem-model&utm_content=console&utm_campaign=) for a free Vectara account, and you will get the HHEM-2.1 score returned with every query automatically.

Here are some additional resources:

1. Vectara [API documentation](https://docs.vectara.com/docs).
2. Quick start using Forrest's [vectara-python-cli](https://vectara-python-cli.readthedocs.io/en/latest/crash_course.html).
3. Learn more about Vectara's [Boomerang embedding model](https://vectara.com/blog/introducing-boomerang-vectaras-new-and-improved-retrieval-model/), [Slingshot reranker](https://vectara.com/blog/deep-dive-into-vectara-multilingual-reranker-v1-state-of-the-art-reranker-across-100-languages/), and [Mockingbird LLM](https://vectara.com/blog/mockingbird-a-rag-and-structured-output-focused-llm/).

## LLM Hallucination Leaderboard

If you want to stay up to date with the results of the latest tests that use this model to evaluate the top LLMs, we have a [public leaderboard](https://huggingface.co/spaces/vectara/leaderboard) that is periodically updated, and results are also available on the [GitHub repository](https://github.com/vectara/hallucination-leaderboard).