Forrest Bao
committed on
Commit
·
402fb1d
1
Parent(s):
ade58fc
add comparison with GPTs and migration notice; link to web app
README.md
CHANGED
@@ -7,11 +7,19 @@ pipline_tag: text-classficiation
|
|
7 |
|
8 |
<img src="https://huggingface.co/vectara/hallucination_evaluation_model/resolve/main/candle.png" width="50" height="50" style="display: inline;"> In Loving memory of Simon Mark Hughes...
|
9 |
10 |
|
11 |
HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open) created by [Vectara](https://vectara.com) in November 2023. The HHEM model series are designed for detecting hallucinations in LLMs. They are particularly useful in the context of building retrieval-augmented-generation (RAG) applications where a set of facts is summarized by an LLM, and HHEM can be used to measure the extent to which this summary is factually consistent with the facts.
|
12 |
|
13 |
If you are interested in learning more about RAG or experimenting with Vectara, you can [sign up](https://console.vectara.com/signup/?utm_source=huggingface&utm_medium=space&utm_term=hhem-model&utm_content=console&utm_campaign=) for a free Vectara account.
|
14 |
|
|
|
|
|
15 |
## Hallucination Detection 101
|
16 |
By "hallucinated" or "factually inconsistent", we mean that a text (hypothesis, to be judged) is not supported by another text (evidence/premise, given). You **always need two** pieces of text to determine whether a text is hallucinated or not. When applied to RAG (retrieval augmented generation), the LLM is provided with several pieces of text (often called facts or context) retrieved from some dataset, and a hallucination would indicate that the summary (hypothesis) is not supported by those facts (evidence).
|
17 |
|
@@ -20,6 +28,8 @@ For example, given the premise _"The capital of France is Berlin"_, the hypothes
|
|
20 |
|
21 |
## Using HHEM-2.1-Open
|
22 |
|
|
|
|
|
23 |
HHEM-2.1-Open can be loaded easily using the `transformers` library. Just remember to set `trust_remote_code=True` to take advantage of the pre-/post-processing code we provided for your convenience. The **input** of the model is a list of pairs of (premise, hypothesis). For each pair, the model will **return** a score between 0 and 1, where 0 means that the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.
|
24 |
|
25 |
```python
|
@@ -44,40 +54,58 @@ model.predict(pairs) # note the predict() method. Do not do model(pairs).
|
|
44 |
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
|
45 |
```
|
46 |
|
47 |
-
|
|
|
|
|
48 |
|
49 |
|
50 |
## HHEM-2.1-Open vs. HHEM-1.0
|
51 |
|
52 |
The major difference between HHEM-2.1-Open and the original HHEM-1.0 is that HHEM-2.1-Open has an unlimited context length, while HHEM-1.0 is capped at 512 tokens. The longer context length allows HHEM-2.1-Open to provide more accurate hallucination detection for RAG, which often needs more than 512 tokens.
|
53 |
|
54 |
-
The tables below compare the two models on the [AggreFact](https://arxiv.org/pdf/2205.12854) and [RAGTruth](https://arxiv.org/abs/2401.00396) benchmarks. In
|
55 |
|
56 |
Table 1: Performance on AggreFact-SOTA
|
57 |
| model | Balanced Accuracy | F1 | Recall | Precision |
|
58 |
-
|
59 |
-
| HHEM-1.0
|
60 |
-
| HHEM-2.1-Open
|
|
|
|
|
61 |
|
62 |
Table 2: Performance on RAGTruth-Summ
|
63 |
| model | Balanced Accuracy | F1 | Recall | Precision |
|
64 |
|:----------------------|---------:|-----------:|----------:|----------:|
|
65 |
-
| HHEM-1.0 |
|
66 |
-
| HHEM-2.1-Open |
|
|
|
|
|
67 |
|
68 |
Table 3: Performance on RAGTruth-QA
|
69 |
| model | Balanced Accuracy | F1 | Recall | Precision |
|
70 |
|:----------------------|---------:|-----------:|----------:|----------:|
|
71 |
-
| HHEM-1.0 |
|
72 |
-
| HHEM-2.1-Open |
|
|
|
|
|
73 |
|
74 |
-
The tables above show that HHEM-2.1-Open has a significant improvement over HHEM-1.0 in the RAGTruth-Summ and RAGTruth-QA benchmarks, while it has a slight decrease in the AggreFact-SOTA benchmark. However when
|
75 |
- LLMs in AggreFact-SOTA: T5, BART, and Pegasus;
|
76 |
- LLMs in RAGTruth: GPT-4-0613, GPT-3.5-turbo-0613, Llama-2-7B/13B/70B-chat, and Mistral-7B-instruct.
|
77 |
|
78 |
-
79 |
|
80 |
-
##
|
81 |
|
82 |
As you may have already sensed from the name, HHEM-2.1-Open is the open source version of the premium HHEM-2.1. HHEM-2.1 (without the `-Open`) is offered exclusively via Vectara's RAG-as-a-service platform. The major difference between HHEM-2.1 and HHEM-2.1-Open is that HHEM-2.1 is cross-lingual on three languages: English, German, and French, while HHEM-2.1-Open is English-only. "Cross-lingual" means any combination of the three languages, e.g., documents in German, query in English, results in French.
|
83 |
|
|
|
7 |
|
8 |
<img src="https://huggingface.co/vectara/hallucination_evaluation_model/resolve/main/candle.png" width="50" height="50" style="display: inline;"> In Loving memory of Simon Mark Hughes...
|
9 |
|
10 |
+
**Highlights**:
|
11 |
+
* HHEM-2.1-Open shows a significant improvement over HHEM-1.0.
|
12 |
+
* HHEM-2.1-Open outperforms GPT-3.5-Turbo and even GPT-4.
|
13 |
+
* HHEM-2.1-Open can be run on consumer-grade hardware, using less than 600 MB of RAM at 32-bit precision and taking around 1.5 seconds to process a 2k-token input on a modern x86 CPU.
|
14 |
+
|
15 |
+
**To HHEM-1.0 users**: HHEM-2.1-Open introduces breaking changes to the usage. Please update your code according to the [new usage](#using-hhem-21-open) below. We are working on making it compatible with `transformers.pipeline` and Hugging Face's Inference Endpoints. We apologize for the inconvenience.
|
16 |
|
17 |
HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open) created by [Vectara](https://vectara.com) in November 2023. The HHEM model series are designed for detecting hallucinations in LLMs. They are particularly useful in the context of building retrieval-augmented-generation (RAG) applications where a set of facts is summarized by an LLM, and HHEM can be used to measure the extent to which this summary is factually consistent with the facts.
|
18 |
|
19 |
If you are interested in learning more about RAG or experimenting with Vectara, you can [sign up](https://console.vectara.com/signup/?utm_source=huggingface&utm_medium=space&utm_term=hhem-model&utm_content=console&utm_campaign=) for a free Vectara account.
|
20 |
|
21 |
+
[**Try out HHEM-2.1-Open from your browser without coding**](http://13.57.203.109:3000/)
|
22 |
+
|
23 |
## Hallucination Detection 101
|
24 |
By "hallucinated" or "factually inconsistent", we mean that a text (hypothesis, to be judged) is not supported by another text (evidence/premise, given). You **always need two** pieces of text to determine whether a text is hallucinated or not. When applied to RAG (retrieval augmented generation), the LLM is provided with several pieces of text (often called facts or context) retrieved from some dataset, and a hallucination would indicate that the summary (hypothesis) is not supported by those facts (evidence).
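As a minimal sketch of the pairing described above, the snippet below assembles one (premise, hypothesis) pair for a RAG setting. The helper name `make_pair`, the choice to join the retrieved facts with a single space, and the example strings are all illustrative assumptions; none of them come from HHEM itself.

```python
from typing import List, Tuple


def make_pair(retrieved_facts: List[str], summary: str) -> Tuple[str, str]:
    """Build one (premise, hypothesis) pair.

    Hypothetical helper: joining the facts with a space is an assumption,
    not something prescribed by this README.
    """
    # The retrieved facts together form the evidence (premise).
    premise = " ".join(fact.strip() for fact in retrieved_facts)
    # The LLM-generated summary is the text to be judged (hypothesis).
    hypothesis = summary.strip()
    return premise, hypothesis


# Made-up example data, for illustration only.
facts = ["The store opens at 9 am.", "The store closes at 5 pm."]
summary = "The store is open from 9 am to 5 pm."

pair = make_pair(facts, summary)
print(pair)
```

The premise always goes first and the hypothesis second, because the judgment is about whether the second text is supported by the first.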
|
25 |
|
|
|
28 |
|
29 |
## Using HHEM-2.1-Open
|
30 |
|
31 |
+
HHEM-2.1-Open introduces breaking changes from HHEM-1.0, so code written for HHEM-1.0 will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.
|
32 |
+
|
33 |
HHEM-2.1-Open can be loaded easily using the `transformers` library. Just remember to set `trust_remote_code=True` to take advantage of the pre-/post-processing code we provided for your convenience. The **input** of the model is a list of pairs of (premise, hypothesis). For each pair, the model will **return** a score between 0 and 1, where 0 means that the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.
|
34 |
|
35 |
```python
|
|
|
54 |
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
|
55 |
```
|
56 |
|
57 |
+
You may see the warning "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning for now. It is inherited from the foundation model, T5-base.
|
58 |
+
|
59 |
+
Note that the order within a pair matters: the premise comes first and the hypothesis second. For example, the 2nd and 3rd examples in the `pairs` list are consistent and hallucinated, respectively.
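To turn the scores into labels, one possible approach is a simple cut-off. The sketch below reuses the scores from the example output above; the 0.5 threshold and the `label` helper are assumptions made for illustration, not an official recommendation, and the threshold may need tuning for a given application.

```python
# Scores copied from the example output in this README.
scores = [0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543]

# Assumption: 0.5 is an illustrative cut-off, not an official recommendation.
THRESHOLD = 0.5


def label(score: float, threshold: float = THRESHOLD) -> str:
    """Map a consistency score in [0, 1] to a coarse label (hypothetical helper)."""
    return "consistent" if score >= threshold else "hallucinated"


labels = [label(s) for s in scores]
for score, lab in zip(scores, labels):
    print(f"{score:.4f} -> {lab}")
```

With this cut-off, the 2nd pair is labeled consistent and the 3rd hallucinated, which matches the description above.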
|
60 |
|
61 |
|
62 |
## HHEM-2.1-Open vs. HHEM-1.0
|
63 |
|
64 |
The major difference between HHEM-2.1-Open and the original HHEM-1.0 is that HHEM-2.1-Open has an unlimited context length, while HHEM-1.0 is capped at 512 tokens. The longer context length allows HHEM-2.1-Open to provide more accurate hallucination detection for RAG, which often needs more than 512 tokens.
|
65 |
|
66 |
+
The tables below compare the two models, together with GPT-3.5-Turbo and GPT-4, on the [AggreFact](https://arxiv.org/pdf/2205.12854) and [RAGTruth](https://arxiv.org/abs/2401.00396) benchmarks. On AggreFact, we focus on its SOTA subset (denoted as `AggreFact-SOTA`), which contains summaries generated by Google's T5, Meta's BART, and Google's Pegasus, the three latest models in the AggreFact benchmark. The results on RAGTruth's summarization (denoted as `RAGTruth-Summ`) and QA (denoted as `RAGTruth-QA`) subsets are reported separately. The GPT-3.5-Turbo and GPT-4 versions are 01-25 and 06-13, respectively. The zero-shot results of the two GPT models were obtained using the prompt template in [this paper](https://arxiv.org/pdf/2303.15621).
|
67 |
|
68 |
Table 1: Performance on AggreFact-SOTA
| model | Balanced Accuracy | F1 | Recall | Precision |
|:------------------------|---------:|-------:|-------:|----------:|
| HHEM-1.0 | 78.87% | 90.47% | 70.81% | 67.28% |
| HHEM-2.1-Open | 76.55% | 66.77% | 68.48% | 65.13% |
| GPT-3.5-Turbo zero-shot | 72.19% | 60.88% | 58.48% | 63.48% |
| GPT-4 06-13 zero-shot | 73.78% | 63.86% | 53.03% | 80.27% |
|
75 |
|
76 |
Table 2: Performance on RAGTruth-Summ
| model | Balanced Accuracy | F1 | Recall | Precision |
|:----------------------|---------:|-----------:|----------:|----------:|
| HHEM-1.0 | 53.36% | 15.77% | 9.31% | 51.35% |
| HHEM-2.1-Open | 64.42% | 44.83% | 31.86% | 75.58% |
| GPT-3.5-Turbo zero-shot | 58.49% | 29.72% | 18.14% | 82.22% |
| GPT-4 06-13 zero-shot | 62.62% | 40.59% | 26.96% | 82.09% |
|
83 |
|
84 |
Table 3: Performance on RAGTruth-QA
| model | Balanced Accuracy | F1 | Recall | Precision |
|:----------------------|---------:|-----------:|----------:|----------:|
| HHEM-1.0 | 52.58% | 19.40% | 16.25% | 24.07% |
| HHEM-2.1-Open | 74.28% | 60.00% | 54.38% | 66.92% |
| GPT-3.5-Turbo zero-shot | 56.16% | 25.00% | 18.13% | 40.28% |
| GPT-4 06-13 zero-shot | 74.11% | 57.78% | 56.88% | 58.71% |
|
91 |
|
92 |
+
The tables above show that HHEM-2.1-Open improves significantly on HHEM-1.0 on the RAGTruth-Summ and RAGTruth-QA benchmarks, while decreasing slightly on the AggreFact-SOTA benchmark. However, when interpreting these results, please note that AggreFact-SOTA is evaluated on relatively older types of LLMs:
|
93 |
- LLMs in AggreFact-SOTA: T5, BART, and Pegasus;
|
94 |
- LLMs in RAGTruth: GPT-4-0613, GPT-3.5-turbo-0613, Llama-2-7B/13B/70B-chat, and Mistral-7B-instruct.
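For readers less familiar with the four metrics in the tables, here is a minimal sketch of their standard definitions for binary labels. Which class the evaluation treats as positive is not stated in this README, so the `positive=1` default, the function names, and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Standard binary metrics; the positive class is an assumption."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1_score(precision, recall),
        "recall": recall,
        "precision": precision,
    }


# Toy labels, for illustration only (not benchmark data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Sanity check against Table 1: precision and recall of HHEM-2.1-Open.
table1_f1 = f1_score(0.6513, 0.6848)
print(f"{table1_f1:.4f}")
```

Plugging the HHEM-2.1-Open precision and recall from Table 1 into `f1_score` gives a value close to the 66.77% reported there; the small gap is consistent with rounding of the inputs.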
|
95 |
|
96 |
+
## HHEM-2.1-Open vs. GPT-3.5-Turbo and GPT-4
|
97 |
+
|
98 |
+
From the tables above we can also conclude that HHEM-2.1-Open outperforms both GPT-3.5-Turbo and GPT-4 in balanced accuracy on all three benchmarks. The quantitative advantage of HHEM-2.1-Open over GPT-3.5-Turbo and GPT-4 is summarized in Table 4 below.
|
99 |
+
|
100 |
+
Table 4: Balanced-accuracy lead of HHEM-2.1-Open over GPT-3.5-Turbo and GPT-4, in percentage points
| | AggreFact-SOTA | RAGTruth-Summ | RAGTruth-QA |
|:----------------------|---------:|-----------:|----------:|
| HHEM-2.1-Open **over** GPT-3.5-Turbo | 4.36% | 5.93% | 18.12% |
| HHEM-2.1-Open **over** GPT-4 | 2.64% | 1.80% | 0.17% |
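The margins in Table 4 are differences between the balanced accuracies in Tables 1 to 3. The sketch below recomputes them from those values; the dictionary layout and the `margin` helper are illustrative assumptions. Note that for AggreFact-SOTA against GPT-4, the recomputed margin is 2.77 rather than the 2.64 listed in Table 4, so one of the two figures may contain a typo.

```python
# Balanced accuracies (in %) copied from Tables 1 to 3.
balanced_accuracy = {
    "AggreFact-SOTA": {"HHEM-2.1-Open": 76.55, "GPT-3.5-Turbo": 72.19, "GPT-4": 73.78},
    "RAGTruth-Summ": {"HHEM-2.1-Open": 64.42, "GPT-3.5-Turbo": 58.49, "GPT-4": 62.62},
    "RAGTruth-QA": {"HHEM-2.1-Open": 74.28, "GPT-3.5-Turbo": 56.16, "GPT-4": 74.11},
}


def margin(benchmark: str, baseline: str) -> float:
    """Percentage-point lead of HHEM-2.1-Open over a baseline (hypothetical helper)."""
    row = balanced_accuracy[benchmark]
    # Round to two decimals to avoid floating-point noise in the difference.
    return round(row["HHEM-2.1-Open"] - row[baseline], 2)


margins = {
    baseline: {bench: margin(bench, baseline) for bench in balanced_accuracy}
    for baseline in ("GPT-3.5-Turbo", "GPT-4")
}

for baseline, row in margins.items():
    print(baseline, row)
```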
|
105 |
+
|
106 |
+
Another advantage of HHEM-2.1-Open is its efficiency. HHEM-2.1-Open can be run on consumer-grade hardware, using less than 600 MB of RAM at 32-bit precision and taking around 1.5 seconds to process a 2k-token input on a modern x86 CPU.
|
107 |
|
108 |
+
## HHEM-2.1: The more powerful, proprietary counterpart of HHEM-2.1-Open
|
109 |
|
110 |
As you may have already sensed from the name, HHEM-2.1-Open is the open source version of the premium HHEM-2.1. HHEM-2.1 (without the `-Open`) is offered exclusively via Vectara's RAG-as-a-service platform. The major difference between HHEM-2.1 and HHEM-2.1-Open is that HHEM-2.1 is cross-lingual on three languages: English, German, and French, while HHEM-2.1-Open is English-only. "Cross-lingual" means any combination of the three languages, e.g., documents in German, query in English, results in French.
|
111 |
|