merge miaoran into forrest
Browse files- README.md +54 -1
- config.json +2 -1
- modeling_hhem_v2.py +9 -2
README.md
CHANGED
@@ -26,7 +26,11 @@ By "hallucinated" or "factually inconsistent", we mean that a text (hypothesis,
|
|
26 |
A common type of hallucination in RAG is **factual but hallucinated**.
|
27 |
For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated -- although it is true in the world knowledge. This happens when LLMs do not generate content based on the textual data provided to them as part of the RAG retrieval process, but rather generate content based on their pre-trained knowledge.
|
28 |
|
29 |
-
## Using HHEM-2.1-Open
|
|
|
|
|
|
|
|
|
30 |
|
31 |
HHEM-2.1 has some breaking changes from HHEM-1.0. Your previous code will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.
|
32 |
|
@@ -54,6 +58,55 @@ model.predict(pairs) # note the predict() method. Do not do model(pairs).
|
|
54 |
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
|
55 |
```
|
56 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
57 |
You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning for now. It is a notification inherited from the foundation model, T5-base.
|
58 |
|
59 |
Note that the order of a pair is important. For example, the 2nd and 3rd examples in the `pairs` list are consistent and hallucinated, respectively.
|
|
|
26 |
A common type of hallucination in RAG is **factual but hallucinated**.
|
27 |
For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated -- although it is true in the world knowledge. This happens when LLMs do not generate content based on the textual data provided to them as part of the RAG retrieval process, but rather generate content based on their pre-trained knowledge.
|
28 |
|
29 |
+
## Using HHEM-2.1-Open with `transformers`
|
30 |
+
|
31 |
+
HHEM-2.1 has some breaking changes from HHEM-1.0. Your previous code will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.
|
32 |
+
|
33 |
+
**Using with `Auto` class**
|
34 |
|
35 |
HHEM-2.1 has some breaking changes from HHEM-1.0. Your previous code will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.
|
36 |
|
|
|
58 |
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
|
59 |
```
|
60 |
|
61 |
+
|
62 |
+
**Using with `text-classification` pipeline**
|
63 |
+
|
64 |
+
Please note that when using the `text-classification` pipeline for prediction, scores for two labels will be returned for each pair. The score for the **consistent** label is the one that should be focused on.
|
65 |
+
|
66 |
+
```python
|
67 |
+
from transformers import pipeline, AutoTokenizer
|
68 |
+
|
69 |
+
pairs = [
|
70 |
+
("The capital of France is Berlin.", "The capital of France is Paris."),
|
71 |
+
('I am in California', 'I am in United States.'),
|
72 |
+
('I am in United States', 'I am in California.'),
|
73 |
+
("A person on a horse jumps over a broken down airplane.", "A person is outdoors, on a horse."),
|
74 |
+
("A boy is jumping on skateboard in the middle of a red bridge.", "The boy skates down the sidewalk on a red bridge"),
|
75 |
+
("A man with blond-hair, and a brown shirt drinking out of a public water fountain.", "A blond man wearing a brown shirt is reading a book."),
|
76 |
+
("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
|
77 |
+
]
|
78 |
+
|
79 |
+
# Apply prompt to pairs
|
80 |
+
prompt = "<pad> Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"
|
81 |
+
input_pairs = [prompt.format(text1=pair[0], text2=pair[1]) for pair in pairs]
|
82 |
+
|
83 |
+
# Use text-classification pipeline to predict
|
84 |
+
classifier = pipeline(
|
85 |
+
"text-classification",
|
86 |
+
model='vectara/hallucination_evaluation_model',
|
87 |
+
tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
|
88 |
+
trust_remote_code=True
|
89 |
+
)
|
90 |
+
classifier(input_pairs, return_all_scores=True)
|
91 |
+
|
92 |
+
# output
|
93 |
+
|
94 |
+
# [[{'label': 'hallucinated', 'score': 0.9889384508132935},
|
95 |
+
# {'label': 'consistent', 'score': 0.011061512865126133}],
|
96 |
+
# [{'label': 'hallucinated', 'score': 0.35263675451278687},
|
97 |
+
# {'label': 'consistent', 'score': 0.6473632454872131}],
|
98 |
+
# [{'label': 'hallucinated', 'score': 0.870982825756073},
|
99 |
+
# {'label': 'consistent', 'score': 0.1290171593427658}],
|
100 |
+
# [{'label': 'hallucinated', 'score': 0.1030581071972847},
|
101 |
+
# {'label': 'consistent', 'score': 0.8969419002532959}],
|
102 |
+
# [{'label': 'hallucinated', 'score': 0.8153750896453857},
|
103 |
+
# {'label': 'consistent', 'score': 0.18462494015693665}],
|
104 |
+
# [{'label': 'hallucinated', 'score': 0.9949689507484436},
|
105 |
+
# {'label': 'consistent', 'score': 0.005031010136008263}],
|
106 |
+
# [{'label': 'hallucinated', 'score': 0.9456764459609985},
|
107 |
+
# {'label': 'consistent', 'score': 0.05432349815964699}]]
|
108 |
+
```
|
109 |
+
|
110 |
You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore this warning for now. It is a notification inherited from the foundation model, T5-base.
|
111 |
|
112 |
Note that the order of a pair is important. For example, the 2nd and 3rd examples in the `pairs` list are consistent and hallucinated, respectively.
|
config.json
CHANGED
@@ -8,5 +8,6 @@
|
|
8 |
},
|
9 |
"model_type": "HHEMv2Config",
|
10 |
"torch_dtype": "float32",
|
11 |
-
"transformers_version": "4.39.3"
|
|
|
12 |
}
|
|
|
8 |
},
|
9 |
"model_type": "HHEMv2Config",
|
10 |
"torch_dtype": "float32",
|
11 |
+
"transformers_version": "4.39.3",
|
12 |
+
"id2label": {"0": "hallucinated", "1": "consistent"}
|
13 |
}
|
modeling_hhem_v2.py
CHANGED
@@ -45,8 +45,15 @@ class HHEMv2ForSequenceClassification(PreTrainedModel):
|
|
45 |
# combined_model = PeftModel.from_pretrained(base_model, checkpoint, is_trainable=False)
|
46 |
# self.t5 = combined_model
|
47 |
|
48 |
-
def forward(self, **kwargs):
|
49 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
50 |
|
51 |
def predict(self, text_pairs):
|
52 |
tokenizer = self.tokenzier
|
|
|
45 |
# combined_model = PeftModel.from_pretrained(base_model, checkpoint, is_trainable=False)
|
46 |
# self.t5 = combined_model
|
47 |
|
48 |
+
def forward(self, **kwargs):  # To cope with the `text-classification` pipeline
    """Run the wrapped T5 model in inference mode and reduce its logits.

    The `text-classification` pipeline expects per-example classification
    logits, so this keeps only the logits at the first sequence position
    (``logits[:, 0, :]``).
    NOTE(review): assumes ``outputs.logits`` is 3-D, i.e. indexable as
    (batch, seq, classes) — confirm against the underlying T5 head.

    Args:
        **kwargs: Tokenized model inputs, forwarded verbatim to ``self.t5``.

    Returns:
        The T5 output object, with its ``logits`` attribute replaced in
        place by the first-position slice (one row of class logits per
        input example).
    """
    self.t5.eval()  # inference-only path: disable dropout etc.
    with torch.no_grad():  # no gradients needed at inference; saves memory
        outputs = self.t5(**kwargs)
    # Collapse (batch, seq, classes) -> (batch, classes): only the first
    # position carries the hallucinated/consistent decision the pipeline reads.
    outputs.logits = outputs.logits[:, 0, :]
    return outputs
|
57 |
|
58 |
def predict(self, text_pairs):
|
59 |
tokenizer = self.tokenzier
|