danlou committed · Commit 64e071b · verified · 1 Parent(s): b52f758

Update README.md

Files changed (1):
  1. README.md +6 -5
README.md CHANGED
@@ -32,7 +32,7 @@ LLMs can be more than smart assistants. In fact, they should have the potential
  Relay is focused on a particular pattern that should be relatively frequent in pre-training datasets: IRC chats. IRC provides a rich context for conversational modeling, combining natural dialogue with command-based interactions. Yet, it remains largely overlooked.
  We found that base LLMs, as small as 12B, can be sufficiently familiar with the basic formatting of IRC to enable the generation of synthetic conversational datasets (see [based-chat-v0.1](https://huggingface.co/datasets/danlou/based-chat-v0.1-Mistral-Nemo-Base-2407)). These synthetic conversations can then be used to fine-tune LLMs towards unlocking reliable turn-based dialogue, within an implicit IRC context that supports use of commands as well.

- Assuming the model used for fine-tuning is the same used for the synthetic dataset, this conversational model is essentially trained with self-supervision (except for conversation starters): no instruct datasets or reward methods. The fine-tuning approach is also lightweight: 4-bit QLoRa (see [Fine-tuning Setup](#fine-tuning-setup)).
+ Assuming the model used for fine-tuning is the same used for the synthetic dataset, this conversational model is essentially trained with self-supervision (except for conversation starters): no instruct datasets or reward methods. The fine-tuning approach is also lightweight: 4-bit QLoRa (see [Fine-tuning setup](#fine-tuning-setup)).
  Nevertheless, Relay can simulate more natural conversations (it’s not an assistant), besides several other applications through creative use of commands (see [How to use](#how-to-use)).

  Post-training methods also support the safety and alignment of LLMs. This important concern was also addressed in the development of the based-chat synthetic dataset, and tested for with the resulting fine-tuned model (see [Safety testing](#safety-testing)).
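
For readers unfamiliar with the recipe, a 4-bit QLoRa fine-tune of the base model can be sketched roughly as below. This is a minimal illustration, not the project's actual training code (see the Fine-tuning setup section and the project's GitHub for that); the LoRA rank, alpha, dropout and target modules are placeholder assumptions.

```python
# Minimal 4-bit QLoRA sketch (illustrative only; hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-Nemo-Base-2407"

# Load the base model quantized to 4-bit NF4, computing in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are updated during training.
lora_config = LoraConfig(
    r=16,                 # placeholder rank, not the project's setting
    lora_alpha=32,        # placeholder
    lora_dropout=0.05,    # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then run a standard causal-LM objective over the based-chat-v0.1
# conversations (e.g. with transformers' Trainer or trl's SFTTrainer).
```

The point of the recipe is that only the adapter weights are trained on top of a quantized base, which is what keeps the fine-tuning lightweight.
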
@@ -85,14 +85,14 @@ More examples available in the [project's GitHub](https://github.com/danlou/rela
  While this model is intended for research purposes, it's still relevant to explore how this conversational model (and its self-supervised approach) compares on safety risk against other conversational models trained on the same base LLM.

  This safety risk was evaluated by measuring refusals on sets of harmful questions compiled specifically for testing safety alignment of LLMs, namely [HarmfulQA](https://huggingface.co/datasets/declare-lab/HarmfulQA) and [CategoricalHarmfulQA](https://huggingface.co/datasets/declare-lab/CategoricalHarmfulQA).
- For this comparison, we also evaluated [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), [dolphin-2.9.3-mistral-nemo-12](https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b) and [Mistral-Nemo-Instruct-2407-abliterated](https://huggingface.co/natong19/Mistral-Nemo-Instruct-2407-abliterated).
- Responses were generated by greedy search, with models loaded as bfloat16. Refusal responses were detected using [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B). The code for this evaluation is available at the [project's GitHub](https://github.com/danlou/relay).
+ For comparison, we also evaluated [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), [dolphin-2.9.3-mistral-nemo-12b](https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b) and [Mistral-Nemo-Instruct-2407-abliterated](https://huggingface.co/natong19/Mistral-Nemo-Instruct-2407-abliterated).
+ Responses were generated by greedy search, with models loaded as bfloat16. Refusal responses were detected using [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B). See [here](https://github.com/danlou/relay/tree/main/safety) for the evaluation code.

- As can be seen in plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM.
+ As can be seen in the plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM. Still, it does not refuse as frequently as Mistral's Instruct fine-tune (or other official LLMs), suggesting a lower chance of false positives (the harmfulness of several of these questions is debatable).

  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f808c5c1adf9100f1f263c/0m-dMagE7yKy1V-EB-fJ3.png" width="800"/>

- TODO
+ It's also worth noting that some refusals are a variation of "I don't know". As with all LLMs, bad actors may be able to find ways around refusals.

  ## Fine-tuning setup

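The refusal measurement described above can be approximated with a short script along these lines. It is a sketch only: the project's actual evaluation code lives in the linked safety/ directory, per-model prompt formatting (chat templates, Relay's IRC format) is omitted, the Relay repo id shown is assumed, and counting a "safe" Llama-Guard verdict on a harmful question as a refusal is a simplification made here for illustration.

```python
# Illustrative refusal-rate sketch; not the project's evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(model_id: str):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"  # models loaded as bfloat16
    )
    return tok, model

def greedy_reply(tok, model, question: str) -> str:
    # Greedy search (do_sample=False); per-model prompt formatting omitted here.
    inputs = tok(question, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def guard_verdict(guard_tok, guard_model, question: str, answer: str) -> str:
    # Llama-Guard-3-8B's chat template builds the safety-classification prompt;
    # the model then emits "safe" or "unsafe" plus a category code.
    chat = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    input_ids = guard_tok.apply_chat_template(chat, return_tensors="pt").to(guard_model.device)
    out = guard_model.generate(input_ids, max_new_tokens=20, do_sample=False)
    return guard_tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Usage sketch: refusal rate over a list of harmful questions.
# questions = [...]  # e.g. loaded from HarmfulQA / CategoricalHarmfulQA
# tok, model = load("danlou/relay-v0.1-Mistral-Nemo-2407")   # assumed repo id
# gtok, gmodel = load("meta-llama/Llama-Guard-3-8B")
# refusals = sum(
#     "unsafe" not in guard_verdict(gtok, gmodel, q, greedy_reply(tok, model, q))
#     for q in questions
# )
# print(refusals / len(questions))
```
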
@@ -106,6 +106,7 @@ TODO

  This model is licensed under [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
  While [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) is licensed under Apache 2.0, this Relay fine-tune is trained with a CC-BY-NC 4.0 dataset ([based-chat-v0.1](https://huggingface.co/datasets/danlou/based-chat-v0.1-Mistral-Nemo-Base-2407)).
+ The `relaylm.py` script is Apache 2.0.

  ## Citation
