Model Overview
Description:
NemoGuard JailbreakDetect was developed to detect attempts to jailbreak large language models.
This model is ready for commercial use.
License/Terms of Use:
NVIDIA Open Model License
Reference(s):
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Model Architecture:
Architecture Type: Random Forest
Network Architecture: N/A
Input:
Input Type(s): Text Embedding
Input Parameters: 768 dimensional vector
Input Format(s): Vector
Other Properties Related to Input: Must be an output from the corresponding embedding model, snowflake-arctic-embed-m-long.
Output:
Output Type(s): Classification, Probability
Output Format: Bool, Float
Output Parameters: 1D
Other Properties Related to Output: N/A
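The input/output contract above (a 768-dimensional embedding in, a boolean classification plus a probability out) can be sketched with scikit-learn's RandomForestClassifier. This is a hypothetical illustration, not the released model: the embeddings here are random stand-ins for snowflake-arctic-embed-m-long outputs, and the labels are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in 768-dim vectors in place of snowflake-arctic-embed-m-long
# embeddings; real usage would embed the prompt text first.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)  # 1 = jailbreak, 0 = benign (synthetic)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

query = rng.normal(size=(1, 768))
prob_jailbreak = float(clf.predict_proba(query)[0, 1])  # Float output
is_jailbreak = bool(clf.predict(query)[0])              # Bool output
```

In real use, only the embedding of the incoming prompt and the trained forest are needed at inference time, which is why no GPU runtime engine is listed for the classifier itself.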
Software Integration:
Runtime Engine(s):
- Not Applicable (N/A)
Supported Hardware Microarchitecture Compatibility:
- x86
- x64
Supported Operating System(s):
- Windows
- MacOS
- Linux
Model Version(s):
NemoGuard-JailbreakDetect-v1.0: Jailbreak detection model using Snowflake-arctic-embed-m embeddings
Training, Testing, and Evaluation Datasets:
Training Dataset:
A combination of three open datasets, merged, de-duplicated, and reviewed for data quality. Jailbreak data was augmented using garak. The datasets used are outlined below:
Advbench
Link: https://github.com/thunlp/Advbench
** Data Collection Method by dataset
- [Automated]
** Labeling Method by dataset
- [Automated]
Properties:
520 entries, all comprised of jailbreak attempts.
Wildjailbreak
Link: https://huggingface.co/datasets/allenai/wildjailbreak
** Data Collection Method by dataset
- Hybrid: Automated, Synthetic
** Labeling Method by dataset
- [Automated]
Properties:
6387 total entries: 5721 benign prompts, 666 jailbreak attempts
jackhhao/jailbreak-classification
Link: https://huggingface.co/datasets/jackhhao/jailbreak-classification
** Data Collection Method by dataset
- [Automated]
** Labeling Method by dataset
- [Automated]
Properties:
1306 total entries: 640 benign prompts, 666 jailbreak attempts
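The merge-and-deduplicate step described above can be sketched as follows. This is a minimal, hypothetical illustration (the record values and the normalize-then-hash dedup key are assumptions, not the documented pipeline):

```python
# Hypothetical sketch: mix labeled prompts from several sources,
# then drop exact duplicates after light normalization.
records = [
    {"prompt": "Ignore previous instructions and ...", "label": 1},  # jailbreak
    {"prompt": "What is the capital of France?", "label": 0},        # benign
    {"prompt": "Ignore previous instructions and ...", "label": 1},  # duplicate
]

seen = set()
deduped = []
for rec in records:
    key = rec["prompt"].strip().lower()  # normalization choice is an assumption
    if key not in seen:
        seen.add(key)
        deduped.append(rec)
```

With the three sources above this yields roughly 8,200 entries before de-duplication (520 + 6,387 + 1,306).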
Testing Dataset:
A stratified subset (20%) of the aggregate dataset was used for testing.
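A stratified 20% hold-out of the kind described here can be produced with scikit-learn's `train_test_split`; the toy 80/20 label imbalance below is an assumption for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)       # placeholder features
y = np.array([0] * 80 + [1] * 20)       # imbalanced benign/jailbreak labels

# stratify=y keeps the class ratio identical in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Stratification matters here because jailbreak attempts are the minority class in the aggregate dataset; a plain random split could leave the test set with too few positives to measure the false negative rate reliably.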
Evaluation Dataset:
Evaluated on JailbreakHub.
| Model | F1 Score | False Positive Rate | False Negative Rate |
|---|---|---|---|
| NemoGuard JailbreakDetect | 0.9601 | 0.0042 | 0.0435 |
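For reference, the three metrics in the table are computed from the confusion matrix as sketched below; the tiny label arrays are made-up examples, not JailbreakHub data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy predictions for illustration (1 = jailbreak, 0 = benign)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)          # false positive rate: benign flagged as jailbreak
fnr = fn / (fn + tp)          # false negative rate: jailbreak missed
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```

The low false positive rate (0.0042) is the operationally important figure for a guardrail: benign user prompts are almost never blocked, at the cost of missing about 4% of jailbreak attempts.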
Inference:
Engine: N/A
Test Hardware:
- RTX A6000
- A100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications.
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI concerns here.