Model Overview
Description:
NemoGuard JailbreakDetect was developed to detect attempts to jailbreak large language models.
This model is ready for commercial use.
License/Terms of Use:
NVIDIA Open Model License
Reference(s):
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Model Architecture:
Architecture Type: Random Forest
Network Architecture: N/A
Input:
Input Type(s): Text Embedding
Input Parameters: 768 dimensional vector
Input Format(s): Vector
Other Properties Related to Input: Must be an output from the corresponding embedding model, snowflake-arctic-embed-m-long.
Output:
Output Type(s): Classification, Probability
Output Format: Bool, Float
Output Parameters: 1D
Other Properties Related to Output: N/A
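The input/output contract above (a 768-dimensional embedding in, a boolean classification plus a probability out) can be sketched with scikit-learn's RandomForestClassifier. This is a hypothetical illustration, not the released model: the embeddings here are random stand-ins for snowflake-arctic-embed-m-long outputs, and the labels are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in 768-dim vectors in place of snowflake-arctic-embed-m-long
# embeddings; real usage would embed the prompt text first.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)  # 1 = jailbreak, 0 = benign (synthetic)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

query = rng.normal(size=(1, 768))
prob_jailbreak = float(clf.predict_proba(query)[0, 1])  # Float output
is_jailbreak = bool(clf.predict(query)[0])              # Bool output
```

In real use, only the embedding of the incoming prompt and the trained forest are needed at inference time, which is why no GPU runtime engine is listed for the classifier itself.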
Software Integration:
Runtime Engine(s):
- Not Applicable (N/A)
Supported Hardware Microarchitecture Compatibility:
- x86
- x64
Supported Operating System(s):
- Windows
- MacOS
- Linux
Model Version(s):
NemoGuard-JailbreakDetect-v1.0: Jailbreak detection model using Snowflake-arctic-embed-m embeddings
Training, Testing, and Evaluation Datasets:
Training Dataset:
A combination of three open datasets, merged, de-duplicated, and reviewed for data quality. Jailbreak data was augmented using garak. The datasets used are outlined below:
Advbench
Link: https://github.com/thunlp/Advbench
** Data Collection Method by dataset
- [Automated]
** Labeling Method by dataset
- [Automated]
Properties:
520 entries, all comprised of jailbreak attempts.
Wildjailbreak
Link: https://huggingface.co/datasets/allenai/wildjailbreak
** Data Collection Method by dataset
- Hybrid: Automated, Synthetic
** Labeling Method by dataset
- [Automated]
Properties:
6387 total entries: 5721 benign prompts, 666 jailbreak attempts
jackhhao/jailbreak-classification
Link: https://huggingface.co/datasets/jackhhao/jailbreak-classification
** Data Collection Method by dataset
- [Automated]
** Labeling Method by dataset
- [Automated]
Properties:
1306 total entries: 640 benign prompts, 666 jailbreak attempts
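The merge-and-deduplicate step described above can be sketched as follows. This is a minimal, hypothetical illustration (the record values and the normalize-then-hash dedup key are assumptions, not the documented pipeline):

```python
# Hypothetical sketch: mix labeled prompts from several sources,
# then drop exact duplicates after light normalization.
records = [
    {"prompt": "Ignore previous instructions and ...", "label": 1},  # jailbreak
    {"prompt": "What is the capital of France?", "label": 0},        # benign
    {"prompt": "Ignore previous instructions and ...", "label": 1},  # duplicate
]

seen = set()
deduped = []
for rec in records:
    key = rec["prompt"].strip().lower()  # normalization choice is an assumption
    if key not in seen:
        seen.add(key)
        deduped.append(rec)
```

With the three sources above this yields roughly 8,200 entries before de-duplication (520 + 6,387 + 1,306).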
Testing Dataset:
A stratified subset (20%) of the aggregate dataset was used for testing.
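A stratified 20% hold-out of the kind described here can be produced with scikit-learn's `train_test_split`; the toy 80/20 label imbalance below is an assumption for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)       # placeholder features
y = np.array([0] * 80 + [1] * 20)       # imbalanced benign/jailbreak labels

# stratify=y keeps the class ratio identical in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Stratification matters here because jailbreak attempts are the minority class in the aggregate dataset; a plain random split could leave the test set with too few positives to measure the false negative rate reliably.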
Evaluation Dataset:
Evaluated on JailbreakHub.
| Model | F1 Score | False Positive Rate | False Negative Rate |
|---|---|---|---|
| NemoGuard JailbreakDetect | 0.9601 | 0.0042 | 0.0435 |
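For reference, the three metrics in the table are computed from the confusion matrix as sketched below; the tiny label arrays are made-up examples, not JailbreakHub data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy predictions for illustration (1 = jailbreak, 0 = benign)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)          # false positive rate: benign flagged as jailbreak
fnr = fn / (fn + tp)          # false negative rate: jailbreak missed
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```

The low false positive rate (0.0042) is the operationally important figure for a guardrail: benign user prompts are almost never blocked, at the cost of missing about 4% of jailbreak attempts.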
Inference:
Engine: N/A
Test Hardware:
- RTX A6000
- A100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications.
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI concerns here.