nace-ai/navi-small-preview

@@ -12,7 +12,14 @@ base_model:
 # NAVI verifiers (Nace Automated Verification Intelligence)
-NAVI (Nace Automated Verification Intelligence) is a solution for policy alignment verification designed to review various types of text against documents and policies, and identify violating content. It is specifically optimized for enterprise applications requiring compliance verification for automated text generation. To push policy verification in the open-source community, we release NAVI-small-preview, an open-weights version of the model we have deployed on the platform. NAVI-small-preview is centered around verifying specifically assitant outputs against some policy documents. The full solution is available through [NAVI platform and API](https://naviml.com/).
 - **Developed by:** Nace.AI
 - **Model type:** Policy Alignment Verifier
@@ -108,7 +115,7 @@ NAVI utilizes latest advances in Knowledge Augmentation and Memory in order to i
 #### Training Hyperparameters
-- **Training regime:** We have trained the Lora adapter using all linear modules for all Transformer layers with rank 16, alpha 32, learning rate 5e-5, effective batch size 32. Trained on 8 A100s for 6 epochs under 3 hours using Pytorch Distributed Data Parallel.
 ## Evaluation
@@ -116,7 +123,7 @@ NAVI utilizes latest advances in Knowledge Augmentation and Memory in order to i
 #### Testing Data
-We have manually collected Policy Alignment Verification (PAV) dataset consisting of across different use cases for evaluation. We open source the public subset of the dataset. Here we diclose the performance on the public subset, containing 125 examples across six industry-specific scenarios: AT&T, Airbnb, Cadence Bank, Delta Airlines, Verisk, and Walgreens.
 #### Factors
@@ -128,17 +135,21 @@ F1 score was used to measure performance, prioritizing detection of noncomplianc
 ### Results
-NAVI-small-preview achieved an F1 score of 86.8% on public subset of PAV dataset, outperforming all tested alternatives except full-scale NAVI. We evaluate against general-purpose solutions like Claude and Open AI models, as well as some guardrails focusing on groundedness to demonstrate a clear distinction of policy verification from the more common groundedness verification.
-| Model                    | F1 Score | Avg Latency (ms) |
-|--------------------------|----------|------------------|
-| NAVI-small-preview      | 86.8     | -                |
-| NAVI                    | 90.4     | 387.62           |
-| AWS Bedrock Guardrail   | 76.5     | 342.79           |
-| Azure Groundedness      | 71.2     | 232.71           |
-| NeMo (GPT-4o)           | 71.2     | 2669.68          |
-| GPT-4o (few-shot)       | 75.0     | 904.46           |
-| Sonnet 3.5 (few-shot)   | 75.5     | 2926.69          |
 ## Model Card Contact

 # NAVI verifiers (Nace Automated Verification Intelligence)
+![NAVI Cover](assets/cover.jpg)
+![NAVI Logo](assets/logo.svg)
+NAVI (Nace Automated Verification Intelligence) is a policy alignment verification solution designed to analyze text for compliance with documents and policies, identifying any violations. Optimized for enterprise applications, it supports automated compliance checks for text generation. To encourage open-source adoption, we offer NAVI-small-preview, an open-weights version of the deployed model, focused on verifying assistant outputs against policy documents. The full solution is accessible via the [NAVI platform and API](https://naviml.com/).
+The chart below summarizes the evaluation results: the full NAVI model achieves an F1 score of 90.4%, the highest among all evaluated models. NAVI-small-preview, the open-weights option, reaches an F1 score of 86.8%, ahead of every other tested alternative.
+![Results](assets/results.png)
 - **Developed by:** Nace.AI
 - **Model type:** Policy Alignment Verifier
 #### Training Hyperparameters
+- **Training regime:** We performed a thorough hyperparameter search during fine-tuning. The resulting model is a LoRA adapter applied to all linear modules in all Transformer layers, with rank 16, alpha 32, learning rate 5e-5, and effective batch size 32. It was trained on 8 A100s for 6 epochs using PyTorch Distributed Data Parallel.
 ## Evaluation
 #### Testing Data
+We curated the Policy Alignment Verification (PAV) dataset to evaluate diverse policy verification use cases, releasing a public subset of 125 examples spanning six industry-specific scenarios: AT&T, Airbnb, Cadence Bank, Delta Airlines, Verisk, and Walgreens. This open-sourced subset ensures transparency and facilitates benchmarking of model performance. We evaluate our models and alternative solutions on this test set.
 #### Factors
 ### Results
+The table below shows the performance of the models evaluated on the public subset of the PAV dataset. NAVI-small-preview achieved an F1 score of 86.8%, outperforming all tested alternatives except full-scale NAVI. We evaluate against general-purpose solutions such as Claude and OpenAI models, as well as guardrails that focus on groundedness, to show how policy verification differs from the more common groundedness verification.
+| Model                 | F1 Score (%) | Precision (%) | Recall (%) | Accuracy (%) |
+|-----------------------|--------------|---------------|------------|--------------|
+| Llama-3.1-Storm-8B    | 66.7         | 86.4          | 54.3       | 69.6         |
+| NAVI-small-preview    | 86.8         | 80.5          | 94.3       | 84.0         |
+| NAVI                  | **90.4**     | **93.8**      | **87.1**   | **89.6**     |
+| Sonnet 3.5            | 83.2         | 85.1          | 81.4       | 81.6         |
+| GPT-4o                | 80.5         | 73.8          | 88.6       | 76.0         |
+| AWS Bedrock Guardrail | 74.8         | 87.1          | 65.6       | 67.2         |
+| Azure Groundedness    | 75.0         | 62.3          | 94.3       | 64.8         |
+| NeMo (GPT-4o)         | 69.0         | 67.2          | 70.9       | 72.0         |
+\*NAVI-small-preview is not deployed anywhere; its latency was measured by running the PAV evaluation with vLLM on a single 80GB A100 GPU.
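As a quick sanity check on the table above, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes F1 is the usual harmonic mean of precision and recall for a single (noncompliant) class; the model card does not state how the score was aggregated, so this is an illustration rather than the authors' evaluation code. The `f1_score` helper and the `rows` dictionary are hypothetical names introduced here; the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the results table.
rows = {
    "Llama-3.1-Storm-8B": (86.4, 54.3, 66.7),
    "NAVI-small-preview": (80.5, 94.3, 86.8),
    "NAVI": (93.8, 87.1, 90.4),
    "Sonnet 3.5": (85.1, 81.4, 83.2),
    "GPT-4o": (73.8, 88.6, 80.5),
    "AWS Bedrock Guardrail": (87.1, 65.6, 74.8),
    "Azure Groundedness": (62.3, 94.3, 75.0),
    "NeMo (GPT-4o)": (67.2, 70.9, 69.0),
}

# Largest gap between the recomputed and the reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - reported) for p, r, reported in rows.values())

if __name__ == "__main__":
    for name, (p, r, reported) in rows.items():
        print(f"{name:<22} computed F1 = {f1_score(p, r):5.1f}  reported = {reported}")
```

Under this assumption the recomputed values agree with the reported F1 column to within rounding of the published precision and recall figures.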
 ## Model Card Contact

assets/cover.jpg ADDED

assets/logo.svg ADDED

assets/results.png ADDED