iiiorg/piiranha-v1-detect-personal-information

@@ -4,6 +4,11 @@ license: mit
 base_model: microsoft/mdeberta-v3-base
 tags:
 - generated_from_trainer
 metrics:
 - precision
 - recall
@@ -12,20 +17,51 @@ metrics:
 model-index:
 - name: piiranha-1
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# piiranha-1
-This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) on an unknown dataset.
-It achieves the following results on the evaluation set:
 - Loss: 0.0173
-- Precision: 0.9316
-- Recall: 0.9308
-- F1: 0.9312
-- Accuracy: 0.9944
 ## Model description
@@ -104,4 +140,4 @@ The following hyperparameters were used during training:
 - Transformers 4.44.2
 - Pytorch 2.4.1+cu121
 - Datasets 3.0.0
-- Tokenizers 0.19.1

 base_model: microsoft/mdeberta-v3-base
 tags:
 - generated_from_trainer
+- pii
+- privacy
+- personaldata
+- redaction
+- piidetection
 metrics:
 - precision
 - recall
 model-index:
 - name: piiranha-1
   results: []
+datasets:
+- ai4privacy/pii-masking-400k
+language:
+- en
+- it
+- fr
+- de
+- nl
+- es
+pipeline_tag: token-classification
 ---
+# piiranha-v1
+Piiranha is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%.
+Supported languages: English, Spanish, French, German, Italian, Dutch
+Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
+                  precision    recall  f1-score   support
+      ACCOUNTNUM       0.84      0.87      0.85      3575
+     BUILDINGNUM       0.92      0.90      0.91      3252
+            CITY       0.95      0.97      0.96      7270
+CREDITCARDNUMBER       0.94      0.96      0.95      2308
+     DATEOFBIRTH       0.93      0.85      0.89      3389
+DRIVERLICENSENUM       0.96      0.96      0.96      2244
+           EMAIL       1.00      1.00      1.00      6892
+       GIVENNAME       0.87      0.93      0.90     12150
+       IDCARDNUM       0.89      0.94      0.91      3700
+        PASSWORD       0.98      0.98      0.98      2387
+       SOCIALNUM       0.93      0.94      0.93      2709
+          STREET       0.97      0.95      0.96      3331
+         SURNAME       0.89      0.78      0.83      8267
+          TAXNUM       0.97      0.89      0.93      2322
+    TELEPHONENUM       0.99      1.00      0.99      5039
+        USERNAME       0.98      0.98      0.98      7680
+         ZIPCODE       0.94      0.97      0.95      3191
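The per-type figures above can be summarized with a short script. This is a minimal sketch that assumes the four numeric columns are precision, recall, F1, and support, in that order (the usual classification-report layout). The helper names `macro_average` and `weighted_average` are illustrative, not part of the model card. Because the inputs are rounded to two decimals, the averages are approximations and need not match the overall metrics exactly.

```python
# Rows copied from the report above: (label, precision, recall, f1, support).
# Column meaning is an assumption based on the usual classification-report layout.
ROWS = [
    ("ACCOUNTNUM", 0.84, 0.87, 0.85, 3575),
    ("BUILDINGNUM", 0.92, 0.90, 0.91, 3252),
    ("CITY", 0.95, 0.97, 0.96, 7270),
    ("CREDITCARDNUMBER", 0.94, 0.96, 0.95, 2308),
    ("DATEOFBIRTH", 0.93, 0.85, 0.89, 3389),
    ("DRIVERLICENSENUM", 0.96, 0.96, 0.96, 2244),
    ("EMAIL", 1.00, 1.00, 1.00, 6892),
    ("GIVENNAME", 0.87, 0.93, 0.90, 12150),
    ("IDCARDNUM", 0.89, 0.94, 0.91, 3700),
    ("PASSWORD", 0.98, 0.98, 0.98, 2387),
    ("SOCIALNUM", 0.93, 0.94, 0.93, 2709),
    ("STREET", 0.97, 0.95, 0.96, 3331),
    ("SURNAME", 0.89, 0.78, 0.83, 8267),
    ("TAXNUM", 0.97, 0.89, 0.93, 2322),
    ("TELEPHONENUM", 0.99, 1.00, 0.99, 5039),
    ("USERNAME", 0.98, 0.98, 0.98, 7680),
    ("ZIPCODE", 0.94, 0.97, 0.95, 3191),
]


def macro_average(rows, column):
    """Unweighted mean of one metric column (1=precision, 2=recall, 3=f1)."""
    return sum(row[column] for row in rows) / len(rows)


def weighted_average(rows, column):
    """Mean of one metric column, weighted by each label's support."""
    total = sum(row[4] for row in rows)
    return sum(row[column] * row[4] for row in rows) / total


total_support = sum(row[4] for row in ROWS)
weakest = min(ROWS, key=lambda row: row[3])  # label with the lowest F1

print(f"labels: {len(ROWS)}, total support: {total_support}")
print(f"macro F1: {macro_average(ROWS, 3):.3f}")
print(f"support-weighted F1: {weighted_average(ROWS, 3):.3f}")
print(f"lowest F1: {weakest[0]} ({weakest[3]:.2f})")
```

Sorting or filtering `ROWS` by the recall column is a quick way to see which PII types are most likely to be missed.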
+It is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
+It achieves the following results on a test set of ~73,000 sentences containing PII:
+- Accuracy: 99.44%
 - Loss: 0.0173
+- Precision: 93.16%
+- Recall: 93.08%
+- F1: 93.12%
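As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the model card (93.16% and 93.08%).
reported_precision = 0.9316
reported_recall = 0.9308

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.4f}")  # prints 0.9312
```

The result agrees with the reported F1 of 93.12%.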
 ## Model description
 - Transformers 4.44.2
 - Pytorch 2.4.1+cu121
 - Datasets 3.0.0
+- Tokenizers 0.19.1
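Since the card declares `pipeline_tag: token-classification`, one plausible way to use the model for redaction is the Transformers `pipeline` API with `aggregation_strategy="simple"`. The following is a minimal sketch, not the author's documented usage: `MODEL_ID` is assumed from the page header, `redact` and `detect` are hypothetical helpers, and the sample entities are hand-written in the pipeline's output shape so the example runs without downloading the model.

```python
from typing import Dict, List

# Assumed Hub repository id, taken from the page header.
MODEL_ID = "iiiorg/piiranha-v1-detect-personal-information"


def redact(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is expected in the shape returned by a token-classification
    pipeline with aggregation: dicts with entity_group, score, start, end.
    Overlapping spans are not handled in this sketch.
    """
    kept = [ent for ent in entities if ent["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def detect(text: str) -> List[Dict]:
    """Run the model on `text` (downloads weights; needs transformers and torch)."""
    from transformers import pipeline  # imported lazily to keep `redact` dependency-free

    tagger = pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge sub-word tokens into entity spans
    )
    return tagger(text)


# Hand-written entities for illustration only; real output comes from detect().
sample = "Contact Jane at jane@example.com"
fake_entities = [
    {"entity_group": "GIVENNAME", "score": 0.99, "start": 8, "end": 12},
    {"entity_group": "EMAIL", "score": 0.99, "start": 16, "end": 32},
]
print(redact(sample, fake_entities))  # prints: Contact [GIVENNAME] at [EMAIL]
```

With the dependencies installed, `redact(text, detect(text))` would chain the two steps. The `min_score` threshold is a design choice: given the per-type recall figures, lowering it trades more false positives for fewer missed spans.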