Update README.md
README.md
CHANGED
@@ -4,6 +4,11 @@ license: mit
 base_model: microsoft/mdeberta-v3-base
 tags:
 - generated_from_trainer
 metrics:
 - precision
 - recall
@@ -12,20 +17,51 @@ metrics:
 model-index:
 - name: piiranha-1
   results: []
 ---

-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->

-# piiranha-1

-
-It
 - Loss: 0.0173
-- Precision:
-- Recall:
-- F1:
-- Accuracy: 0.9944

 ## Model description

@@ -104,4 +140,4 @@ The following hyperparameters were used during training:
 - Transformers 4.44.2
 - Pytorch 2.4.1+cu121
 - Datasets 3.0.0
-- Tokenizers 0.19.1
 base_model: microsoft/mdeberta-v3-base
 tags:
 - generated_from_trainer
+- pii
+- privacy
+- personaldata
+- redaction
+- piidetection
 metrics:
 - precision
 - recall
 model-index:
 - name: piiranha-1
   results: []
+datasets:
+- ai4privacy/pii-masking-400k
+language:
+- en
+- it
+- fr
+- de
+- nl
+- es
+pipeline_tag: token-classification
 ---

+# piiranha-v1
+Piiranha is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%.
+
+Supported languages: English, Spanish, French, German, Italian, Dutch
+Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
+
+ACCOUNTNUM        0.84  0.87  0.85   3575
+BUILDINGNUM       0.92  0.90  0.91   3252
+CITY              0.95  0.97  0.96   7270
+CREDITCARDNUMBER  0.94  0.96  0.95   2308
+DATEOFBIRTH       0.93  0.85  0.89   3389
+DRIVERLICENSENUM  0.96  0.96  0.96   2244
+EMAIL             1.00  1.00  1.00   6892
+GIVENNAME         0.87  0.93  0.90  12150
+IDCARDNUM         0.89  0.94  0.91   3700
+PASSWORD          0.98  0.98  0.98   2387
+SOCIALNUM         0.93  0.94  0.93   2709
+STREET            0.97  0.95  0.96   3331
+SURNAME           0.89  0.78  0.83   8267
+TAXNUM            0.97  0.89  0.93   2322
+TELEPHONENUM      0.99  1.00  0.99   5039
+USERNAME          0.98  0.98  0.98   7680
+ZIPCODE           0.94  0.97  0.95   3191
+
+It is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
+It achieves the following results on a test set of ~73,000 sentences containing PII:
+- Accuracy: 99.44%
 - Loss: 0.0173
+- Precision: 93.16%
+- Recall: 93.08%
+- F1: 93.12%

 ## Model description

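The figures added in this hunk can be sanity-checked, since F1 is the harmonic mean of precision and recall. The sketch below assumes the three middle columns of the per-label rows are precision, recall and F1 in that order, with the last column a support count; the README gives no header, so that reading is an assumption. The helper names `f1_score` and `consistent` are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def consistent(precision: float, recall: float, reported: float, tol: float = 0.01) -> bool:
    """True if the reported F1 matches the recomputed one within rounding tolerance."""
    return abs(f1_score(precision, recall) - reported) <= tol


# Overall precision and recall quoted in the README, as percentages.
overall_f1 = f1_score(93.16, 93.08)

# A few per-label rows copied from the diff: (precision, recall, reported F1).
# The column meaning is assumed, not stated in the source.
rows = {
    "ACCOUNTNUM": (0.84, 0.87, 0.85),
    "DATEOFBIRTH": (0.93, 0.85, 0.89),
    "SURNAME": (0.89, 0.78, 0.83),
}

print(f"overall F1: {overall_f1:.2f}%")
for label, (p, r, reported) in rows.items():
    print(label, round(f1_score(p, r), 3), consistent(p, r, reported))
```

Under that column reading, the recomputed overall F1 rounds to the 93.12% stated in the README, and the sampled rows agree with their reported F1 to within two-decimal rounding.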
 - Transformers 4.44.2
 - Pytorch 2.4.1+cu121
 - Datasets 3.0.0
+- Tokenizers 0.19.1
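The new metadata sets `pipeline_tag: token-classification` and tags the model for redaction. As a rough illustration of how span predictions could be turned into redacted text, the sketch below assumes each prediction is a dict with `entity_group`, `start` and `end` keys, which is the shape the `transformers` token-classification pipeline returns when an `aggregation_strategy` is set. The function name `redact`, the sample sentence and the sample predictions are all made up for illustration; no model is loaded.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each predicted character span with a [LABEL] placeholder.

    Assumes spans do not overlap; overlapping spans would need merging first.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hypothetical input and hand-written predictions standing in for model output.
sample = "Contact Ana at ana@example.com"
predictions = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 11},
    {"entity_group": "EMAIL", "start": 15, "end": 30},
]

redacted = redact(sample, predictions)
print(redacted)  # Contact [GIVENNAME] at [EMAIL]
```

With a real checkpoint, `predictions` would come from something like `pipeline("token-classification", model=..., aggregation_strategy="simple")`; the model identifier is not given in this diff, so it is left out here.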