gaodrew committed on
Commit bb9913f · verified · 1 Parent(s): 9894fa5

Update README.md

  1. README.md +46 -10
README.md CHANGED
@@ -4,6 +4,11 @@ license: mit
4
  base_model: microsoft/mdeberta-v3-base
5
  tags:
6
  - generated_from_trainer
7
  metrics:
8
  - precision
9
  - recall
@@ -12,20 +17,51 @@ metrics:
12
  model-index:
13
  - name: piiranha-1
14
  results: []
15
  ---
16
 
17
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
18
- should probably proofread and complete it, then remove this comment. -->
19
 
20
- # piiranha-1
21
 
22
- This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) on an unknown dataset.
23
- It achieves the following results on the evaluation set:
24
  - Loss: 0.0173
25
- - Precision: 0.9316
26
- - Recall: 0.9308
27
- - F1: 0.9312
28
- - Accuracy: 0.9944
29
 
30
  ## Model description
31
 
@@ -104,4 +140,4 @@ The following hyperparameters were used during training:
104
  - Transformers 4.44.2
105
  - Pytorch 2.4.1+cu121
106
  - Datasets 3.0.0
107
- - Tokenizers 0.19.1
 
4
  base_model: microsoft/mdeberta-v3-base
5
  tags:
6
  - generated_from_trainer
7
+ - pii
8
+ - privacy
9
+ - personaldata
10
+ - redaction
11
+ - piidetection
12
  metrics:
13
  - precision
14
  - recall
 
17
  model-index:
18
  - name: piiranha-1
19
  results: []
20
+ datasets:
21
+ - ai4privacy/pii-masking-400k
22
+ language:
23
+ - en
24
+ - it
25
+ - fr
26
+ - de
27
+ - nl
28
+ - es
29
+ pipeline_tag: token-classification
30
  ---
31
 
 
 
32
 
 
33
 
34
+ # piiranha-v1
35
+ Piiranha is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%.
36
+
37
+ Supported languages: English, Spanish, French, German, Italian, Dutch
38
+ Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
39
+
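The card describes detection but not what to do with the detected spans. The sketch below shows one possible redaction step. It assumes spans shaped like the output of the Transformers `token-classification` pipeline with `aggregation_strategy="simple"` (dicts with `entity_group`, `start`, `end`). The `redact` helper, the sample sentence, and the hand-written spans are illustrative and not part of this commit. Whether the model emits exactly these label strings is also an assumption.

```python
from typing import Iterable, Mapping


def redact(text: str, entities: Iterable[Mapping]) -> str:
    """Replace each detected character span with a [LABEL] placeholder."""
    # Process spans right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans standing in for model output (assumption). With the model
# loaded, they would instead come from something like:
#   from transformers import pipeline
#   detector = pipeline("token-classification", model=MODEL_ID,
#                       aggregation_strategy="simple")
#   spans = detector(sample)
# MODEL_ID is a placeholder; the repository id is not stated in this diff.
sample = "Contact Ana at ana@example.com"
spans = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 11},
    {"entity_group": "EMAIL", "start": 15, "end": 30},
]

print(redact(sample, spans))
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution.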
40
+ ACCOUNTNUM 0.84 0.87 0.85 3575
41
+ BUILDINGNUM 0.92 0.90 0.91 3252
42
+ CITY 0.95 0.97 0.96 7270
43
+ CREDITCARDNUMBER 0.94 0.96 0.95 2308
44
+ DATEOFBIRTH 0.93 0.85 0.89 3389
45
+ DRIVERLICENSENUM 0.96 0.96 0.96 2244
46
+ EMAIL 1.00 1.00 1.00 6892
47
+ GIVENNAME 0.87 0.93 0.90 12150
48
+ IDCARDNUM 0.89 0.94 0.91 3700
49
+ PASSWORD 0.98 0.98 0.98 2387
50
+ SOCIALNUM 0.93 0.94 0.93 2709
51
+ STREET 0.97 0.95 0.96 3331
52
+ SURNAME 0.89 0.78 0.83 8267
53
+ TAXNUM 0.97 0.89 0.93 2322
54
+ TELEPHONENUM 0.99 1.00 0.99 5039
55
+ USERNAME 0.98 0.98 0.98 7680
56
+ ZIPCODE 0.94 0.97 0.95 3191
57
+
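The rows above carry no header. A plausible reading, assumed here and not confirmed by the commit, is the usual classification-report layout: label, precision, recall, F1, support. Under that assumption, this sketch parses the rows and ranks labels by recall to show where the model misses the most PII.

```python
# Rows copied verbatim from the table above. The column meanings
# (precision, recall, F1, support) are an assumption.
REPORT = """\
ACCOUNTNUM 0.84 0.87 0.85 3575
BUILDINGNUM 0.92 0.90 0.91 3252
CITY 0.95 0.97 0.96 7270
CREDITCARDNUMBER 0.94 0.96 0.95 2308
DATEOFBIRTH 0.93 0.85 0.89 3389
DRIVERLICENSENUM 0.96 0.96 0.96 2244
EMAIL 1.00 1.00 1.00 6892
GIVENNAME 0.87 0.93 0.90 12150
IDCARDNUM 0.89 0.94 0.91 3700
PASSWORD 0.98 0.98 0.98 2387
SOCIALNUM 0.93 0.94 0.93 2709
STREET 0.97 0.95 0.96 3331
SURNAME 0.89 0.78 0.83 8267
TAXNUM 0.97 0.89 0.93 2322
TELEPHONENUM 0.99 1.00 0.99 5039
USERNAME 0.98 0.98 0.98 7680
ZIPCODE 0.94 0.97 0.95 3191
"""


def parse_report(report: str) -> list[dict]:
    """Turn whitespace-separated report rows into dictionaries."""
    rows = []
    for line in report.splitlines():
        label, precision, recall, f1, support = line.split()
        rows.append({
            "label": label,
            "precision": float(precision),
            "recall": float(recall),
            "f1": float(f1),
            "support": int(support),
        })
    return rows


rows = parse_report(REPORT)

# Lowest recall first: these are the types most often left undetected.
by_recall = sorted(rows, key=lambda r: r["recall"])

# Weight each label's recall by the number of examples it covers.
total_support = sum(r["support"] for r in rows)
weighted_recall = sum(r["recall"] * r["support"] for r in rows) / total_support

for row in by_recall[:3]:
    print(f"{row['label']:<18} recall={row['recall']:.2f} support={row['support']}")
print(f"support-weighted recall: {weighted_recall:.4f}")
```

Read this way, SURNAME has the lowest recall in the table (0.78), so surnames are the type most likely to slip through.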
58
+ It is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
59
+ It achieves the following results on a test set of ~73,000 sentences containing PII:
60
+ - Accuracy: 99.44%
61
  - Loss: 0.0173
62
+ - Precision: 93.16%
63
+ - Recall: 93.08%
64
+ - F1: 93.12%
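As a consistency check on the figures above, F1 is the harmonic mean of precision and recall. Substituting the reported values (written as fractions) reproduces the reported F1:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9316 \times 0.9308}{0.9316 + 0.9308}
    \approx 0.9312
```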
 
65
 
66
  ## Model description
67
 
 
140
  - Transformers 4.44.2
141
  - Pytorch 2.4.1+cu121
142
  - Datasets 3.0.0
143
+ - Tokenizers 0.19.1