File size: 8,959 Bytes
9894fa5
 
f6f0fdb
9894fa5
 
 
bb9913f
 
 
 
 
9894fa5
 
 
 
 
 
 
 
bb9913f
 
 
 
 
 
 
 
 
 
9894fa5
 
07435a5
175657e
87e2e03
 
506726d
f6f0fdb
3b18682
bb9913f
3b18682
926d0da
 
 
3b18682
6afda1a
506726d
a825724
506726d
3b18682
 
87e2e03
28ce2b4
6afda1a
 
 
 
bb9913f
 
9894fa5
bb9913f
 
 
9894fa5
3b18682
9894fa5
3b18682
 
 
 
9894fa5
3b18682
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9894fa5
3b18682
9894fa5
3b18682
9894fa5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5c5deb1
 
 
9894fa5
 
 
 
 
bb9913f
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
library_name: transformers
license: cc-by-nc-nd-4.0
base_model: microsoft/mdeberta-v3-base
tags:
- generated_from_trainer
- pii
- privacy
- personaldata
- redaction
- piidetection
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: piiranha-1
  results: []
datasets:
- ai4privacy/pii-masking-400k
language:
- en
- it
- fr
- de
- nl
- es
pipeline_tag: token-classification
---

# Piiranha-v1: Protect your personal information!
<a target="_blank" href="https://colab.research.google.com/github/williamgao1729/piiranha-quickstart/blob/main/piiranha_quickstart%20(1).ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Piiranha (cc-by-nc-nd-4.0 license) is trained to **detect 17 types** of Personally Identifiable Information (PII) across six languages. It successfully **catches 98.27% of PII** tokens, with an overall classification **accuracy of 99.44%**.
Piiranha is especially accurate at detecting passwords, emails (100%), phone numbers, and usernames.

Performance on PII vs. Non PII classification task:
- **Precision: 98.48%** (98.48% of tokens classified as PII are actually PII)
- **Recall: 98.27%** (correctly identifies 98.27% of PII tokens)
- **Specificity: 99.84%** (correctly identifies 99.84% of Non PII tokens)

<img src="https://cloud-3i4ld6u5y-hack-club-bot.vercel.app/0home.png" alt="Akash Network logo" width="250"/>

Piiranha was trained on H100 GPUs generously sponsored by the [Akash Network](https://akash.network)

## Model Description
Piiranha is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base).
The context length is 256 Deberta tokens. If your text is longer than that, just split it up.

Supported languages: English, Spanish, French, German, Italian, Dutch

Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode. 

It achieves the following results on a test set of ~73,000 sentences containing PII:
- Accuracy: 99.44%
- Loss: 0.0173
- Precision: 93.16%
- Recall: 93.08%
- F1: 93.12%

Note that the above metrics factor in the eighteen possible categories (17 PII and 1 Non PII), so the metrics are lower than the metrics for just PII vs. Non PII (binary classification).

## Performance by PII type
Reported performance metrics are lower than the overall accuracy of 99.44% due to class imbalance (most tokens are not PII).
However, the model is more useful than the below results suggest, due to the intent behind PII detection. The model sometimes misclassifies one PII type for another, but at the end of the day, it still recognizes the token as PII.
For instance, the model often confuses first names for last names, but that's fine because it still flags the name as PII.

| Entity              | Precision | Recall | F1-Score | Support |
|---------------------|-----------|--------|----------|---------|
| ACCOUNTNUM          | 0.84      | 0.87   | 0.85     | 3575    |
| BUILDINGNUM         | 0.92      | 0.90   | 0.91     | 3252    |
| CITY                | 0.95      | 0.97   | 0.96     | 7270    |
| CREDITCARDNUMBER    | 0.94      | 0.96   | 0.95     | 2308    |
| DATEOFBIRTH         | 0.93      | 0.85   | 0.89     | 3389    |
| DRIVERLICENSENUM    | 0.96      | 0.96   | 0.96     | 2244    |
| EMAIL               | 1.00      | 1.00   | 1.00     | 6892    |
| GIVENNAME           | 0.87      | 0.93   | 0.90     | 12150   |
| IDCARDNUM           | 0.89      | 0.94   | 0.91     | 3700    |
| PASSWORD            | 0.98      | 0.98   | 0.98     | 2387    |
| SOCIALNUM           | 0.93      | 0.94   | 0.93     | 2709    |
| STREET              | 0.97      | 0.95   | 0.96     | 3331    |
| SURNAME             | 0.89      | 0.78   | 0.83     | 8267    |
| TAXNUM              | 0.97      | 0.89   | 0.93     | 2322    |
| TELEPHONENUM        | 0.99      | 1.00   | 0.99     | 5039    |
| USERNAME            | 0.98      | 0.98   | 0.98     | 7680    |
| ZIPCODE             | 0.94      | 0.97   | 0.95     | 3191    |
| **micro avg**       | 0.93      | 0.93   | 0.93     | 79706   |
| **macro avg**       | 0.94      | 0.93   | 0.93     | 79706   |
| **weighted avg**    | 0.93      | 0.93   | 0.93     | 79706   |

## Intended uses & limitations

Piiranha can be used to assist with redacting PII from texts. Use at your own risk. We do not accept responsibility for any incorrect model predictions.
## Training and evaluation data

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 5
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:------:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.2984        | 0.0983 | 250  | 0.1005          | 0.5446    | 0.6111 | 0.5759 | 0.9702   |
| 0.0568        | 0.1965 | 500  | 0.0464          | 0.7895    | 0.8459 | 0.8167 | 0.9849   |
| 0.0441        | 0.2948 | 750  | 0.0400          | 0.8346    | 0.8669 | 0.8504 | 0.9869   |
| 0.0368        | 0.3931 | 1000 | 0.0320          | 0.8531    | 0.8784 | 0.8656 | 0.9891   |
| 0.0323        | 0.4914 | 1250 | 0.0293          | 0.8779    | 0.8889 | 0.8834 | 0.9903   |
| 0.0287        | 0.5896 | 1500 | 0.0269          | 0.8919    | 0.8836 | 0.8877 | 0.9907   |
| 0.0282        | 0.6879 | 1750 | 0.0276          | 0.8724    | 0.9012 | 0.8866 | 0.9903   |
| 0.0268        | 0.7862 | 2000 | 0.0254          | 0.8890    | 0.9041 | 0.8965 | 0.9914   |
| 0.0264        | 0.8844 | 2250 | 0.0236          | 0.8886    | 0.9040 | 0.8962 | 0.9915   |
| 0.0243        | 0.9827 | 2500 | 0.0232          | 0.8998    | 0.9033 | 0.9015 | 0.9917   |
| 0.0213        | 1.0810 | 2750 | 0.0237          | 0.9115    | 0.9040 | 0.9077 | 0.9923   |
| 0.0213        | 1.1792 | 3000 | 0.0222          | 0.9123    | 0.9143 | 0.9133 | 0.9925   |
| 0.0217        | 1.2775 | 3250 | 0.0222          | 0.8999    | 0.9169 | 0.9083 | 0.9924   |
| 0.0209        | 1.3758 | 3500 | 0.0212          | 0.9111    | 0.9133 | 0.9122 | 0.9928   |
| 0.0204        | 1.4741 | 3750 | 0.0206          | 0.9054    | 0.9203 | 0.9128 | 0.9926   |
| 0.0183        | 1.5723 | 4000 | 0.0212          | 0.9126    | 0.9160 | 0.9143 | 0.9927   |
| 0.0191        | 1.6706 | 4250 | 0.0192          | 0.9122    | 0.9192 | 0.9157 | 0.9929   |
| 0.0185        | 1.7689 | 4500 | 0.0195          | 0.9200    | 0.9191 | 0.9196 | 0.9932   |
| 0.018         | 1.8671 | 4750 | 0.0188          | 0.9136    | 0.9215 | 0.9176 | 0.9933   |
| 0.0183        | 1.9654 | 5000 | 0.0191          | 0.9179    | 0.9212 | 0.9196 | 0.9934   |
| 0.0147        | 2.0637 | 5250 | 0.0188          | 0.9246    | 0.9242 | 0.9244 | 0.9937   |
| 0.0149        | 2.1619 | 5500 | 0.0184          | 0.9188    | 0.9254 | 0.9221 | 0.9937   |
| 0.0143        | 2.2602 | 5750 | 0.0193          | 0.9187    | 0.9224 | 0.9205 | 0.9932   |
| 0.014         | 2.3585 | 6000 | 0.0190          | 0.9246    | 0.9280 | 0.9263 | 0.9936   |
| 0.0146        | 2.4568 | 6250 | 0.0190          | 0.9225    | 0.9277 | 0.9251 | 0.9936   |
| 0.0148        | 2.5550 | 6500 | 0.0175          | 0.9297    | 0.9306 | 0.9301 | 0.9942   |
| 0.0136        | 2.6533 | 6750 | 0.0172          | 0.9191    | 0.9329 | 0.9259 | 0.9938   |
| 0.0137        | 2.7516 | 7000 | 0.0166          | 0.9299    | 0.9312 | 0.9306 | 0.9942   |
| 0.014         | 2.8498 | 7250 | 0.0167          | 0.9285    | 0.9313 | 0.9299 | 0.9942   |
| 0.0128        | 2.9481 | 7500 | 0.0166          | 0.9271    | 0.9326 | 0.9298 | 0.9943   |
| 0.0113        | 3.0464 | 7750 | 0.0171          | 0.9286    | 0.9347 | 0.9316 | 0.9946   |
| 0.0103        | 3.1447 | 8000 | 0.0172          | 0.9284    | 0.9383 | 0.9334 | 0.9945   |
| 0.0104        | 3.2429 | 8250 | 0.0169          | 0.9312    | 0.9406 | 0.9359 | 0.9947   |
| 0.0094        | 3.3412 | 8500 | 0.0166          | 0.9368    | 0.9359 | 0.9364 | 0.9948   |
| 0.01          | 3.4395 | 8750 | 0.0166          | 0.9289    | 0.9387 | 0.9337 | 0.9944   |
| 0.0099        | 3.5377 | 9000 | 0.0162          | 0.9335    | 0.9332 | 0.9334 | 0.9947   |
| 0.0099        | 3.6360 | 9250 | 0.0160          | 0.9321    | 0.9380 | 0.9350 | 0.9947   |
| 0.01          | 3.7343 | 9500 | 0.0168          | 0.9306    | 0.9389 | 0.9347 | 0.9947   |
| 0.0101        | 3.8325 | 9750 | 0.0159          | 0.9339    | 0.9350 | 0.9344 | 0.9947   |

### Contact
william (at) integrinet [dot] org
 
### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1