---
language: en
license: apache-2.0
base_model: bert-base-uncased
tags:
- generated_from_trainer
- paraphrase-identification
- bert
- glue
- mrpc
metrics:
- accuracy
- f1
datasets:
- glue
model-index:
- name: bert-base-uncased-finetuned-mrpc
  results:
  - task:
      type: text-classification
      name: Paraphrase Identification
    dataset:
      name: GLUE MRPC
      type: glue
      args: mrpc
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.8652
    - name: F1
      type: f1
      value: 0.9057
---

# BERT Fine-tuned on MRPC

This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the MRPC (Microsoft Research Paraphrase Corpus) dataset from the GLUE benchmark. It is designed to determine whether two given sentences are semantically equivalent.

## Model description

The model uses the BERT base architecture (12 layers, 768 hidden dimensions, 12 attention heads) and has been fine-tuned specifically for the paraphrase identification task. The output layer predicts whether the input sentence pair expresses the same meaning.

Key specifications:
- Base model: bert-base-uncased
- Task type: Binary classification (paraphrase/not paraphrase)
- Training method: Fine-tuning all layers
- Language: English
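These specifications can be verified from the published configuration. The snippet below is a minimal sketch; it assumes the repository id used in the usage example at the end of this card.

```python
from transformers import AutoConfig

# Repository id assumed from the usage example later in this card.
config = AutoConfig.from_pretrained("real-jiakai/bert-base-uncased-finetuned-mrpc")

print(config.num_hidden_layers)    # 12 transformer layers
print(config.hidden_size)          # 768 hidden dimensions
print(config.num_attention_heads)  # 12 attention heads
print(config.num_labels)           # 2 classes: not paraphrase / paraphrase
```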
## Intended uses & limitations

### Intended uses
- Paraphrase detection
- Semantic similarity assessment (see the scoring sketch after this section)
- Question duplicate detection
- Content matching
- Automated text comparison

### Limitations
- Works only with English text
- Performance may degrade on out-of-domain text
- May struggle with complex or nuanced semantic relationships
- Limited to comparing pairs of sentences (not longer texts)
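For the semantic-similarity use case above, the softmax probability of the paraphrase class can serve as a graded score rather than a hard label. This is a minimal illustrative sketch; the helper name and repository id are not part of the training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Repository id is assumed to match this model card.
model_id = "real-jiakai/bert-base-uncased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def paraphrase_probability(sentence1: str, sentence2: str) -> float:
    """Return the model's probability that the two sentences are paraphrases."""
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Class 1 corresponds to the "paraphrase" label in the MRPC label scheme.
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(paraphrase_probability("He bought a new car.", "He purchased a brand-new automobile."))
```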
## Training and evaluation data

The model was trained on the Microsoft Research Paraphrase Corpus (MRPC) from the GLUE benchmark:
- Training set: 3,668 sentence pairs
- Validation set: 408 sentence pairs
- Each pair is labeled as either paraphrase (1) or non-paraphrase (0)
- Class distribution: approximately 67.4% positive (paraphrase) and 32.6% negative (non-paraphrase)
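The splits and label scheme described above can be inspected directly with the `datasets` library; a minimal sketch:

```python
from datasets import load_dataset

# Load the MRPC subset of GLUE (the same data used for fine-tuning).
mrpc = load_dataset("glue", "mrpc")

print(mrpc)                                   # train/validation/test splits and their sizes
print(mrpc["train"].features["label"].names)  # ['not_equivalent', 'equivalent']
print(mrpc["train"][0])                       # one sentence pair with its label
```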
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- Learning rate: 3e-05
- Batch size: 8 (train and eval)
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- LR scheduler: Linear decay
- Number of epochs: 3
- Max sequence length: 512
- Weight decay: 0.01
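As a rough guide, these settings map onto `transformers.TrainingArguments` as sketched below; the output directory and evaluation cadence are assumptions, not values recorded on this card.

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; output_dir and eval cadence are assumptions.
training_args = TrainingArguments(
    output_dir="bert-base-uncased-finetuned-mrpc",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    lr_scheduler_type="linear",  # linear decay of the learning rate
    eval_strategy="epoch",       # assumption: evaluate once per epoch, matching the results table
)
# The 512-token maximum applies at tokenization time, e.g.
# tokenizer(s1, s2, truncation=True, max_length=512).
```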
### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| No log        | 1.0   | 459  | 0.3905          | 0.8382   | 0.8878 |
| 0.5385        | 2.0   | 918  | 0.4275          | 0.8505   | 0.8961 |
| 0.3054        | 3.0   | 1377 | 0.5471          | 0.8652   | 0.9057 |

### Framework versions

- Transformers 4.46.2
- PyTorch 2.5.1+cu121
- Datasets 3.1.0
- Tokenizers 0.20.3
## Performance analysis

The model achieves strong performance on the MRPC validation set:
- Accuracy: 86.52%
- F1 Score: 90.57%

These metrics indicate that the model is effective at identifying paraphrases while maintaining a good balance between precision and recall.
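These figures can be re-checked by scoring the MRPC validation split with the `evaluate` library's GLUE metric. The sketch below keeps batching naive for clarity and assumes the repository id used in the usage example.

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "real-jiakai/bert-base-uncased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

validation = load_dataset("glue", "mrpc", split="validation")
metric = evaluate.load("glue", "mrpc")  # reports both accuracy and F1

for example in validation:
    inputs = tokenizer(example["sentence1"], example["sentence2"],
                       return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    metric.add(prediction=pred, reference=example["label"])

print(metric.compute())  # {'accuracy': ..., 'f1': ...}
```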
## Example usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("real-jiakai/bert-base-uncased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("real-jiakai/bert-base-uncased-finetuned-mrpc")
model.eval()

# Classify a sentence pair as paraphrase / not paraphrase
def check_paraphrase(sentence1, sentence2):
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(dim=-1).item()
    return "Paraphrase" if prediction == 1 else "Not paraphrase"

# Example usage
sentence1 = "The cat sat on the mat."
sentence2 = "A cat was sitting on the mat."
result = check_paraphrase(sentence1, sentence2)
print(f"Result: {result}")
```