real-jiakai committed on
Commit
ab0d837
1 Parent(s): 9dfe76b

Update README.md

Files changed (1)
  1. README.md +91 -30
README.md CHANGED
@@ -1,65 +1,126 @@
  ---
- library_name: transformers
  license: apache-2.0
  base_model: bert-base-uncased
  tags:
  - generated_from_trainer
  metrics:
  - accuracy
  - f1
  model-index:
- - name: results
-   results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # results
-
- This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.5471
- - Accuracy: 0.8652
- - F1: 0.9057
 
  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data
 
- More information needed

  ## Training procedure

  ### Training hyperparameters
-
  The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3
 
  ### Training results
-
  | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
  |:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
- | No log | 1.0 | 459 | 0.3905 | 0.8382 | 0.8878 |
- | 0.5385 | 2.0 | 918 | 0.4275 | 0.8505 | 0.8961 |
- | 0.3054 | 3.0 | 1377 | 0.5471 | 0.8652 | 0.9057 |
-
 
  ### Framework versions
-
  - Transformers 4.46.2
- - Pytorch 2.5.1+cu121
  - Datasets 3.1.0
  - Tokenizers 0.20.3
 
  ---
+ language: en
  license: apache-2.0
  base_model: bert-base-uncased
  tags:
  - generated_from_trainer
+ - paraphrase-identification
+ - bert
+ - glue
+ - mrpc
  metrics:
  - accuracy
  - f1
+ datasets:
+ - glue
  model-index:
+ - name: bert-base-uncased-finetuned-mrpc
+   results:
+   - task:
+       type: text-classification
+       name: Paraphrase Identification
+     dataset:
+       name: GLUE MRPC
+       type: glue
+       args: mrpc
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.8652
+     - name: F1
+       type: f1
+       value: 0.9057
  ---
 
+ # BERT Fine-tuned on MRPC

+ This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the MRPC (Microsoft Research Paraphrase Corpus) dataset from the GLUE benchmark. It is designed to determine whether two given sentences are semantically equivalent.

  ## Model description

+ The model uses the BERT base architecture (12 layers, 768 hidden dimensions, 12 attention heads) and has been fine-tuned specifically for the paraphrase identification task. The output layer predicts whether the input sentence pair expresses the same meaning.
+
+ Key specifications:
+ - Base model: bert-base-uncased
+ - Task type: Binary classification (paraphrase/not paraphrase)
+ - Training method: Fine-tuning all layers
+ - Language: English
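
As a rough cross-check of the architecture figures above, the sketch below counts the parameters of a BERT-base encoder with a two-class head. The vocabulary size (30,522), maximum position count (512), two token-type embeddings, and feed-forward width (3,072) are the standard `bert-base-uncased` configuration values rather than numbers stated in this card, and the function name is illustrative.

```python
def bert_classifier_param_count(
    layers=12, hidden=768, intermediate=3072,
    vocab=30522, max_positions=512, type_vocab=2, num_labels=2,
):
    """Count weights and biases of a BERT encoder plus a linear classifier head."""
    layer_norm = 2 * hidden  # one scale and one shift vector
    # Word, position, and token-type embeddings, followed by a LayerNorm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


total = bert_classifier_param_count()
head_dim = 768 // 12  # width of each of the 12 attention heads
print(f"{total:,} parameters, {head_dim} dimensions per head")
```

Under these assumptions the count comes to roughly 109.5 million, in line with the commonly quoted figure of about 110M for BERT base; the classification head itself contributes only 1,538 of them.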
 
  ## Intended uses & limitations

+ ### Intended uses
+ - Paraphrase detection
+ - Semantic similarity assessment
+ - Question duplicate detection
+ - Content matching
+ - Automated text comparison
+
+ ### Limitations
+ - Only works with English text
+ - Performance may degrade on out-of-domain text
+ - May struggle with complex or nuanced semantic relationships
+ - Limited to comparing pairs of sentences (not longer texts)

  ## Training and evaluation data
 
+ The model was trained and evaluated on the Microsoft Research Paraphrase Corpus (MRPC) from the GLUE benchmark:
+ - Training set: 3,668 sentence pairs
+ - Validation set: 408 sentence pairs
+ - Each pair is labeled as either paraphrase (1) or non-paraphrase (0)
+ - Class distribution: approximately 67.4% positive (paraphrase) and 32.6% negative (non-paraphrase)
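
Because about two thirds of the pairs are paraphrases, a trivial classifier already scores well on this data, which is useful context for the reported metrics. The sketch below computes the accuracy and F1 of a baseline that labels every pair as a paraphrase. It assumes the evaluation data has the same approximate 67.4% positive rate quoted above; the function name is illustrative.

```python
def always_positive_baseline(positive_rate):
    """Accuracy and F1 of a classifier that labels every pair as a paraphrase."""
    precision = positive_rate  # every prediction is positive, so precision equals the base rate
    recall = 1.0               # no true paraphrase is ever missed
    accuracy = positive_rate   # correct exactly on the positive pairs
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


baseline_accuracy, baseline_f1 = always_positive_baseline(0.674)
print(f"baseline accuracy={baseline_accuracy:.3f}, baseline F1={baseline_f1:.3f}")
```

Under that assumption the baseline reaches an accuracy of 0.674 and an F1 of about 0.805, so the fine-tuned model's scores are best read as gains over those values rather than over zero.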
 
  ## Training procedure

  ### Training hyperparameters
  The following hyperparameters were used during training:
+ - Learning rate: 3e-05
+ - Batch size: 8 (train and eval)
+ - Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
+ - LR scheduler: Linear decay
+ - Number of epochs: 3
+ - Max sequence length: 512
+ - Weight decay: 0.01
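
To make the linear decay schedule concrete, here is a minimal sketch of the learning rate as a function of the optimizer step. The total of 1,377 steps is taken from the training results table in this card; zero warmup steps is an assumption, since the card does not list a warmup setting, and the function name is illustrative rather than a library API.

```python
def linear_decay_lr(step, base_lr=3e-5, total_steps=1377, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup (unused when warmup_steps=0)
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start of training and at the end of each epoch
for step in (0, 459, 918, 1377):
    print(step, f"{linear_decay_lr(step):.2e}")
```

With no warmup, the rate falls to two thirds of its initial value by the end of the first epoch, one third by the end of the second, and zero at the final step.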
 
  ### Training results
  | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 |
  |:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
+ | No log | 1.0 | 459 | 0.3905 | 0.8382 | 0.8878 |
+ | 0.5385 | 2.0 | 918 | 0.4275 | 0.8505 | 0.8961 |
+ | 0.3054 | 3.0 | 1377 | 0.5471 | 0.8652 | 0.9057 |
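
The step counts in the table can be cross-checked against the batch size of 8 listed under the hyperparameters. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these settings is stated in the card, and the helper names are illustrative.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def dataset_size_range(steps, batch_size):
    """Smallest and largest dataset sizes that yield `steps` batches per epoch."""
    return (steps - 1) * batch_size + 1, steps * batch_size


low, high = dataset_size_range(459, 8)
total_steps = 459 * 3  # three epochs
print(low, high, total_steps)
```

Under those assumptions, 459 steps per epoch corresponds to a training set of between 3,665 and 3,672 pairs, which agrees with the MRPC training split size, and three epochs give the 1,377 total steps in the last row.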
 
 
  ### Framework versions
  - Transformers 4.46.2
+ - PyTorch 2.5.1+cu121
  - Datasets 3.1.0
  - Tokenizers 0.20.3
+
+ ## Performance analysis
+
+ After the third epoch, the model achieves the following scores on the MRPC validation set:
+ - Accuracy: 86.52%
+ - F1 Score: 90.57%
+
+ These metrics indicate that the model is effective at identifying paraphrases; the F1 score reflects both precision and recall on the positive (paraphrase) class.
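
For a sense of scale, the reported accuracy can be converted back into a count of validation pairs. The sketch below uses the 408-pair validation set size given in the data section and assumes the accuracy was computed over that full set; the function name is illustrative.

```python
def correct_count(accuracy, num_examples):
    """Number of correct predictions implied by a reported accuracy."""
    return round(accuracy * num_examples)


n_val = 408
n_correct = correct_count(0.8652, n_val)
n_wrong = n_val - n_correct
print(n_correct, n_wrong, round(n_correct / n_val, 4))
```

On that assumption, an accuracy of 86.52% corresponds to 353 correct predictions and 55 errors out of 408 pairs.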
+
+ ## Example usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("real-jiakai/bert-base-uncased-finetuned-mrpc")
+ model = AutoModelForSequenceClassification.from_pretrained("real-jiakai/bert-base-uncased-finetuned-mrpc")
+
+ # Example function
+ def check_paraphrase(sentence1, sentence2):
+     inputs = tokenizer(sentence1, sentence2, return_tensors="pt", padding=True, truncation=True)
+     outputs = model(**inputs)
+     prediction = outputs.logits.argmax().item()
+     return "Paraphrase" if prediction == 1 else "Not paraphrase"
+
+ # Example usage
+ sentence1 = "The cat sat on the mat."
+ sentence2 = "A cat was sitting on the mat."
+ result = check_paraphrase(sentence1, sentence2)
+ print(f"Result: {result}")
+ ```