xiaowenbin committed
Commit · 128ae92
1 Parent(s): d4009b0
init commit
- .gitattributes +1 -0
- 1_Pooling/config.json +7 -0
- README.md +106 -0
- config.json +31 -0
- config_sentence_transformers.json +7 -0
- modules.json +14 -0
- pytorch_model.bin +3 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
.gitattributes
CHANGED
@@ -26,3 +26,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json
ADDED
@@ -0,0 +1,7 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
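The flags in `1_Pooling/config.json` switch on mean pooling over token embeddings and leave the other modes off. As a rough, dependency-free illustration (not the sentence-transformers implementation), the sketch below reads those flags and averages only the token vectors whose attention-mask entry is 1. The helper names `active_pooling_modes` and `masked_mean_pool` and the toy inputs are made up for this example.

```python
import json

# Pooling flags copied from 1_Pooling/config.json in this commit.
POOLING_CONFIG = json.loads("""
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
""")


def active_pooling_modes(config):
    """Return the names of the pooling modes that are switched on."""
    return [key for key, value in config.items()
            if key.startswith("pooling_mode_") and value is True]


def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input, mirroring the clamp in the README code.
    kept = max(kept, 1e-9)
    return [total / kept for total in totals]


modes = active_pooling_modes(POOLING_CONFIG)
# Toy input: three 2-dimensional "token embeddings", the last one is padding.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(modes, pooled)
```

The padded third vector does not affect the result, which is the point of masking before averaging.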
README.md
ADDED
@@ -0,0 +1,106 @@
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- semantic-search
- chinese
---
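
# DMetaSoul/sbert-chinese-qmc-domain-v1

This model is a distilled, lightweight version (4 BERT layers only) of our previously open-sourced [question matching model](https://huggingface.co/DMetaSoul/sbert-chinese-qmc-domain-v1). It is suited to **open-domain question matching**, for example:

- 洗澡用什么香皂好?vs. 洗澡用什么香皂好 (Which soap is good for bathing?)
- 大连哪里拍婚纱照好点? vs. 大连哪里拍婚纱照比较好 (Where in Dalian is a good place to take wedding photos?)
- 银行卡怎样挂失?vs. 银行卡丢了怎么挂失啊? (How do I report a bank card as lost? vs. I lost my bank card, how do I report it?)

A large model trained offline places heavy demands on compute when it is served directly for online inference, and it struggles to meet production requirements for latency and throughput. We therefore use distillation to make the large model lightweight. Distilling the 12-layer BERT down to 4 layers shrinks the parameter count to 44% of the original, roughly halves latency, roughly doubles throughput, and costs about 4 percentage points of accuracy (see the Evaluation section below for details).
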
# Usage

## 1. Sentence-Transformers

To use this model through the [sentence-transformers](https://www.SBERT.net) framework, first install it:

```
pip install -U sentence-transformers
```

Then use the following code to load the model and extract text embedding vectors:

```python
from sentence_transformers import SentenceTransformer
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

model = SentenceTransformer('DMetaSoul/sbert-chinese-qmc-domain-v1')
embeddings = model.encode(sentences)
print(embeddings)
```

## 2. HuggingFace Transformers

If you would rather not use [sentence-transformers](https://www.SBERT.net), you can also load the model through HuggingFace Transformers and extract text vectors:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-qmc-domain-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-qmc-domain-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
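Both usage snippets stop at printing the embeddings. For question matching, a common next step is to score a sentence pair by cosine similarity. The README does not say which scoring function the authors use, so treat that choice as an assumption. The sketch below is dependency-free and uses short made-up vectors in place of real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy stand-ins for sentence embeddings (hypothetical values).
emb_question_a = [0.2, 0.1, 0.7]
emb_question_b = [0.4, 0.2, 1.4]    # same direction as emb_question_a
emb_unrelated = [0.7, -0.2, -0.1]

score_same = cosine_similarity(emb_question_a, emb_question_b)
score_diff = cosine_similarity(emb_question_a, emb_unrelated)
print(round(score_same, 4), round(score_diff, 4))
```

With real model output, each input list would be one row of the embedding array converted with `.tolist()`.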

## Evaluation

Here we mainly compare against the corresponding teacher model before distillation.

*Performance:*

|            | Teacher               | Student             | Gap   |
| ---------- | --------------------- | ------------------- | ----- |
| Model      | BERT-12-layers (102M) | BERT-4-layers (45M) | 0.44x |
| Cost       | 23s                   | 12s                 | -47%  |
| Latency    | 38ms                  | 20ms                | -47%  |
| Throughput | 421 sentences/s       | 791 sentences/s     | 1.9x  |

*Accuracy:*

|                | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** | **Avg** |
| -------------- | ------------ | ------------- | --------- | --------- | ------------ | --------- | ---------- | ------- |
| **Teacher**    | 80.90%       | 76.62%        | 34.51%    | 77.05%    | 52.95%       | 12.97%    | 59.47%     | 56.35%  |
| **Student**    | 79.89%       | 76.34%        | 27.59%    | 69.26%    | 49.40%       | 9.06%     | 53.52%     | 52.15%  |
| **Gap** (abs.) | -            | -             | -         | -         | -            | -         | -          | -4.2%   |

*Tested on 10,000 samples on a V100 GPU, with batch_size=16 and max_seq_len=256.*
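The Gap entries can be rechecked from the Teacher and Student figures alone. The sketch below does that arithmetic, using only numbers copied from the two tables. The variable and function names are illustrative. The per-dataset differences it computes are derived values, not figures reported in the README.

```python
# Figures copied from the performance table.
teacher = {"params_m": 102, "cost_s": 23, "latency_ms": 38, "throughput": 421}
student = {"params_m": 45, "cost_s": 12, "latency_ms": 20, "throughput": 791}

# Figures copied from the accuracy table (percent).
teacher_acc = {"csts_dev": 80.90, "csts_test": 76.62, "afqmc": 34.51, "lcqmc": 77.05,
               "bqcorpus": 52.95, "pawsx": 12.97, "xiaobu": 59.47}
student_acc = {"csts_dev": 79.89, "csts_test": 76.34, "afqmc": 27.59, "lcqmc": 69.26,
               "bqcorpus": 49.40, "pawsx": 9.06, "xiaobu": 53.52}


def relative_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


size_ratio = student["params_m"] / teacher["params_m"]
cost_change = relative_change(student["cost_s"], teacher["cost_s"])
latency_change = relative_change(student["latency_ms"], teacher["latency_ms"])
throughput_ratio = student["throughput"] / teacher["throughput"]

teacher_avg = sum(teacher_acc.values()) / len(teacher_acc)
student_avg = sum(student_acc.values()) / len(student_acc)
avg_gap = student_avg - teacher_avg

# Derived here, not reported in the README: absolute gap per dataset.
per_task_gap = {name: round(student_acc[name] - teacher_acc[name], 2)
                for name in teacher_acc}

print(f"size {size_ratio:.2f}x, cost {cost_change:.1f}%, latency {latency_change:.1f}%, "
      f"throughput {throughput_ratio:.1f}x")
print(f"avg {teacher_avg:.2f} -> {student_avg:.2f} ({avg_gap:+.1f} points)")
print(per_task_gap)
```

The averages reproduce 56.35% and 52.15%. The cost and latency changes come out near -47.8% and -47.4%, so the table's -47% appears to be truncated rather than rounded.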

## Citing & Authors

E-mail: [email protected]
config.json
ADDED
@@ -0,0 +1,31 @@
{
  "_name_or_path": "releases/sbert-chinese-qmc-domain-v1-distill/",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.16.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}
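The sizes quoted in the README (45M for the student, 102M for the teacher) can be roughly cross-checked against this config. The sketch below counts the weights of a standard BERT encoder with a pooler head. It assumes the usual BERT layout and assumes the teacher differs from this config only in having 12 layers. Neither assumption is stated in the commit, and `bert_parameter_count` is a name invented for the example.

```python
# Values copied from config.json in this commit.
CONFIG = {
    "vocab_size": 21128,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 4,
}


def bert_parameter_count(cfg, num_layers=None):
    """Count weights of a standard BERT encoder plus pooler (assumed layout)."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layers = cfg["num_hidden_layers"] if num_layers is None else num_layers

    layer_norm = 2 * h  # scale and bias
    # Word, position and token-type embedding tables, then one LayerNorm.
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * h + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers of the feed-forward block, then LayerNorm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + layers * (attention + feed_forward) + pooler


student_params = bert_parameter_count(CONFIG)
# Assumption: the teacher uses the same sizes with 12 layers.
teacher_params = bert_parameter_count(CONFIG, num_layers=12)
print(f"{student_params / 1e6:.1f}M vs {teacher_params / 1e6:.1f}M")
```

Under these assumptions the counts come out at about 45.6M and 102.3M, in line with the figures in the README. Most of the remaining student weights sit in the embedding table, which distillation to fewer layers does not shrink.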
config_sentence_transformers.json
ADDED
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.1.0",
    "transformers": "4.16.0",
    "pytorch": "1.10.2"
  }
}
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
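`modules.json` lists the stages of the sentence-transformers pipeline in order: a Transformer encoder followed by a Pooling layer. The sketch below only parses the file content and prints that chain. It does not load any model. Reading an empty `path` as the repository root is an assumption about how the library resolves it, and `module_chain` is a name made up for the example.

```python
import json

# Content copied from modules.json in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def module_chain(modules_json):
    """Return (class name, folder) pairs in pipeline order."""
    modules = sorted(json.loads(modules_json), key=lambda m: m["idx"])
    # Assumption: an empty path means the module's files sit at the repository root.
    return [(m["type"].rsplit(".", 1)[-1], m["path"] or ".") for m in modules]


chain = module_chain(MODULES_JSON)
for class_name, folder in chain:
    print(f"{class_name:<12} loaded from {folder}")
```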
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ad505e52d15c2e6c396da1e3ff39c4707368d1c9e91a5a8ed18c18a46e590b24
size 182288973
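The three lines committed for `pytorch_model.bin` are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer text and turns the recorded size into a rough parameter estimate, assuming 4 bytes per weight (`torch_dtype` is `float32` in `config.json`) and ignoring serialization overhead. `parse_lfs_pointer` is a hypothetical helper written for this example, not part of any Git LFS tooling.

```python
# Pointer content copied from this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ad505e52d15c2e6c396da1e3ff39c4707368d1c9e91a5a8ed18c18a46e590b24
size 182288973
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
# Assumption: float32 weights, so roughly 4 bytes per parameter.
approx_params_m = pointer["size_bytes"] / 4 / 1e6
print(pointer["hash_algorithm"], pointer["size_bytes"], f"~{approx_params_m:.1f}M parameters")
```

The estimate lands near 45.6M, consistent with the 45M student size quoted in the README.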
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 256,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "releases/sbert-chinese-qmc-domain-v1-distill/", "tokenizer_class": "BertTokenizer"}
vocab.txt
ADDED
The diff for this file is too large to render.