piotr-rybak committed on
Commit 1d072d6 · 1 Parent(s): 899471f

init commit

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
README.md ADDED
@@ -0,0 +1,150 @@
---
pipeline_tag: sentence-similarity
language:
- pl
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- ipipan/polqa
- ipipan/maupqa
license: cc-by-sa-4.0
widget:
- source_sentence: "Pytanie: W jakim mieście urodził się Zbigniew Herbert?"
  sentences:
  - "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg."
  - "Zbigniew Herbert</s>Lato 1968 Herbert spędził w USA (na zaproszenie Poetry Center)."
  - "Herbert George Wells</s>Herbert George Wells (ur. 21 września 1866 w Bromley, zm. 13 sierpnia 1946 w Londynie) – brytyjski pisarz i biolog."
  example_title: "Zbigniew Herbert"
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5eb2c5ef4e876668a0c3779e/j2JE7_VnbRifCmV7_4BP9.png)

# Silver Retriever Base (v1.1)

The Silver Retriever model encodes Polish sentences or paragraphs into a 768-dimensional dense vector space and can be used for tasks like document retrieval or semantic search.

It was initialized from the [HerBERT-base](https://huggingface.co/allegro/herbert-base-cased) model and fine-tuned on the [PolQA](https://huggingface.co/datasets/ipipan/polqa) and [MAUPQA](https://huggingface.co/datasets/ipipan/maupqa) datasets for 8,000 steps with a batch size of 8,192. Please refer to [SilverRetriever: Advancing Neural Passage Retrieval for Polish Question Answering](https://arxiv.org/abs/2309.08469) for more details.

## Evaluation

| **Model** | **Average [Acc]** | **Average [NDCG]** | [**PolQA**](https://huggingface.co/datasets/ipipan/polqa) **[Acc]** | [**PolQA**](https://huggingface.co/datasets/ipipan/polqa) **[NDCG]** | [**Allegro FAQ**](https://huggingface.co/datasets/piotr-rybak/allegro-faq) **[Acc]** | [**Allegro FAQ**](https://huggingface.co/datasets/piotr-rybak/allegro-faq) **[NDCG]** | [**Legal Questions**](https://huggingface.co/datasets/piotr-rybak/legal-questions) **[Acc]** | [**Legal Questions**](https://huggingface.co/datasets/piotr-rybak/legal-questions) **[NDCG]** |
|--------------------:|------------:|-------------:|------------:|-------------:|------------:|-------------:|------------:|-------------:|
| BM25 | 74.87 | 51.81 | 61.35 | 24.51 | 66.89 | 48.71 | 96.38 | **82.21** |
| BM25 (lemma) | 80.46 | 55.44 | 71.49 | 31.97 | 75.33 | 55.70 | 94.57 | 78.65 |
| [MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 62.62 | 39.21 | 37.24 | 11.93 | 71.67 | 51.25 | 78.97 | 54.44 |
| [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 64.89 | 39.47 | 46.23 | 15.53 | 67.11 | 46.71 | 81.34 | 56.16 |
| [mContriever-Base](https://huggingface.co/nthakur/mcontriever-base-msmarco) | 86.31 | 60.37 | 78.66 | 36.30 | 84.44 | 67.38 | 95.82 | 77.42 |
| [E5-Base](https://huggingface.co/intfloat/multilingual-e5-base) | 91.58 | 66.56 | 86.61 | **46.08** | 91.89 | 75.90 | 96.24 | 77.69 |
| [ST-DistilRoBERTa](https://huggingface.co/sdadas/st-polish-paraphrase-from-distilroberta) | 73.78 | 48.29 | 48.43 | 16.73 | 84.89 | 64.39 | 88.02 | 63.76 |
| [ST-MPNet](https://huggingface.co/sdadas/st-polish-paraphrase-from-mpnet) | 76.66 | 49.99 | 56.80 | 21.55 | 86.00 | 65.44 | 87.19 | 62.99 |
| [HerBERT-QA](https://huggingface.co/ipipan/herbert-base-qa-v1) | 84.23 | 54.36 | 75.84 | 32.52 | 85.78 | 63.58 | 91.09 | 66.99 |
| [Silver Retriever v1](https://huggingface.co/ipipan/silver-retriever-base-v1) | 92.45 | 66.72 | 87.24 | 43.40 | **94.56** | 79.66 | 95.54 | 77.10 |
| [Silver Retriever v1.1](https://huggingface.co/ipipan/silver-retriever-base-v1.1) | **93.18** | **67.55** | **88.60** | 44.88 | 94.00 | **79.83** | **96.94** | 77.95 |

Legend:
- **Acc** is the Accuracy at 10
- **NDCG** is the Normalized Discounted Cumulative Gain at 10

## Usage

### Preparing inputs

The model was trained on question-passage pairs and works best when the input matches the format used during training:
- We added the phrase `Pytanie:` to the beginning of the question.
- The training passages consisted of `title` and `text` concatenated with the special token `</s>`. Even if your passages don't have a `title`, it is still beneficial to prefix a passage with the `</s>` token.
- Although we used the dot product during training, the model usually works better with the cosine distance.
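The formatting conventions above can be captured in two small helpers (the function names are ours, for illustration only):

```python
def format_question(question: str) -> str:
    # Queries were prefixed with "Pytanie: " during training.
    return f"Pytanie: {question}"


def format_passage(text: str, title: str = "") -> str:
    # Passages were built as title + "</s>" + text; keep the "</s>"
    # separator even when no title is available.
    return f"{title}</s>{text}"


print(format_question("W jakim mieście urodził się Zbigniew Herbert?"))
# Pytanie: W jakim mieście urodził się Zbigniew Herbert?
```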

### Inference with Sentence-Transformers

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "Pytanie: W jakim mieście urodził się Zbigniew Herbert?",
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
]

model = SentenceTransformer('ipipan/silver-retriever-base-v1.1')
embeddings = model.encode(sentences)
print(embeddings)
```
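Once questions and passages are embedded, retrieval is just ranking passages by similarity to the question; as noted above, the cosine distance usually works best with this model. A minimal ranking sketch with numpy, using random placeholder vectors in place of real `model.encode(...)` output:

```python
import numpy as np


def rank_passages(query_emb, passage_embs):
    # Normalize so the dot product equals cosine similarity,
    # then sort passages from most to least similar.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores)


rng = np.random.default_rng(0)
query = rng.normal(size=768)          # placeholder for model.encode(question)
passages = rng.normal(size=(3, 768))  # placeholder for model.encode(passage_list)
order = rank_passages(query, passages)
print(order)  # passage indices, best match first
```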

### Inference with HuggingFace Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = [
    "Pytanie: W jakim mieście urodził się Zbigniew Herbert?",
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1.1')
model = AutoModel.from_pretrained('ipipan/silver-retriever-base-v1.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
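The `cls_pooling` helper above simply selects the hidden state of the first token (the `<s>`/CLS position) for each sequence in the batch. This can be checked on a dummy tensor, without downloading the model:

```python
import torch


def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state with shape
    # (batch, seq_len, hidden); take the first token of every sequence.
    return model_output[0][:, 0]


# Dummy "model output": batch of 2 sequences, 3 tokens, hidden size 4.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
pooled = cls_pooling((hidden,), attention_mask=None)
print(pooled.shape)  # torch.Size([2, 4])
```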

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Additional Information

### Model Creators

The model was created by Piotr Rybak from the [Institute of Computer Science, Polish Academy of Sciences](http://zil.ipipan.waw.pl/).

This work was supported by the European Regional Development Fund as part of the 2014–2020 Smart Growth Operational Programme, CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

### Licensing Information

CC BY-SA 4.0

### Citation Information

```
@misc{rybak2023silverretriever,
      title={SilverRetriever: Advancing Neural Passage Retrieval for Polish Question Answering},
      author={Piotr Rybak and Maciej Ogrodniczuk},
      year={2023},
      eprint={2309.08469},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
config.json ADDED
@@ -0,0 +1,32 @@
{
  "_name_or_path": "../../../sellaservice/rt/model_all_filter5p_bs256_lr2e5_40k/",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "HerbertTokenizerFast",
  "torch_dtype": "float32",
  "transformers_version": "4.30.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50000
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.2.2",
    "transformers": "4.30.1",
    "pytorch": "2.0.1"
  }
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c2393cdcf4e284fbfe7a038fc4b550168ce6a2024d8a7693ed92b9996d7c43e2
size 497843178
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,8 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
{
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "do_lowercase_and_remove_accent": false,
  "id2lang": null,
  "lang2id": null,
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "HerbertTokenizer",
  "unk_token": "<unk>"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff