piotr-rybak committed on
Commit
3a32c99
·
1 Parent(s): 878f739

Update README.md

Files changed (1)
  1. README.md +15 -9
README.md CHANGED
@@ -12,13 +12,19 @@ datasets:
12
  - ipipan/maupqa
13
  ---
14
 
15
- # HerBERT-base Retrieval (v2)
16
 
17
- The HerBERT Retrieval model encodes Polish sentences or paragraphs into a 768-dimensional dense vector space and can be used for tasks like document retrieval or semantic search.
18
 
19
- It was initialized from the [HerBERT-base](https://huggingface.co/allegro/herbert-base-cased) model and fine-tuned on the [PolQA](https://huggingface.co/ipipan/polqa) and [MAUPQA](https://huggingface.co/ipipan/maupqa) datasets for 40,000 steps with a batch size of 256.
 
 
 
 
 
 
 
20
 
21
- The model was trained on question-passage pairs and works best on similar tasks. The training passages consisted of `title` and `text` concatenated with the special token `</s>`. Even if your passages don't have a `title`, it is still beneficial to prefix a passage `text` with the `</s>` token.
22
  ## Usage (Sentence-Transformers)
23
 
24
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -32,11 +38,11 @@ Then you can use the model like this:
32
  ```python
33
  from sentence_transformers import SentenceTransformer
34
  sentences = [
35
- "W jakim mieście urodził się Zbigniew Herbert?",
36
  "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
37
  ]
38
 
39
- model = SentenceTransformer('ipipan/herbert-base-retrieval-v2')
40
  embeddings = model.encode(sentences)
41
  print(embeddings)
42
  ```
@@ -55,12 +61,12 @@ def cls_pooling(model_output, attention_mask):
55
 
56
  # Sentences we want sentence embeddings for
57
  sentences = [
58
- "W jakim mieście urodził się Zbigniew Herbert?",
59
  "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
60
  ]
61
  # Load model from HuggingFace Hub
62
- tokenizer = AutoTokenizer.from_pretrained('ipipan/herbert-base-retrieval-v2')
63
- model = AutoModel.from_pretrained('ipipan/herbert-base-retrieval-v2')
64
 
65
  # Tokenize sentences
66
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
12
  - ipipan/maupqa
13
  ---
14
 
15
+ # Silver Retriever Base (v1)
16
 
17
+ The Silver Retriever model encodes Polish sentences or paragraphs into a 768-dimensional dense vector space and can be used for tasks like document retrieval or semantic search.
18
 
19
+ It was initialized from the [HerBERT-base](https://huggingface.co/allegro/herbert-base-cased) model and fine-tuned on the [PolQA](https://huggingface.co/ipipan/polqa) and [MAUPQA](https://huggingface.co/ipipan/maupqa) datasets for 15,000 steps with a batch size of 1,024.
20
+
21
+ ## Preparing inputs
22
+
23
+ The model was trained on question-passage pairs and works best when the input follows the same format as in training:
24
+ - We added the phrase `Pytanie:` (Polish for "Question:") to the beginning of each question.
25
+ - The training passages consisted of `title` and `text` concatenated with the special token `</s>`. Even if your passages don't have a `title`, it is still beneficial to prefix a passage with the `</s>` token.
26
+ - Although we used the dot product during training, the model usually works better when passages are ranked by cosine similarity.
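
The formatting rules listed above can be captured in two small helpers. This is a minimal sketch, not code from the model authors: the names `format_question`, `format_passage`, `QUESTION_PREFIX`, and `TITLE_SEPARATOR` are hypothetical, and the single space after `Pytanie:` is an assumption taken from the usage example in this README.

```python
# Assumed constants, named here only for illustration.
QUESTION_PREFIX = "Pytanie: "  # phrase added to the beginning of each question
TITLE_SEPARATOR = "</s>"       # special token that joins `title` and `text`


def format_question(question: str) -> str:
    """Prepend the question prefix (hypothetical helper)."""
    return QUESTION_PREFIX + question.strip()


def format_passage(text: str, title: str = "") -> str:
    """Join title and text with `</s>`; without a title the token is still kept."""
    return f"{title.strip()}{TITLE_SEPARATOR}{text.strip()}"


question = format_question("W jakim mieście urodził się Zbigniew Herbert?")
passage = format_passage(
    "Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, "
    "zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
    title="Zbigniew Herbert",
)
untitled = format_passage("Zbigniew Herbert urodził się we Lwowie.")
```

The resulting strings can be passed directly to `model.encode` or to the tokenizer in the usage sections.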
27
 
 
28
  ## Usage (Sentence-Transformers)
29
 
30
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
38
  ```python
39
  from sentence_transformers import SentenceTransformer
40
  sentences = [
41
+ "Pytanie: W jakim mieście urodził się Zbigniew Herbert?",
42
  "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
43
  ]
44
 
45
+ model = SentenceTransformer('ipipan/silver-retriever-base-v1')
46
  embeddings = model.encode(sentences)
47
  print(embeddings)
48
  ```
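
The embeddings returned by `model.encode` can then be compared to rank passages for a question. The sketch below is an illustration in plain Python, not the authors' evaluation code: `cosine_similarity` and `rank_passages` are hypothetical helpers, and the short vectors are made-up stand-ins for real 768-dimensional embeddings. It scores with cosine similarity rather than the dot product, following the input notes in this README.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs sorted from best to worst match."""
    scores = [
        (i, cosine_similarity(question_vec, p)) for i, p in enumerate(passage_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for illustration (not real model outputs).
question_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the question
    [2.0, 0.0, 2.0],  # same direction as the question, larger norm
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_passages(question_vec, passage_vecs)
best_index, best_score = ranking[0]
```

Because cosine similarity ignores vector length, the second passage scores highest here even though its norm differs from the question's. With real embeddings, `sentence_transformers.util.cos_sim` computes the same quantity on tensors.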
 
61
 
62
  # Sentences we want sentence embeddings for
63
  sentences = [
64
+ "Pytanie: W jakim mieście urodził się Zbigniew Herbert?",
65
  "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg.",
66
  ]
67
  # Load model from HuggingFace Hub
68
+ tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
69
+ model = AutoModel.from_pretrained('ipipan/silver-retriever-base-v1')
70
 
71
  # Tokenize sentences
72
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')