+---
+# Model Card for FRIDA
+FRIDA is a general-purpose text embedding model for Russian. It is based on the encoder of [FRED-T5](https://huggingface.co/ai-forever/FRED-T5-1.7B). The model was pre-trained on a Russian-English dataset and then fine-tuned for improved performance on the target task.
+For more details about the model, please refer to our [article](TODO).
+## Usage
+The model can be used as is, with a task prefix prepended to each input text. CLS pooling is recommended. The choice of prefix and pooling method depends on the task.
+We use the following basic rules to choose a prefix:
+- `"search_query: "` and `"search_document: "` prefixes are for retrieving an answer or a relevant paragraph
+- `"paraphrase: "` prefix is for symmetric paraphrase-related tasks (STS, paraphrase mining, deduplication)
+- `"categorize: "` prefix is for asymmetric matching of a document title and body (e.g. news, scientific papers, social media posts)
+- `"categorize_sentiment: "` prefix is for tasks that rely on sentiment features (e.g. hate speech, toxicity, emotion)
+- `"categorize_topic: "` prefix is for tasks where you need to group texts by topic
+- `"categorize_entailment: "` prefix is for textual entailment tasks (NLI)
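The prefix rules above can be captured in a small lookup helper. This is a minimal sketch, not part of the model's API: the task names used as dictionary keys and the `add_prefix` function are hypothetical, and only the prefix strings themselves come from the list above.

```python
# Hypothetical helper: the dictionary keys are illustrative task names;
# only the prefix strings come from the model card.
TASK_PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "paraphrase": "paraphrase: ",
    "title_body": "categorize: ",
    "sentiment": "categorize_sentiment: ",
    "topic": "categorize_topic: ",
    "entailment": "categorize_entailment: ",
}


def add_prefix(texts, task):
    """Prepend the prefix for `task` to every text in `texts`."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}")
    prefix = TASK_PREFIXES[task]
    return [prefix + text for text in texts]


queries = add_prefix(["Сколько программистов нужно, чтобы вкрутить лампочку?"], "query")
print(queries[0])
```

The resulting strings can be passed directly to the tokenizer or to `model.encode`. Failing loudly on an unknown task avoids silently encoding texts without a prefix.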
+To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.
+Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.
+### Transformers
+```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, T5EncoderModel
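+def pool(hidden_state, mask, pooling_method="cls"):
+    # Reduce per-token hidden states to one embedding per input text.
+    if pooling_method == "mean":
+        # Average over real tokens only; padding positions are masked out.
+        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
+        d = mask.sum(dim=1, keepdim=True).float()
+        return s / d
+    elif pooling_method == "cls":
+        # Use the hidden state of the first token.
+        return hidden_state[:, 0]
+    raise ValueError(f"Unknown pooling method: {pooling_method}")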
+inputs = [
+    # first text of each pair
+    "paraphrase: Он нам и <unk> не нужон ваш Интернет!",
+    "categorize_entailment: В Ярославской области разрешили работу бань, но без посетителей",
+    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
+    # second text of each pair
+    "paraphrase: What a time to be alive!",
+    "categorize_entailment: Ярославским баням разрешили работать без посетителей",
+    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
+]
+tokenizer = AutoTokenizer.from_pretrained("ai-forever/FRIDA")
+model = T5EncoderModel.from_pretrained("ai-forever/FRIDA")
+tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**tokenized_inputs)
+embeddings = pool(
+    outputs.last_hidden_state,
+    tokenized_inputs["attention_mask"],
+    pooling_method="cls" # or try "mean"
+)
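+# L2-normalize so that the dot product equals cosine similarity
+embeddings = F.normalize(embeddings, p=2, dim=1)
+# compare each of the first three texts with its counterpart among the last three
+sim_scores = embeddings[:3] @ embeddings[3:].T
+print(sim_scores.diag().tolist())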
+# [0.4796873927116394, 0.9409002065658569, 0.7761015892028809]
+```
+### SentenceTransformers
+```python
+from sentence_transformers import SentenceTransformer
+inputs = [
+    # first text of each pair
+    "paraphrase: Он нам и <unk> не нужон ваш Интернет!",
+    "categorize_entailment: В Ярославской области разрешили работу бань, но без посетителей",
+    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
+    # second text of each pair
+    "paraphrase: What a time to be alive!",
+    "categorize_entailment: Ярославским баням разрешили работать без посетителей",
+    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
+]
+# loads model with CLS pooling
+model = SentenceTransformer("ai-forever/FRIDA")
+# embeddings are normalized by default
+embeddings = model.encode(inputs, convert_to_tensor=True)
+sim_scores = embeddings[:3] @ embeddings[3:].T
+print(sim_scores.diag().tolist())
+# [0.47968706488609314, 0.940900444984436, 0.7761018872261047]
+```
+Alternatively, use prompts (requires sentence-transformers>=2.4.0):
+```python
+from sentence_transformers import SentenceTransformer
+# loads model with CLS pooling
+model = SentenceTransformer("ai-forever/FRIDA")
+paraphrase = model.encode(["Он нам и <unk> не нужон ваш Интернет!", "What a time to be alive!"], prompt_name="paraphrase")
+print(paraphrase[0] @ paraphrase[1].T) # 0.47968706488609314
+entailment = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt_name="categorize_entailment")
+print(entailment[0] @ entailment[1].T) # 0.940900444984436
+query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt_name="search_query")
+document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt_name="search_document")
+print(query_embedding @ document_embedding.T) # 0.7761018872261047
+```
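Because the embeddings are L2-normalized, the dot product used in the examples above equals cosine similarity, so ranking documents for a query reduces to sorting by dot product. The following is a minimal sketch with toy 3-dimensional vectors standing in for real model outputs; `l2_normalize` and `rank_documents` are hypothetical helpers, not part of either library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_documents(query_emb, doc_embs):
    """Return (index, score) pairs sorted by descending dot product."""
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model outputs (assumption: real embeddings
# would come from model.encode with the search_query / search_document prompts).
query = l2_normalize([1.0, 0.0, 1.0])
docs = [l2_normalize(v) for v in ([0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [1.0, 1.0, 0.0])]

ranking = rank_documents(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real embeddings, the same sorting step applies unchanged; only the vectors differ.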
+## Citation
+```
+@misc{TODO
+}
+```
+## Limitations
+The model is designed to process texts in Russian; its quality on English texts is unknown. The maximum input length is 512 tokens.
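One possible workaround for longer texts, not described by the model authors, is to split the token sequence into overlapping windows and embed each window separately. The sketch below works on a plain list of token ids; `split_into_windows` and the `overlap` value are assumptions for illustration, and in practice `max_len` would need to be reduced to leave room for the prefix and any special tokens.

```python
def split_into_windows(token_ids, max_len=512, overlap=64):
    """Split a token sequence into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that the context
    around a window boundary is not lost entirely.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return windows


# Stand-in for real token ids (assumption: these would come from the tokenizer).
token_ids = list(range(1200))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [512, 512, 304]
```

How the per-window embeddings are combined afterwards (for example, averaging them or keeping the best-scoring window) is a separate design choice that depends on the task.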