---
license: mit
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- russian
- fill-mask
- pretraining
- embeddings
- masked-lm
- tiny
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
widget:
- text: Метод опорных векторов
---
SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on [eLibrary](https://www.elibrary.ru/) data using the contrastive techniques described in a [Habr post](https://habr.com/ru/articles/781032). It achieves high scores on the [ruSciBench](https://github.com/mlsa-iai-msu-lab/ru_sci_bench/tree/main) benchmark.
### How to get embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch

tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Tokenize sentences
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]


print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
```
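Because `get_sentence_embedding` L2-normalizes its output, the cosine similarity between two embeddings reduces to a plain dot product. The snippet below is a minimal sketch of ranking candidate papers against a query on that basis. `rank_by_similarity` is a hypothetical helper, and the 3-dimensional vectors are made-up placeholders standing in for real 312-dimensional embeddings so that the example runs without downloading the model.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(float(a) * float(b) for a, b in zip(u, v))


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder unit vectors (assumption: real inputs would come from get_sentence_embedding).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0], [1.0, 0.0, 0.0]]

ranking = rank_by_similarity(query, docs)
print(ranking[0][0])  # index of the document closest to the query
```

The helpers iterate element by element, so they also accept the NumPy arrays returned by `get_sentence_embedding`; for large collections a matrix product over stacked embeddings would likely be faster.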
The same embeddings can also be obtained with the `sentence_transformers` library:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
# (312,)
```
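`model.encode` takes a list of strings, so several papers can be embedded in one call. The sketch below builds those strings in the same `title + '</s>' + abstract` format used above. `build_inputs` and the `papers` records are hypothetical names introduced for illustration, and the final `encode` call is left commented out so the snippet runs without downloading the model.

```python
SEP = "</s>"  # separator placed between title and abstract, as in the examples above


def build_inputs(papers):
    """Join each paper's title and abstract into a single model input string."""
    return [SEP.join([p["title"], p["abstract"]]) for p in papers]


# Hypothetical records; real data would supply actual titles and abstracts.
papers = [
    {"title": "some title", "abstract": "some abstract"},
    {"title": "another title", "abstract": "another abstract"},
]

inputs = build_inputs(papers)
print(inputs[0])

# With the SentenceTransformer model loaded as above:
# embeddings = model.encode(inputs)  # one embedding per paper
```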
### Authors
The benchmark was developed by the MLSA Lab of the Institute for AI, MSU.
### Acknowledgement
This research is part of project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank [eLibrary](https://elibrary.ru/) for providing the datasets.
### Contacts
Nikolai Gerasimenko ([email protected]), Alexey Vatolin ([email protected])