---
license: mit
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
  - russian
  - fill-mask
  - pretraining
  - embeddings
  - masked-lm
  - tiny
  - feature-extraction
  - sentence-similarity
  - sentence-transformers
  - transformers
widget:
- text: Метод опорных векторов
---
SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on [eLibrary](https://www.elibrary.ru/) data using the contrastive techniques described in this [Habr post](https://habr.com/ru/articles/781032). It achieves high metric values on the [ruSciBench](https://github.com/mlsa-iai-msu-lab/ru_sci_bench/tree/main) benchmark.

### How to get embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch


tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Tokenize sentences
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
```

Alternatively, you can use the `sentence_transformers` library:
```python
from sentence_transformers import SentenceTransformer


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
# (312,)
```
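
Once you have embeddings, a typical next step is ranking documents by similarity to a query. The sketch below is illustrative only: `rank_by_cosine` is a hypothetical helper, not part of the model or any library, and the vectors are random stand-ins rather than real model outputs. The first snippet above returns L2-normalized vectors, so cosine similarity reduces to a dot product. Whether `encode` returns unit-length vectors depends on the model's configuration, so the helper re-normalizes defensively.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs):
    """Hypothetical helper: sort documents by cosine similarity to the query, best first."""
    query = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_vecs, dtype=np.float64)
    # Re-normalize defensively; this changes nothing for unit-length inputs.
    query = query / max(np.linalg.norm(query), 1e-12)
    docs = docs / np.clip(np.linalg.norm(docs, axis=1, keepdims=True), 1e-12, None)
    scores = docs @ query           # cosine similarity of each document to the query
    order = np.argsort(-scores)     # indices sorted by descending similarity
    return order, scores[order]


# Stand-in vectors (NOT real model outputs); 312 matches the embedding size above.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(3, 312))
query_vec = doc_vecs[1] + 0.01 * rng.normal(size=312)  # near-duplicate of document 1

order, scores = rank_by_cosine(query_vec, doc_vecs)
print(int(order[0]))
# 1
```

In practice, `doc_vecs` would be the stacked outputs of `get_sentence_embedding` or `model.encode` for your corpus, and `query_vec` the embedding of the query text.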


### Authors
Benchmark developed by the MLSA Lab of the Institute for AI, MSU.

### Acknowledgement
This research is part of project #23-Ш05-21 SES MSU, "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank [eLibrary](https://elibrary.ru/) for providing the datasets.

### Contacts
Nikolai Gerasimenko ([email protected]), Alexey Vatolin ([email protected])