metadata
license: mit
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- russian
- fill-mask
- pretraining
- embeddings
- masked-lm
- tiny
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
widget:
- text: Метод опорных векторов
SciRus-tiny is a model to obtain embeddings of scientific texts in russian and english. Model was trained on eLibrary data with contrastive technics described in habr post. High metrics values were achieved on the ruSciBench benchmark.
How to get embeddings
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda() # if you want to use a GPU
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
# Tokenize sentences
sentence = '</s>'.join([title, abstract])
encoded_input = tokenizer(
[sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
return sentence_embeddings.cpu().detach().numpy()[0]
print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
Or you can use the sentence_transformers
:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['привет мир'])
print(embeddings[0].shape)
# (312,)
Authors
Benchmark developed by MLSA Lab of Institute for AI, MSU.
Acknowledgement
The research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for provided datasets.
Contacts
Nikolai Gerasimenko ([email protected]), Alexey Vatolin ([email protected])