README.md · mlsa-iai-msu-lab/sci-rus-tiny at refs/pr/1


	---
	license: mit
	language:
	- ru
	- en
	pipeline_tag: sentence-similarity
	tags:
	- russian
	- fill-mask
	- pretraining
	- embeddings
	- masked-lm
	- tiny
	- feature-extraction
	- sentence-similarity
	- sentence-transformers
	- transformers
	widget:
	- text: Метод опорных векторов
	---
	SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on [eLibrary](https://www.elibrary.ru/) data with the contrastive techniques described in this [Habr post](https://habr.com/ru/articles/781032). It achieves high scores on the [ruSciBench](https://github.com/mlsa-iai-msu-lab/ru_sci_bench/tree/main) benchmark.

	### How to get embeddings

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch.nn.functional as F
	import torch


	tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
	model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
	# model.cuda() # if you want to use a GPU

	def mean_pooling(model_output, attention_mask):
	    # The first element of model_output contains all token embeddings
	    token_embeddings = model_output[0]
	    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
	    # Sum the embeddings of real (non-padding) tokens and divide by their count
	    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


	def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
	    # Join title and abstract with the separator token, then tokenize
	    sentence = '</s>'.join([title, abstract])
	    encoded_input = tokenizer(
	        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
	    # Compute token embeddings
	    with torch.no_grad():
	        model_output = model(**encoded_input)
	    # Perform pooling
	    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
	    # Normalize embeddings to unit L2 norm
	    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
	    return sentence_embeddings.cpu().detach().numpy()[0]

	print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
	# (312,)
	```
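
To make the pooling step concrete, here is a minimal sketch in plain Python of what `mean_pooling` computes for a single sequence: the average of the token vectors, with padding positions excluded through the attention mask. The helper name `masked_mean` and the toy 2-dimensional vectors are illustrative assumptions, not part of the model code; the real function operates on batched `torch` tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Toy input: three tokens with 2-dimensional embeddings; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the attention mask is passed to the pooling function alongside the model output.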

	You can also obtain the embeddings with the `sentence-transformers` library:
	```python
	from sentence_transformers import SentenceTransformer


	model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
	embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
	print(embeddings[0].shape)
	# (312,)
	```
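
Since the model card lists sentence similarity as its pipeline, a common next step is to compare embeddings. The following is a minimal sketch, assuming you already have embedding vectors from either method above. The helpers `cosine_similarity` and `rank_by_similarity` are hypothetical names introduced for illustration, and the toy 3-dimensional vectors stand in for real 312-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    # Guard against division by zero for all-zero vectors
    return dot / max(norm_a * norm_b, 1e-12)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real 312-dimensional embeddings
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(rank_by_similarity(query, candidates))  # [1, 0, 2]
```

The sketch normalizes explicitly, so it does not depend on the inputs already having unit length. For vectors that are already L2-normalized, as in `get_sentence_embedding` above, the cosine similarity reduces to a plain dot product.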


	### Authors
	Benchmark developed by the MLSA Lab of the Institute for AI, MSU.

	### Acknowledgement
	This research is part of project #23-Ш05-21 SES MSU, "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We thank [eLibrary](https://elibrary.ru/) for providing the datasets.

	### Contacts
	Nikolai Gerasimenko ([email protected]), Alexey Vatolin ([email protected])