AviLaBSE

Model description

This is a unified model trained on top of Google's LaBSE to extend it to additional low-resource languages, and then converted to PyTorch. It can be used to map more than 250 languages to a shared vector space. The pre-training process combines masked language modeling with translation language modeling. The model is useful for obtaining multilingual sentence embeddings and for bi-text retrieval.

Usage

Using the model:

import torch
from transformers import BertModel, BertTokenizerFast


tokenizer = BertTokenizerFast.from_pretrained("sartifyllc/AviLaBSE")
model = BertModel.from_pretrained("sartifyllc/AviLaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)

To get the sentence embeddings, use the pooler output:

english_embeddings = english_outputs.pooler_output

Output for other low-resource languages:

swahili_sentences = [
    "mbwa",
    "Mbwa ni mzuri.",
    "Ninafurahia kutembea kwa muda mrefu kando ya pwani na mbwa wangu.",
]
zulu_sentences = [
    "inja",
    "Inja iyavuma.",
    "Ngithanda ukubhema izinyawo ezidlula emanzini nabanye nomfana wami.",
]

igbo_sentences = [
    "nwa nkịta",
    "Nwa nkịta dị ọma.",
    "Achọrọ m gaa n'okirikiri na ụzọ nke oke na mgbidi na nwa nkịta m."
]

swahili_inputs = tokenizer(swahili_sentences, return_tensors="pt", padding=True)
zulu_inputs = tokenizer(zulu_sentences, return_tensors="pt", padding=True)
igbo_inputs = tokenizer(igbo_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    swahili_outputs = model(**swahili_inputs)
    zulu_outputs = model(**zulu_inputs)
    igbo_outputs = model(**igbo_inputs)

swahili_embeddings = swahili_outputs.pooler_output
zulu_embeddings = zulu_outputs.pooler_output
igbo_embeddings = igbo_outputs.pooler_output
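
The tokenize, forward, and pool steps above repeat for every language, so they can be wrapped in one helper. The following is a minimal sketch: `embed_sentences` is a hypothetical name that is not part of the model or of `transformers`, and it assumes the `tokenizer` and `model` objects were loaded as shown earlier.

```python
import torch


def embed_sentences(sentences, tokenizer, model):
    """Return one pooled embedding per sentence (hypothetical helper).

    `tokenizer` and `model` are assumed to be the BertTokenizerFast and
    BertModel instances loaded from "sartifyllc/AviLaBSE".
    """
    # Pad to the longest sentence in the batch so the inputs form one tensor.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    # Inference only: skip gradient tracking.
    with torch.no_grad():
        outputs = model(**inputs)
    # The pooler output is used as the sentence embedding, as above.
    return outputs.pooler_output
```

With this helper, each language needs a single call, for example `swahili_embeddings = embed_sentences(swahili_sentences, tokenizer, model)`.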

For similarity between sentences, an L2-norm is recommended before calculating the similarity:

import torch.nn.functional as F

def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, swahili_embeddings))
print(similarity(english_embeddings, zulu_embeddings))
print(similarity(swahili_embeddings, igbo_embeddings))
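
For bi-text retrieval, the similarity matrix can be reduced to the best-matching candidate for each query sentence. The sketch below is one way to do that, not the authors' method: `retrieve_best_matches` is a hypothetical helper, and the toy 2-D vectors stand in for real sentence embeddings so the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def retrieve_best_matches(query_embeddings, candidate_embeddings):
    """For each query row, return the index and cosine score of the closest candidate."""
    # L2-normalize each row so the dot product equals cosine similarity.
    q = F.normalize(query_embeddings, p=2, dim=1)
    c = F.normalize(candidate_embeddings, p=2, dim=1)
    scores = q @ c.T  # shape: (num_queries, num_candidates)
    best_scores, best_indices = scores.max(dim=1)
    return best_indices, best_scores


# Toy vectors (assumption: placeholders for real embeddings).
queries = torch.tensor([[1.0, 0.0], [0.0, 2.0]])
candidates = torch.tensor([[0.0, 3.0], [5.0, 0.1], [-1.0, 0.0]])

indices, scores = retrieve_best_matches(queries, candidates)
print(indices.tolist())  # [1, 0]
```

With real embeddings, calling `retrieve_best_matches(english_embeddings, swahili_embeddings)` would give, for each English sentence, the position of the most similar Swahili sentence.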

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
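
The modules after the Transformer in the listing above form a small head: take the CLS token, apply a 768-to-768 dense layer with Tanh, and L2-normalize the result. The sketch below illustrates only that data flow. `SentenceHead` is a hypothetical class with randomly initialized weights, not the trained layers of this model, and the input is random data standing in for the BertModel token embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceHead(nn.Module):
    """Illustrative CLS pooling -> Dense + Tanh -> Normalize (untrained weights)."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size, bias=True)
        self.activation = nn.Tanh()

    def forward(self, token_embeddings):
        # (1) Pooling with pooling_mode_cls_token=True: keep the first token only.
        cls = token_embeddings[:, 0]
        # (2) Dense layer followed by Tanh.
        projected = self.activation(self.dense(cls))
        # (3) Normalize: scale each sentence vector to unit L2 length.
        return F.normalize(projected, p=2, dim=1)


torch.manual_seed(0)
# Random stand-in for Transformer output: (batch, sequence length, hidden size).
fake_token_embeddings = torch.randn(3, 16, 768)
head = SentenceHead()
sentence_embeddings = head(fake_token_embeddings)
print(sentence_embeddings.shape)  # torch.Size([3, 768])
```

Because the final step normalizes each vector, a plain dot product between two outputs of this pipeline is already a cosine similarity.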

Model size: 471M params (Safetensors; tensor types I64 and F32)