Our original base similarity Matryoshka
This is a sentence-transformers model fine-tuned from Ghani-25/LF_enrich_sim on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Ghani-25/LF_enrich_sim
- Maximum Sequence Length: 128 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: json
- Language: multilingual
- License: apache-2.0
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
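The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence vector is the average of the token vectors. The following is a minimal sketch of that idea in plain Python, assuming a 0/1 attention mask that marks padding tokens; the function name `mean_pool` and the toy vectors are illustrative and not part of the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; padding tokens are ignored."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

In the real model each token vector has 768 dimensions and the averaging runs on tensors, but the arithmetic is the same.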
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
# Run inference
sentences = [
    'Summer Job: Export Manager',
    'Responsable Export Afrique Amériquess',
    'Clinical Project Leader',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
# Similarity between the first two sentences
# (the diagonal of this self-similarity matrix is always 1, so read the off-diagonal entry)
pair_similarity = similarities[0, 1].item()
print(pair_similarity)
# 0.896542
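Because the model was trained with MatryoshkaLoss on the dimensions 768, 512, 256, 128 and 64, the leading components of an embedding can be used on their own. The sketch below shows the idea in plain Python, assuming toy 4-dimensional vectors in place of real model output: keep the first `dim` values, re-normalize, and compare with a dot product. The helper names are hypothetical. Recent versions of Sentence Transformers also accept a `truncate_dim` argument when loading a model, which may be more convenient in practice.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for two embeddings (real ones have 768 dimensions).
emb_a = [3.0, 4.0, 1.0, 2.0]
emb_b = [4.0, 3.0, -2.0, 0.5]

short_a = truncate_and_normalize(emb_a, 2)
short_b = truncate_and_normalize(emb_b, 2)
similarity_2d = dot(short_a, short_b)
print(round(similarity_2d, 4))  # 0.96
```

Smaller dimensions reduce storage and search cost at some loss in quality; the Evaluation section reports how the scores change per dimension.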
Evaluation
Metrics
Semantic Similarity
- Datasets: dim_768, dim_512, dim_256, dim_128 and dim_64
- Evaluated with EmbeddingSimilarityEvaluator
Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
---|---|---|---|---|---|
pearson_cosine | 0.9696 | 0.9693 | 0.9662 | 0.9606 | 0.9464 |
spearman_cosine | 0.9472 | 0.9466 | 0.9408 | 0.9315 | 0.9101 |
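The two metrics compare the model's cosine similarities with the gold labels: Pearson measures linear agreement, Spearman measures agreement of the rankings. Below is a minimal pure-Python sketch of both, using made-up scores; it is meant to illustrate the definitions, and the evaluator's own implementation may differ in details.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical cosine scores and gold labels for four sentence pairs.
predicted = [0.10, 0.40, 0.35, 0.90]
gold = [0.0, 0.5, 0.3, 1.0]

pearson_score = pearson(predicted, gold)
spearman_score = spearman(predicted, gold)
print(pearson_score, spearman_score)  # Spearman is close to 1.0: the rankings agree
```

In the table above, Spearman drops from 0.9472 at 768 dimensions to 0.9101 at 64 dimensions, which means the ranking of pairs is largely preserved after truncation.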
Training Details
Training Dataset
json
- Dataset: json
- Columns: sentence1, sentence2, and label
- Approximate statistics based on the first 1000 samples:
Property | sentence1 | sentence2 | label |
---|---|---|---|
type | string | string | float |
details | min: 3 tokens; mean: 10.22 tokens; max: 30 tokens | min: 3 tokens; mean: 9.98 tokens; max: 67 tokens | min: -0.05; mean: 0.37; max: 0.98 |
- Samples:
sentence1 | sentence2 | label |
---|---|---|
Contributive filmer | Doctorant contractuel (2016-2019) | 0.20986526 |
Responsable Développement et Communication | Bilingual Business Assistant | 0.3238712 |
Law Trainee | Sales Director contract manager | 0.24983984 |
- Loss: MatryoshkaLoss with these parameters:
{ "loss": "CosineSimilarityLoss", "matryoshka_dims": [768, 512, 256, 128, 64], "matryoshka_weights": [1, 1, 1, 1, 1], "n_dims_per_step": -1 }
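The parameters above describe a loss that applies CosineSimilarityLoss to the embeddings truncated at each listed dimension and adds the results with the given weights. The following is a rough plain-Python sketch of that combination, assuming the base loss is the mean squared error between cosine similarity and label; the function names and toy vectors are illustrative, and the library version operates on batched tensors.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(emb1: List[List[float]], emb2: List[List[float]],
                           labels: List[float]) -> float:
    """Mean squared error between predicted cosine similarity and the label."""
    errors = [(cosine(a, b) - y) ** 2 for a, b, y in zip(emb1, emb2, labels)]
    return sum(errors) / len(errors)


def matryoshka_loss(emb1: List[List[float]], emb2: List[List[float]],
                    labels: List[float], dims: List[int],
                    weights: List[float]) -> float:
    """Weighted sum of the base loss over embeddings truncated to each dim."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated1 = [e[:dim] for e in emb1]
        truncated2 = [e[:dim] for e in emb2]
        total += weight * cosine_similarity_loss(truncated1, truncated2, labels)
    return total


# Toy batch with one pair of 4-dimensional embeddings and a gold label of 1.0.
emb1 = [[1.0, 0.0, 1.0, 0.0]]
emb2 = [[1.0, 0.0, 0.0, 1.0]]
labels = [1.0]

# dim 4: cosine is 0.5, squared error 0.25; dim 2: cosine is 1.0, squared error 0.
loss = matryoshka_loss(emb1, emb2, labels, dims=[4, 2], weights=[1, 1])
print(round(loss, 4))  # 0.25
```

With `n_dims_per_step` set to -1, every listed dimension contributes at every training step, as in the loop above.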
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 16
- learning_rate: 2e-05
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: True
- tf32: True
- load_best_model_at_end: True
- optim: adamw_torch_fused
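To make these settings concrete, the sketch below computes the effective batch size and the learning rate over time for a cosine schedule with linear warmup. It rests on two assumptions not stated in the card: training ran on a single device, and the total number of optimizer steps was 244 (the last step in the training logs). The function `lr_at_step` is a hypothetical helper that mirrors the usual warmup-then-cosine formula, not the trainer's own code.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Samples per optimizer step on one device (assumption: single device).
effective_batch = 32 * 16  # per_device_train_batch_size * gradient_accumulation_steps

total_steps = 244  # assumption: last step shown in the training logs
warmup_steps = math.ceil(total_steps * 0.1)

print(effective_batch)                       # 512
print(warmup_steps)                          # 25
print(lr_at_step(0, total_steps))            # 0.0 at the start of warmup
print(lr_at_step(warmup_steps, total_steps)) # 2e-05 at the end of warmup
```

The learning rate peaks at 2e-05 once warmup ends and decays toward zero by the final step.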
All Hyperparameters
Contact the author.
Training Logs
Epoch | Step | Training Loss | dim_768_spearman_cosine | dim_512_spearman_cosine | dim_256_spearman_cosine | dim_128_spearman_cosine | dim_64_spearman_cosine |
---|---|---|---|---|---|---|---|
0.1624 | 10 | 0.0669 | - | - | - | - | - |
0.3249 | 20 | 0.0563 | - | - | - | - | - |
0.4873 | 30 | 0.0496 | - | - | - | - | - |
0.6497 | 40 | 0.0456 | - | - | - | - | - |
0.8122 | 50 | 0.0418 | - | - | - | - | - |
0.9746 | 60 | 0.0407 | - | - | - | - | - |
0.9909 | 61 | - | 0.9223 | 0.9199 | 0.9087 | 0.8920 | 0.8586 |
1.1371 | 70 | 0.0326 | - | - | - | - | - |
1.2995 | 80 | 0.0312 | - | - | - | - | - |
1.4619 | 90 | 0.0303 | - | - | - | - | - |
1.6244 | 100 | 0.03 | - | - | - | - | - |
1.7868 | 110 | 0.0291 | - | - | - | - | - |
1.9492 | 120 | 0.0301 | - | - | - | - | - |
1.9980 | 123 | - | 0.9393 | 0.9382 | 0.9304 | 0.9191 | 0.8946 |
2.1117 | 130 | 0.0257 | - | - | - | - | - |
2.2741 | 140 | 0.0243 | - | - | - | - | - |
2.4365 | 150 | 0.0246 | - | - | - | - | - |
2.5990 | 160 | 0.0235 | - | - | - | - | - |
2.7614 | 170 | 0.024 | - | - | - | - | - |
2.9239 | 180 | 0.023 | - | - | - | - | - |
2.9888 | 184 | - | 0.9464 | 0.9457 | 0.9396 | 0.9301 | 0.9083 |
3.0863 | 190 | 0.0222 | - | - | - | - | - |
3.2487 | 200 | 0.022 | - | - | - | - | - |
3.4112 | 210 | 0.022 | - | - | - | - | - |
3.5736 | 220 | 0.0226 | - | - | - | - | - |
3.7360 | 230 | 0.021 | - | - | - | - | - |
3.8985 | 240 | 0.0224 | - | - | - | - | - |
3.9635 | 244 | - | 0.9472 | 0.9466 | 0.9408 | 0.9315 | 0.9101 |
- The saved checkpoint is the final evaluation row (epoch 3.9635, step 244), whose scores match the reported evaluation metrics.
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.3.1
- Transformers: 4.41.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1