SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the allstats-semantic-search-synthetic-dataset-v2 dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v2

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
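
The Pooling block averages token embeddings over non-padding tokens (mean pooling) to produce one 384-dimensional vector per input. As an illustration only, here is a minimal sketch of that operation using the transformers library directly, bypassing the sentence-transformers wrapper; it assumes the model's transformer weights load via AutoModel:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the underlying BertModel and its tokenizer straight from the Hub.
tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstats-semantic-search-mini-model-v2")
bert = AutoModel.from_pretrained("yahyaabd/allstats-semantic-search-mini-model-v2")

encoded = tokenizer(
    ["Tingkat hunian kamar hotel tahun 2023"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling: sum token vectors where attention_mask == 1, divide by token count.
mask = encoded["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embedding.shape)  # torch.Size([1, 384])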

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2")
# Run inference
sentences = [
    # "In what year did the Rupiah depreciate 0.23 percent against the US Dollar?"
    'Tahun berapa Rupiah terdepresiasi 0,23 persen terhadap Dolar Amerika?',
    # "The Rupiah's depreciation against the US Dollar in 2016 was 0.5 percent."
    'Depresiasi Rupiah terhadap Dolar Amerika pada tahun 2016 sebesar 0,5 persen.',
    # "Summary of the Flow of Funds Accounts, First Quarter 2002 (Billion Rupiah)"
    'Ringkasan Neraca Arus Dana Triwulan Pertama, 2002, (Miliar Rupiah)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
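
Since the model targets semantic search, the usual pattern is to embed a document corpus once and rank it against each incoming query. A short sketch, reusing titles from the dataset samples further below as a small stand-in corpus:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2")

# Small stand-in corpus; in practice, embed the full document collection once and reuse it.
corpus = [
    "Tingkat Penghunian Kamar Hotel 2023",
    "Statistik Perusahaan Perikanan 2007",
    "Data dan Informasi Kemiskinan 2007 Buku 2: Kabupaten/Kota",
]
corpus_embeddings = model.encode(corpus)

query_embeddings = model.encode(["Tingkat hunian kamar hotel tahun 2023"])

# Cosine similarity of the query against every document, best match first.
scores = model.similarity(query_embeddings, corpus_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")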

Evaluation

Metrics

Semantic Similarity

Metric            allstats-semantic-search-mini-v2-eval   allstat-semantic-search-mini-v2-test
pearson_cosine    0.9839                                   0.9831
spearman_cosine   0.8951                                   0.8922
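
The pearson_cosine and spearman_cosine names correspond to sentence-transformers' EmbeddingSimilarityEvaluator, which correlates the cosine similarities of (query, doc) pairs with the gold labels. A sketch of how such an evaluation could be reproduced; the Hub dataset id and the split name are assumptions based on this card:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2")

# Assumed dataset id and split name.
ds = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v2", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=ds["query"],
    sentences2=ds["doc"],
    scores=ds["label"],
    name="allstat-semantic-search-mini-v2-test",
)
print(evaluator(model))  # dict including pearson_cosine and spearman_cosine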

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v2

  • Dataset: allstats-semantic-search-synthetic-dataset-v2 at c76f31a
  • Size: 244,856 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
      query: string, min 3 tokens, mean 12.75 tokens, max 45 tokens
      doc:   string, min 4 tokens, mean 14.81 tokens, max 56 tokens
      label: float, min 0.0, mean 0.54, max 1.0
  • Samples (query misspellings are verbatim from the synthetic dataset, not transcription errors):
      query: Dtaa harg konsymen edesaan (non-makann) 201
      doc:   Statistik Harga Konsumen Perdesaan Kelompok Nonmakanan (Data 2013)
      label: 0.95

      query: Bagaimna konidsi keuamgan rymah atngga Indonsia 2020-2022?
      doc:   Statistik Perusahaan Perikanan 2007
      label: 0.1

      query: Tingkat hunian kamar hotel tahun 2023
      doc:   Tingkat Penghunian Kamar Hotel 2023
      label: 0.99
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    
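CosineSimilarityLoss embeds each (query, doc) pair, computes their cosine similarity, and regresses it onto the float label with MSE. A minimal sketch of loading the dataset and constructing the loss; the Hub dataset id and split name are assumptions, while the revision hash comes from this card:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Assumed dataset id and split; the card pins revision c76f31a.
train_dataset = load_dataset(
    "yahyaabd/allstats-semantic-search-synthetic-dataset-v2",
    split="train",
    revision="c76f31a",
)

# Cosine similarity of each (query, doc) embedding pair is trained
# toward the 0-1 label under an MSE objective.
loss = CosineSimilarityLoss(model)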

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v2

  • Dataset: allstats-semantic-search-synthetic-dataset-v2 at c76f31a
  • Size: 52,469 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
      query: string, min 3 tokens, mean 13.04 tokens, max 43 tokens
      doc:   string, min 4 tokens, mean 15.01 tokens, max 54 tokens
      label: float, min 0.0, mean 0.52, max 1.0
  • Samples:
      query: Bulan apa NTP mengalami kenaikan 0,25 persen?
      doc:   Jumlah Wisatawan Mancanegara Bulan Agustus 2009 Turun 4,49 Persen Dibandingkan Bulan Sebelumnya.
      label: 0.0

      query: Sebutksn keempa komositi tang disebutkn besert persentae mrajin persagangannya.
      doc:   Marjin Perdagangan Minyak Goreng 3,86 Persen, Terigu 5,92 Persen, Garam 23,74 Persen, Dan Susu Bubuk 13,02 Persen
      label: 1.0

      query: Data kemiskinan per kabupaten/kota tahun 2007
      doc:   Data dan Informasi Kemiskinan 2007 Buku 2: Kabupaten/Kota
      label: 0.87
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
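
Put together, a run with these settings could look like the following SentenceTransformerTrainer sketch; the dataset id, split names, and output directory are assumptions, and only the hyperparameters listed above are taken from this card:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
dataset = load_dataset("yahyaabd/allstats-semantic-search-synthetic-dataset-v2")  # assumed id

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-search-mini-model-v2",  # hypothetical path
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=8,
    warmup_ratio=0.1,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],  # assumed split names
    eval_dataset=dataset["test"],
    loss=CosineSimilarityLoss(model),
)
trainer.train()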

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-search-mini-v2-eval_spearman_cosine allstat-semantic-search-mini-v2-test_spearman_cosine
0.1307 500 0.0963 0.0657 0.6836 -
0.2614 1000 0.0558 0.0428 0.7480 -
0.3921 1500 0.0403 0.0335 0.7665 -
0.5227 2000 0.0324 0.0285 0.7744 -
0.6534 2500 0.0284 0.0255 0.7987 -
0.7841 3000 0.0246 0.0225 0.7883 -
0.9148 3500 0.0217 0.0217 0.7964 -
1.0455 4000 0.0193 0.0187 0.8111 -
1.1762 4500 0.017 0.0174 0.8086 -
1.3068 5000 0.0163 0.0170 0.8157 -
1.4375 5500 0.0157 0.0161 0.8000 -
1.5682 6000 0.015 0.0156 0.8133 -
1.6989 6500 0.0146 0.0146 0.8194 -
1.8296 7000 0.014 0.0140 0.8103 -
1.9603 7500 0.013 0.0132 0.8205 -
2.0910 8000 0.0111 0.0126 0.8353 -
2.2216 8500 0.0102 0.0123 0.8407 -
2.3523 9000 0.0101 0.0118 0.8389 -
2.4830 9500 0.01 0.0115 0.8444 -
2.6137 10000 0.0097 0.0111 0.8456 -
2.7444 10500 0.0097 0.0105 0.8524 -
2.8751 11000 0.0091 0.0102 0.8526 -
3.0058 11500 0.0088 0.0100 0.8561 -
3.1364 12000 0.0069 0.0095 0.8619 -
3.2671 12500 0.0071 0.0094 0.8534 -
3.3978 13000 0.0068 0.0092 0.8648 -
3.5285 13500 0.0069 0.0093 0.8638 -
3.6592 14000 0.0071 0.0091 0.8548 -
3.7899 14500 0.0065 0.0085 0.8711 -
3.9205 15000 0.0064 0.0084 0.8622 -
4.0512 15500 0.0061 0.0080 0.8675 -
4.1819 16000 0.0051 0.0082 0.8673 -
4.3126 16500 0.0052 0.0080 0.8659 -
4.4433 17000 0.0053 0.0078 0.8669 -
4.5740 17500 0.0053 0.0077 0.8690 -
4.7047 18000 0.005 0.0076 0.8758 -
4.8353 18500 0.0048 0.0074 0.8700 -
4.9660 19000 0.0049 0.0072 0.8785 -
5.0967 19500 0.0041 0.0070 0.8795 -
5.2274 20000 0.0039 0.0071 0.8803 -
5.3581 20500 0.0039 0.0071 0.8843 -
5.4888 21000 0.0041 0.0070 0.8818 -
5.6194 21500 0.0039 0.0069 0.8812 -
5.7501 22000 0.0038 0.0068 0.8868 -
5.8808 22500 0.0038 0.0067 0.8831 -
6.0115 23000 0.0037 0.0066 0.8869 -
6.1422 23500 0.003 0.0065 0.8888 -
6.2729 24000 0.0031 0.0064 0.8879 -
6.4036 24500 0.0032 0.0064 0.8881 -
6.5342 25000 0.003 0.0062 0.8919 -
6.6649 25500 0.0031 0.0062 0.8919 -
6.7956 26000 0.0031 0.0061 0.8910 -
6.9263 26500 0.003 0.0061 0.8911 -
7.0570 27000 0.0028 0.0061 0.8925 -
7.1877 27500 0.0025 0.0061 0.8922 -
7.3183 28000 0.0026 0.0060 0.8944 -
7.4490 28500 0.0026 0.0061 0.8953 -
7.5797 29000 0.0026 0.0060 0.8948 -
7.7104 29500 0.0025 0.0060 0.8941 -
7.8411 30000 0.0025 0.0059 0.8950 -
7.9718 30500 0.0025 0.0059 0.8951 -
8.0 30608 - - - 0.8922

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}