tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:244856
  - loss:CosineSimilarityLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
  - source_sentence: Bulan apa inflasi sebesar 0,63 persen terjadi pada tahun 2013?
    sentences:
      - Pada bulan Mei 2013 terjadi inflasi sebesar 0,2 persen
      - >-
        Nilai Tukar Petani (NTP) April 2024 sebesar 116,79 atau turun 2,18
        persen.
      - >-
        Posisi Kredit Perbankan<sup>1</sup>dalam Rupiah dan Valuta Asing Menurut
        Sektor Ekonomi (miliar rupiah), 2016-2018
  - source_sentence: Berapa persen penurunan Nilai Tukar Petani NTP Februari 2017
    sentences:
      - Produksi Tanaman Pangan Angka Ramalan II Tahun 2015
      - >-
        Nilai Tukar Petani (NTP) Februari 2017 Sebesar 100,33 Atau Turun 0,58
        Persen
      - Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut HS, Juni 2024
  - source_sentence: analisis industri pariwisata indonesia tahun 2013
    sentences:
      - Ringkasan Neraca Arus Dana, Triwulan IV, 2012), (Miliar Rupiah)
      - Pengeluaran Untuk Konsumsi Penduduk Indonesia September 2014
      - >-
        Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok
        Komoditi dan Negara, Desember 2020
  - source_sentence: Sosial ekonomi Indonesia bulan November 2020
    sentences:
      - Pos Kesehatan Desa
      - >-
        Jumlah Wisman Pada Januari 2011 Naik 11,14 Persen dan Penumpang Angkutan
        Udara Domestik Pada Januari 2011 Turun 6,88 Persen
      - Laporan Bulanan Data Sosial Ekonomi September 2017
  - source_sentence: Tahun berapa Rupiah terdepresiasi 0,23 persen terhadap Dolar Amerika?
    sentences:
      - 'Nilai Impor Menurut Negara Asal Utama (Nilai CIF: juta US$), 2000-2023'
      - Ringkasan Neraca Arus Dana Triwulan Pertama, 2002, (Miliar Rupiah)
      - >-
        Depresiasi Rupiah terhadap Dolar Amerika pada tahun 2016 sebesar 0,5
        persen.
datasets:
  - yahyaabd/allstats-semantic-search-synthetic-dataset-v2
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic search mini v2 eval
          type: allstats-semantic-search-mini-v2-eval
        metrics:
          - type: pearson_cosine
            value: 0.9838643974678674
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8951406685580494
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic search mini v2 test
          type: allstat-semantic-search-mini-v2-test
        metrics:
          - type: pearson_cosine
            value: 0.98307083670705
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8922084062478435
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the allstats-semantic-search-synthetic-dataset-v2 dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
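
The Pooling module has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of its token vectors, with padding positions left out. The pure-Python sketch below illustrates that idea only. The function name `mean_pool` and the toy 3-dimensional vectors are assumptions made for illustration; the real model works on 384-dimensional tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors, skipping positions whose mask is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]


# Toy input (hypothetical): three real tokens followed by one padding position.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

The padding vector `[9.0, 9.0, 9.0]` does not affect the result because its mask entry is 0.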

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2")
# Run inference
sentences = [
    'Tahun berapa Rupiah terdepresiasi 0,23 persen terhadap Dolar Amerika?',
    'Depresiasi Rupiah terhadap Dolar Amerika pada tahun 2016 sebesar 0,5 persen.',
    'Ringkasan Neraca Arus Dana Triwulan Pertama, 2002, (Miliar Rupiah)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
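
For semantic search, the similarity scores are typically used to rank candidate documents against a single query. The sketch below shows that ranking step in pure Python so it runs without downloading the model. The helper names `cosine_similarity` and `rank_documents` and the toy vectors are assumptions for illustration; in practice the vectors would come from `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for model.encode(...) output.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]
ranking = rank_documents(query_vec, doc_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```

With the real model, the query vector would be `embeddings[0]` and the candidates `embeddings[1:]` from the usage snippet above.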

Evaluation

Metrics

Semantic Similarity

| Metric | allstats-semantic-search-mini-v2-eval | allstat-semantic-search-mini-v2-test |
|---|---|---|
| pearson_cosine | 0.9839 | 0.9831 |
| spearman_cosine | 0.8951 | 0.8922 |
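
Both metrics compare the model's cosine scores with the gold labels: pearson_cosine measures linear agreement of the raw values, while spearman_cosine measures agreement of their ranks. The sketch below is a pure-Python re-implementation for illustration, not the evaluator's own code, and the `predicted` scores are made up. It shows one way a gap between the two metrics can arise: small swaps among near-identical scores barely move Pearson but lower Spearman.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 0.1, 0.87, 0.95, 1.0]          # example labels
predicted = [0.05, 0.02, 0.90, 0.99, 0.97]  # hypothetical cosine scores

print(round(pearson(gold, predicted), 3))
print(round(spearman(gold, predicted), 3))
```

In this toy data two adjacent pairs are ranked in the wrong order, which is enough to pull the rank correlation well below the linear one.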

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v2

  • Dataset: allstats-semantic-search-synthetic-dataset-v2 at c76f31a
  • Size: 244,856 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |---|---|---|---|
    | type    | string | string | float |
    | details | min: 3 tokens; mean: 12.75 tokens; max: 45 tokens | min: 4 tokens; mean: 14.81 tokens; max: 56 tokens | min: 0.0; mean: 0.54; max: 1.0 |
  • Samples:
    | query | doc | label |
    |---|---|---|
    | Dtaa harg konsymen edesaan (non-makann) 201 | Statistik Harga Konsumen Perdesaan Kelompok Nonmakanan (Data 2013) | 0.95 |
    | Bagaimna konidsi keuamgan rymah atngga Indonsia 2020-2022? | Statistik Perusahaan Perikanan 2007 | 0.1 |
    | Tingkat hunian kamar hotel tahun 2023 | Tingkat Penghunian Kamar Hotel 2023 | 0.99 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
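
CosineSimilarityLoss with an MSELoss `loss_fct` penalizes the squared difference between the cosine similarity of the two embeddings and the float label. The sketch below computes that quantity in pure Python on toy 2-dimensional vectors. The function names and the example pairs are assumptions for illustration; the library version operates on batched tensors.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_mse(
    pairs: List[Tuple[Sequence[float], Sequence[float]]], labels: Sequence[float]
) -> float:
    """Mean over the batch of (cos(u, v) - label) ** 2."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Hypothetical (query embedding, doc embedding) pairs with their labels.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # cosine similarity 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # cosine similarity 0.0
]
labels = [0.9, 0.1]
loss = cosine_similarity_mse(pairs, labels)
print(round(loss, 4))  # 0.01
```

Each pair is off by 0.1 from its label, so the mean squared error is 0.01.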
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v2

  • Dataset: allstats-semantic-search-synthetic-dataset-v2 at c76f31a
  • Size: 52,469 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |---|---|---|---|
    | type    | string | string | float |
    | details | min: 3 tokens; mean: 13.04 tokens; max: 43 tokens | min: 4 tokens; mean: 15.01 tokens; max: 54 tokens | min: 0.0; mean: 0.52; max: 1.0 |
  • Samples:
    | query | doc | label |
    |---|---|---|
    | Bulan apa NTP mengalami kenaikan 0,25 persen? | Jumlah Wisatawan Mancanegara Bulan Agustus 2009 Turun 4,49 Persen Dibandingkan Bulan Sebelumnya. | 0.0 |
    | Sebutksn keempa komositi tang disebutkn besert persentae mrajin persagangannya. | Marjin Perdagangan Minyak Goreng 3,86 Persen, Terigu 5,92 Persen, Garam 23,74 Persen, Dan Susu Bubuk 13,02 Persen | 1.0 |
    | Data kemiskinan per kabupaten/kota tahun 2007 | Data dan Informasi Kemiskinan 2007 Buku 2: Kabupaten/Kota | 0.87 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
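
These settings, together with the training set size, determine the step counts that appear in the training logs. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, the last partial batch kept, and the warmup step count rounded up.

```python
import math

train_samples = 244_856  # training set size reported in this card
batch_size = 64          # per_device_train_batch_size
epochs = 8               # num_train_epochs
warmup_ratio = 0.1

# Assumption: one device and no gradient accumulation, so one batch is one step.
# The last, smaller batch of each epoch is kept rather than dropped.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: the warmup length is the ratio of total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def epoch_at(step: int) -> float:
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch


print(steps_per_epoch)            # 3826
print(total_steps)                # 30608
print(warmup_steps)               # 3061
print(round(epoch_at(500), 4))    # 0.1307
```

The total of 30608 steps and the epoch value 0.1307 at step 500 agree with the first and last rows of the training logs.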

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
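
With learning_rate 5e-05, lr_scheduler_type linear, and warmup_ratio 0.1, the learning rate rises linearly from zero during warmup and then decays linearly back to zero. The function below is a minimal sketch of that shape, not the scheduler's actual code. The default of 30608 total steps is the final step in the training logs; 3061 warmup steps assumes 10% of that total, rounded up.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 5e-5,
    warmup_steps: int = 3061,   # assumption: 10% of total steps, rounded up
    total_steps: int = 30608,   # final step reported in the training logs
) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Sample the schedule at the start, the peak, a mid-decay point, and the end.
for step in (0, 3061, 16000, 30608):
    print(step, linear_schedule_lr(step))
```

The peak value is reached at the end of warmup, and the rate is zero again at the last step.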

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-search-mini-v2-eval_spearman_cosine | allstat-semantic-search-mini-v2-test_spearman_cosine |
|---|---|---|---|---|---|
| 0.1307 | 500 | 0.0963 | 0.0657 | 0.6836 | - |
| 0.2614 | 1000 | 0.0558 | 0.0428 | 0.7480 | - |
| 0.3921 | 1500 | 0.0403 | 0.0335 | 0.7665 | - |
| 0.5227 | 2000 | 0.0324 | 0.0285 | 0.7744 | - |
| 0.6534 | 2500 | 0.0284 | 0.0255 | 0.7987 | - |
| 0.7841 | 3000 | 0.0246 | 0.0225 | 0.7883 | - |
| 0.9148 | 3500 | 0.0217 | 0.0217 | 0.7964 | - |
| 1.0455 | 4000 | 0.0193 | 0.0187 | 0.8111 | - |
| 1.1762 | 4500 | 0.017 | 0.0174 | 0.8086 | - |
| 1.3068 | 5000 | 0.0163 | 0.0170 | 0.8157 | - |
| 1.4375 | 5500 | 0.0157 | 0.0161 | 0.8000 | - |
| 1.5682 | 6000 | 0.015 | 0.0156 | 0.8133 | - |
| 1.6989 | 6500 | 0.0146 | 0.0146 | 0.8194 | - |
| 1.8296 | 7000 | 0.014 | 0.0140 | 0.8103 | - |
| 1.9603 | 7500 | 0.013 | 0.0132 | 0.8205 | - |
| 2.0910 | 8000 | 0.0111 | 0.0126 | 0.8353 | - |
| 2.2216 | 8500 | 0.0102 | 0.0123 | 0.8407 | - |
| 2.3523 | 9000 | 0.0101 | 0.0118 | 0.8389 | - |
| 2.4830 | 9500 | 0.01 | 0.0115 | 0.8444 | - |
| 2.6137 | 10000 | 0.0097 | 0.0111 | 0.8456 | - |
| 2.7444 | 10500 | 0.0097 | 0.0105 | 0.8524 | - |
| 2.8751 | 11000 | 0.0091 | 0.0102 | 0.8526 | - |
| 3.0058 | 11500 | 0.0088 | 0.0100 | 0.8561 | - |
| 3.1364 | 12000 | 0.0069 | 0.0095 | 0.8619 | - |
| 3.2671 | 12500 | 0.0071 | 0.0094 | 0.8534 | - |
| 3.3978 | 13000 | 0.0068 | 0.0092 | 0.8648 | - |
| 3.5285 | 13500 | 0.0069 | 0.0093 | 0.8638 | - |
| 3.6592 | 14000 | 0.0071 | 0.0091 | 0.8548 | - |
| 3.7899 | 14500 | 0.0065 | 0.0085 | 0.8711 | - |
| 3.9205 | 15000 | 0.0064 | 0.0084 | 0.8622 | - |
| 4.0512 | 15500 | 0.0061 | 0.0080 | 0.8675 | - |
| 4.1819 | 16000 | 0.0051 | 0.0082 | 0.8673 | - |
| 4.3126 | 16500 | 0.0052 | 0.0080 | 0.8659 | - |
| 4.4433 | 17000 | 0.0053 | 0.0078 | 0.8669 | - |
| 4.5740 | 17500 | 0.0053 | 0.0077 | 0.8690 | - |
| 4.7047 | 18000 | 0.005 | 0.0076 | 0.8758 | - |
| 4.8353 | 18500 | 0.0048 | 0.0074 | 0.8700 | - |
| 4.9660 | 19000 | 0.0049 | 0.0072 | 0.8785 | - |
| 5.0967 | 19500 | 0.0041 | 0.0070 | 0.8795 | - |
| 5.2274 | 20000 | 0.0039 | 0.0071 | 0.8803 | - |
| 5.3581 | 20500 | 0.0039 | 0.0071 | 0.8843 | - |
| 5.4888 | 21000 | 0.0041 | 0.0070 | 0.8818 | - |
| 5.6194 | 21500 | 0.0039 | 0.0069 | 0.8812 | - |
| 5.7501 | 22000 | 0.0038 | 0.0068 | 0.8868 | - |
| 5.8808 | 22500 | 0.0038 | 0.0067 | 0.8831 | - |
| 6.0115 | 23000 | 0.0037 | 0.0066 | 0.8869 | - |
| 6.1422 | 23500 | 0.003 | 0.0065 | 0.8888 | - |
| 6.2729 | 24000 | 0.0031 | 0.0064 | 0.8879 | - |
| 6.4036 | 24500 | 0.0032 | 0.0064 | 0.8881 | - |
| 6.5342 | 25000 | 0.003 | 0.0062 | 0.8919 | - |
| 6.6649 | 25500 | 0.0031 | 0.0062 | 0.8919 | - |
| 6.7956 | 26000 | 0.0031 | 0.0061 | 0.8910 | - |
| 6.9263 | 26500 | 0.003 | 0.0061 | 0.8911 | - |
| 7.0570 | 27000 | 0.0028 | 0.0061 | 0.8925 | - |
| 7.1877 | 27500 | 0.0025 | 0.0061 | 0.8922 | - |
| 7.3183 | 28000 | 0.0026 | 0.0060 | 0.8944 | - |
| 7.4490 | 28500 | 0.0026 | 0.0061 | 0.8953 | - |
| 7.5797 | 29000 | 0.0026 | 0.0060 | 0.8948 | - |
| 7.7104 | 29500 | 0.0025 | 0.0060 | 0.8941 | - |
| 7.8411 | 30000 | 0.0025 | 0.0059 | 0.8950 | - |
| 7.9718 | 30500 | 0.0025 | 0.0059 | 0.8951 | - |
| 8.0 | 30608 | - | - | - | 0.8922 |

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}