---
base_model: somosnlp-hackathon-2022/paraphrase-spanish-distilroberta
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:44147
  - loss:SoftmaxLoss
widget:
  - source_sentence: >-
      Componentes y Equipos para Distribución y Sistemas de Acondicionamiento
      Instalaciones de tubos y entubamientos
    sentences:
      - Frijoles verdes congelados Fríjoles congelados
      - >-
        Brida reductora para tubos de plástico cpvc Bridas reductoras para
        tubos 
      - >-
        Naranja hamlin orgánica en lata o en frasco Naranjas orgánicas en lata o
        en frasco
  - source_sentence: Componentes y Suministros de Manufactura Ferretería
    sentences:
      - Terfenadina Antihistamínicos (bloqueadores H1)
      - Tomates verde Tomates
      - Ciruela sloe seca Ciruelas secas
  - source_sentence: >-
      Servicios Públicos y Servicios Relacionados con el Sector Público
      Servicios públicos
    sentences:
      - Chalote pikant orgánico Chalotes orgánicos
      - Rosal cortado seco ciciolina Rosas cortadas secas rosados
      - Rosal vivo peach sherbet Rosales vivos anaranjados
  - source_sentence: >-
      Maquinaria y Accesorios para Manufactura y Procesamiento Industrial
      Maquinaria y accesorios para cortar metales
    sentences:
      - Pimentón peperoncini seco Pimientos Secos
      - Ciruela diamante rojo congelada orgánica Ciruelas orgánicas congeladas
      - >-
        Máquinas para dar formas al metal en la superficie Máquinas perforadoras
        de metales
  - source_sentence: Alimentos, Bebidas y Tabaco  Vegetales orgánicos secos
    sentences:
      - Coliflo rdok elgon orgánica seca Coliflores  orgánicas secas
      - Arame  orgánica seca Vegetales marinos orgánicos secos
      - Cereza dark guines Cerezas
---

# SentenceTransformer based on somosnlp-hackathon-2022/paraphrase-spanish-distilroberta

This is a sentence-transformers model finetuned from somosnlp-hackathon-2022/paraphrase-spanish-distilroberta. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

  • Model Type: Sentence Transformer
  • Base model: somosnlp-hackathon-2022/paraphrase-spanish-distilroberta
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

### Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
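In practice this architecture means each input is truncated to 256 tokens and mean-pooled into a single 768-dimensional vector. A minimal sketch checking both settings on the loaded model (the repository id is taken from the usage section below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dfsandovalp01/paraphrase-spanish-distilroberta-MDD-pucCO-V2")

# Inputs longer than this many tokens are truncated before encoding
print(model.max_seq_length)                      # 256
# Dimensionality of the dense vector produced for each input
print(model.get_sentence_embedding_dimension())  # 768
```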

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("dfsandovalp01/paraphrase-spanish-distilroberta-MDD-pucCO-V2")
# Run inference
sentences = [
    'Alimentos, Bebidas y Tabaco  Vegetales orgánicos secos',
    'Coliflo rdok elgon orgánica seca Coliflores  orgánicas secas',
    'Arame  orgánica seca Vegetales marinos orgánicos secos',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
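Since the model was fine-tuned on category/description pairs, a natural downstream use is ranking candidate product descriptions against a query category. A minimal sketch, with the query and candidates borrowed from the widget examples above (`model.similarity` defaults to cosine similarity):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dfsandovalp01/paraphrase-spanish-distilroberta-MDD-pucCO-V2")

query = "Alimentos, Bebidas y Tabaco  Vegetales orgánicos secos"
candidates = [
    "Arame  orgánica seca Vegetales marinos orgánicos secos",
    "Brida reductora para tubos de plástico cpvc Bridas reductoras para tubos",
    "Cereza dark guines Cerezas",
]

query_embedding = model.encode([query])
candidate_embeddings = model.encode(candidates)

# Cosine similarity between the query and each candidate; shape is [1, 3]
scores = model.similarity(query_embedding, candidate_embeddings)[0]
for candidate, score in sorted(
    zip(candidates, scores.tolist()), key=lambda pair: pair[1], reverse=True
):
    print(f"{score:.3f}  {candidate}")
```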

## Training Details

### Training Dataset

#### Unnamed Dataset

  • Size: 44,147 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:

|         | sentence_0 | sentence_1 | label |
|:--------|:-----------|:-----------|:------|
| type    | string     | string     | int   |
| details | min: 5 tokens, mean: 15.49 tokens, max: 36 tokens | min: 3 tokens, mean: 13.39 tokens, max: 36 tokens | 0: ~48.80%, 1: ~8.30%, 2: ~42.90% |

  • Samples:

| sentence_0 | sentence_1 | label |
|:-----------|:-----------|:------|
| Maquinaria y Accesorios para Generación y Distribución de Energía Generación de energía | Amortiguador de veleta Equipo de cribado o estructuras de tubo de escape | 0 |
| Alimentos, Bebidas y Tabaco Fruta orgánica en lata o en frasco | Mangos mayaguez orgánico en lata o en frasco Mangos orgánicos en lata o en frasco | 0 |
| Alimentos, Bebidas y Tabaco Fruta orgánica congelada | Bolsa para transportar quimioterapia Equipo y suministros de quimioterapia | 1 |

  • Loss: SoftmaxLoss (see the training sketch below)
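The exact training script is not included in this card. Under the reported setup (SoftmaxLoss over three labels, batch size 16, one epoch), the fine-tuning could have looked roughly like the sketch below; the output directory and the one-row stand-in dataset are illustrative, not the actual 44,147-sample data.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import SoftmaxLoss

model = SentenceTransformer("somosnlp-hackathon-2022/paraphrase-spanish-distilroberta")

# One-row stand-in for the real dataset, using the columns reported above
train_dataset = Dataset.from_dict({
    "sentence_0": ["Alimentos, Bebidas y Tabaco Fruta orgánica congelada"],
    "sentence_1": ["Bolsa para transportar quimioterapia Equipo y suministros de quimioterapia"],
    "label": [1],
})

# SoftmaxLoss trains a linear classifier over concatenated sentence embeddings;
# num_labels=3 matches the label values 0, 1, 2 in the statistics above
loss = SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# Only the reported non-default hyperparameters are set; "outputs" is illustrative
args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```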

### Training Hyperparameters

#### Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

#### All Hyperparameters
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

### Training Logs

| Epoch  | Step | Training Loss |
|:-------|:-----|:--------------|
| 0.1812 | 500  | 0.6649        |
| 0.3623 | 1000 | 0.4498        |
| 0.5435 | 1500 | 0.3788        |
| 0.7246 | 2000 | 0.3636        |
| 0.9058 | 2500 | 0.353         |
| 0.1812 | 500  | 0.3429        |
| 0.3623 | 1000 | 0.3254        |
| 0.5435 | 1500 | 0.3359        |
| 0.7246 | 2000 | 0.3209        |
| 0.9058 | 2500 | 0.3311        |

The epoch counter restarting at 0.1812 midway suggests the logs of two consecutive training passes were recorded back to back.

### Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.0+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers and SoftmaxLoss

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```