BERT Medium Amharic Text Embedding

This is a sentence-transformers model finetuned from yosefw/bert-medium-am-embed on the json dataset. It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: yosefw/bert-medium-am-embed
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 512 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: am
  • License: apache-2.0
  • Model Size: 40.4M parameters (F32, Safetensors)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
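
Because the final Normalize() module L2-normalizes every embedding, cosine similarity between two vectors reduces to a plain dot product. A minimal sketch that verifies this (the two short inputs are arbitrary placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rasyosef/bert-amharic-text-embedding-medium")

# Each embedding has unit L2 norm thanks to the trailing Normalize() module
embeddings = model.encode(["ሰላም", "ሰላምታ"])
print(np.linalg.norm(embeddings, axis=1))  # ~[1.0, 1.0]

# So cosine similarity equals the dot product
print(float(embeddings[0] @ embeddings[1]))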

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("rasyosef/bert-amharic-text-embedding-medium")
# Run inference
sentences = [
  "የተደጋገመው የመሬት መንቀጥቀጥና የእሳተ ገሞራ ምልክት በአፋር ክልል",
  "ከተደጋጋሚ መሬት መንቀጥቀጥ በኋላ አፋር ክልል እሳት ከመሬት ውስጥ ሲፈላ ታይቷል፡፡ ከመሬት ውስጥ እሳትና ጭስ የሚተፋው እንፋሎቱ ዛሬ ማለዳውን 11 ሰዓት ግድም ከከባድ ፍንዳታ በኋላየተስተዋለ መሆኑን የአከባቢው ነዋሪዎች እና ባለስልጣናት ለዶቼ ቬለ ተናግረዋል፡፡ አለት የሚያፈናጥር እሳት ነው የተባለው እንፋሎቱ በክልሉ ጋቢረሱ (ዞን 03) ዱለቻ ወረዳ ሰጋንቶ ቀበሌ መከሰቱን የገለጹት የአከባቢው የአይን እማኞች ከዋናው ፍንዳታ በተጨማሪ በዙሪያው ተጨማሪ ፍንዳታዎች መታየት ቀጥሏል ባይ ናቸው፡፡",
  "ለኢትዮጵያ ብሔራዊ ባንክ ዋጋን የማረጋጋት ቀዳሚ ዓላማ ጋር የተጣጣሙ የገንዘብ ፖሊሲ ምክረ ሀሳቦችን እንዲሰጥ የተቋቋመው የኢትዮጵያ ብሔራዊ ባንክ የገንዘብ ፖሊሲ ኮሚቴ እስካለፈው ህዳር ወር የነበረው እአአ የ2024 የዋጋ ግሽበት በተለይምምግብ ነክ ምርቶች ላይ ከአንድ ዓመት በፊት ከነበው ጋር ሲነጻጸር መረጋጋት ማሳየቱን ጠቁሟል፡፡ ዶይቼ ቬለ ያነጋገራቸው የአዲስ አበባ ነዋሪዎች ግን በዚህ የሚስማሙ አይመስልም፡፡ ከአምና አንጻር ያልጨመረ ነገር የለም ባይ ናቸው፡፡ የኢኮኖሚ  ባለሙያም በሰጡን አስተያየት ጭማሪው በሁሉም ረገድ የተስተዋለ በመሆኑ የመንግስት ወጪን በመቀነስ ግብርናው ላይ አተኩሮ መስራት ምናልባትም የዋጋ መረጋጋቱን ሊያመጣ ይችላል ይላሉ፡፡"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 512]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
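
The same embeddings also support semantic search: encode a query, encode a document collection, and rank the documents by similarity. A minimal sketch that reuses the model and sentences loaded above (the query string is an illustrative placeholder):

# Semantic search: rank the two articles against a headline-style query
query_embedding = model.encode(["በአፋር ክልል የመሬት መንቀጥቀጥ"])
doc_embeddings = model.encode(sentences[1:])

scores = model.similarity(query_embedding, doc_embeddings)  # shape [1, 2]
best = scores.argmax().item()
print(best, scores[0, best].item())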

Evaluation

Metrics

Information Retrieval

Metric dim_512 dim_384 dim_256 dim_128 dim_64
cosine_accuracy@1 0.581 0.5765 0.5714 0.5573 0.5162
cosine_accuracy@3 0.7081 0.7042 0.7061 0.6946 0.6519
cosine_accuracy@5 0.7587 0.7559 0.7514 0.7369 0.701
cosine_accuracy@10 0.8175 0.8149 0.8059 0.7931 0.7632
cosine_precision@1 0.581 0.5765 0.5714 0.5573 0.5162
cosine_precision@3 0.236 0.2347 0.2354 0.2315 0.2173
cosine_precision@5 0.1517 0.1512 0.1503 0.1474 0.1402
cosine_precision@10 0.0817 0.0815 0.0806 0.0793 0.0763
cosine_recall@1 0.581 0.5765 0.5714 0.5573 0.5162
cosine_recall@3 0.7081 0.7042 0.7061 0.6946 0.6519
cosine_recall@5 0.7587 0.7559 0.7514 0.7369 0.701
cosine_recall@10 0.8175 0.8149 0.8059 0.7931 0.7632
cosine_ndcg@10 0.6958 0.6919 0.6863 0.6734 0.6374
cosine_mrr@10 0.6572 0.653 0.6482 0.6353 0.5975
cosine_map@100 0.6628 0.6587 0.6543 0.6414 0.6043
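
The columns above correspond to Matryoshka truncations of the full 512-dimensional embedding: smaller dimensions trade a modest drop in nDCG@10 for lower storage cost and faster retrieval. A truncated variant can be selected at load time via truncate_dim, as in this minimal sketch:

from sentence_transformers import SentenceTransformer

# Emit 256-dimensional embeddings (the dim_256 column above)
model_256 = SentenceTransformer(
    "rasyosef/bert-amharic-text-embedding-medium",
    truncate_dim=256,
)
embeddings = model_256.encode(["የዋጋ ግሽበት መረጋጋት አሳይቷል"])
print(embeddings.shape)  # (1, 256)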

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 28,046 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string (min: 4 tokens, mean: 15.02 tokens, max: 38 tokens)
    • positive: string (min: 47 tokens, mean: 213.84 tokens, max: 512 tokens)
  • Samples (each pairs a short anchor headline with a longer positive article):
    • anchor: የዱር እንስሳት ከሰዎች ጋር በሚኖራቸው ቁርኝት ለኮሮናቫይረስ ተጋላጭ እንዳይሆኑ የመከላከል ተግባራትን እያከናወኑ መሆኑን ባለስልጣኑ አስታወቀ፡፡
      positive: ባሕርዳር፡ ግንቦት 18/2012 ዓ.ም (አብመድ) የአማራ ክልል የአካባቢ፣ የደንና የዱር እንስሳት ጥበቃና ልማት ባለስልጣን በሚያስተዳድራቸው ብሔራዊ ፓርኮች እና የማኅበረሰብ ጥብቅ ሥፍራዎች ከኮሮናቫይረስ ተጋላጭነት ለመከላከል እየሠራ መሆኑን አስታውቋል፡፡የባለስልጣኑ የኮሙዩኒኬሽን ዳይሬክተር ጋሻው እሸቱ 10 በሚሆኑ ብሔራዊ ፓርኮችና የማኅበረሰብ ጥብቅ ሥፍራዎች የኮሮና ቫይረስን መከላከል በሚቻልባቸው ቅድመ ተግባራት እና ርምጃዎች ላይ መምከራቸውን ተናግረዋል፡፡ የዱር እንስሳት በመንጋ የሚኖሩ፣ እርስ በርሳቸው ተመጋጋቢ፣ ከሰዎች እና ከቤት እንስሳቶች ጋር ሊቀላቀሉ የሚችሉ በመሆናቸው በኮሮናቫይረስ ከተጋለጡ ‘‘የኮሮናቫይረስ ተጋላጭነት በብርቅየ የዱር እንስሳት ብዝኃ ሕይወት ላይ ስጋት መሆን የለበትም’’ ያሉት አቶ ጋሻው በፓርኮቹ ውስጥ ለሚሠሩ የጥበቃ፣ ስካውት እና ለጽሕፈት ቤት ሠራተኞች በዘርፉ ላይ ያተኮረ የኮሮናቫይረስ መከላከያ ትምህርቶችን እና የቁሳቁስ ድጋፎችን ማድረጋቸውን አስታውቀዋል፡፡
    • anchor: የትግራይ ክልል የአየር መሥመር ለአገልግሎት ክፍት ሆነ፡፡
      positive: የትግራይ ክልል የአየር መሥመር ለአገልግሎት ክፍት ሆነ፡፡ ባሕር ዳር፡ ታኅሣሥ 05/2013 ዓ.ም (አብመድ) በሰሜን ኢትዮጵያ ትግራይ ክልል የህግ ማስከበር ሂደትን ተከትሎ ተዘግቶ የነበረው የአየር ክልል ከዛሬ ታህሣሥ 5/2013 ዓ.ም ከቀኑ 8 ሰዓት ጀምሮ በሰሜን የኢትዮጵያ የአየር ክልል ውስጥ የሚያቋርጡ የአለም አቀፍ እና የሃገር ውስጥ የበረራ መስመሮች ለአገልግሎት ክፍት ሆነዋል፡፡ አገልግሎት መሥጠት የሚችሉ ኤርፖርቶች በረራ ማስተናገድ የሚችሉ መሆኑንም የኢትዮጵያ ሲቪል አቪዬሽን ባለስልጣን ገልጿል::
    • anchor: የአውሮፓ ኢንቨስትመንት ባንክ ለመንግሥት 76 ሚሊዮን ዶላር ሊያበድር ነው
      positive: በዳዊት እንደሻው የአውሮፓ ኢንቨስትመንት ባንክ ጽሕፈት ቤቱን በአዲስ አበባ ከከፈተ ከሁለት ዓመት በኋላ ትልቅ ነው የተባለለትን የ76 ሚሊዮን ዶላር ብድር ስምምነት ለመፈራረም፣ ኃላፊዎቹን ወደ ኢትዮጵያ ይልካል፡፡ከወር በፊት በኢትዮጵያ መንግሥትና በባንኩ መካከል የተደረገው ይኼ የብድር ስምምነት፣ የኢትዮጵያ ልማት ባንክ በሊዝ ፋይናንሲንግ ለአነስተኛና ለመካከለኛ ኢንተርፕራይዞች ለሚያደርገው እገዛ ይውላል፡፡የአውሮፓ ኢንቨስትመንት ባንክ ምክትል ፕሬዚዳንት ፒም ቫን በሌኮም፣ እንዲሁም ሌሎች ኃላፊዎች ይመጣሉ ተብሎ ይጠበቃል፡፡በዚህም መሠረት የባንኩ ኃላፊዎች ከገንዘብና ኢኮኖሚ ትብብር ሚኒስቴር ጋር አድርገውት ከነበረው ስምምነት የሚቀጥልና ተመሳሳይ የሆነ ስምምነት፣ ከኢትዮጵያ ልማት ባንክ ጋር እንደሚያደርጉ ይጠበቃል፡፡እ.ኤ.አ. እስከ 2022 ድረስ የሚቀጥለው አነስተኛና መካከለኛ ኢንተርፕራይዞችን የማገዝ ፕሮጀክት 276 ሚሊዮን ዶላር ወጪ የሚያስወጣ ሲሆን፣ ባለፈው ዓመት የዓለም ባንክ ወደ 200 ሚሊዮን ዶላር ብድር ሰጥቷል፡፡በአውሮፓ ኢንቨስትመንት ባንክ የሚሰጠው ብድር፣ የኢትዮጵያ ልማት ባንክን የሊዝ ፋይናንሲንግ ሥራ እንደሚያግዝ ጉዳዩ የሚመለከታቸው የልማት ባንክ ኃላፊዎች ለሪፖርተር ተናግረዋል፡፡ ‹‹በተጨማሪም የውጭ ምንዛሪ እጥረቱን ለማቃለል ያግዛል፤›› ሲሉ ኃላፊው ገልጸዋል፡፡በልማት ባንክ በኩል የሚደረገው እገዛ በሁለት መስኮቶች የሚወጣ ሲሆን፣ አንደኛው በቀጥታ በባንክ እንደ ሊዝ ፋይናንሲንግ ሲሰጥ ሌላው ደግሞ እንደ መሥሪያ ካፒታል ልማት ባንክ ለመረጣቸው 12 ባንኮችና ዘጠኝ ማይክሮ ፋይናንሶች ይሰጣል፡፡የአውሮፓ ኢንቨስትመንት ባንክ በኢትዮጵያ መንቀሳቀስ ከጀመረ ከ1980ዎቹ ጀምሮ ወደ ግማሽ ቢሊዮን ዶላር የሚጠጋ ለኃይል፣ ለኮሙዩኒኬሽንና ለግሉ ዘርፍ ኢ...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            512,
            384,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
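
A minimal sketch of how this loss configuration would be constructed in Sentence Transformers, assuming the base model named in this card and training pairs with the anchor/positive columns described above:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("yosefw/bert-medium-am-embed")

# In-batch-negatives ranking loss, applied at every Matryoshka
# dimension with equal weight, matching the parameters listed above
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[512, 384, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)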
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 2
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
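
A sketch of how these values map onto SentenceTransformerTrainingArguments (output_dir is a placeholder; save_strategy="epoch" is an assumption, added because load_best_model_at_end requires checkpointing to align with evaluation):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="bert-amharic-text-embedding-medium",  # placeholder
    eval_strategy="epoch",
    save_strategy="epoch",  # assumption: must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)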

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_512_cosine_ndcg@10 dim_384_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.0456 10 14.3172 - - - - -
0.0911 20 11.9004 - - - - -
0.1367 30 8.2867 - - - - -
0.1822 40 4.869 - - - - -
0.2278 50 3.7541 - - - - -
0.2733 60 3.1055 - - - - -
0.3189 70 2.6283 - - - - -
0.3645 80 2.2792 - - - - -
0.4100 90 2.0364 - - - - -
0.4556 100 1.9502 - - - - -
0.5011 110 1.6862 - - - - -
0.5467 120 1.6991 - - - - -
0.5923 130 1.5849 - - - - -
0.6378 140 1.3585 - - - - -
0.6834 150 1.464 - - - - -
0.7289 160 1.6712 - - - - -
0.7745 170 1.4967 - - - - -
0.8200 180 1.4184 - - - - -
0.8656 190 1.2148 - - - - -
0.9112 200 1.3443 - - - - -
0.9567 210 1.1794 - - - - -
1.0 220 1.1257 0.6572 0.6578 0.6471 0.6308 0.5837
1.0456 230 1.2824 - - - - -
1.0911 240 1.2316 - - - - -
1.1367 250 1.1745 - - - - -
1.1822 260 0.9189 - - - - -
1.2278 270 0.977 - - - - -
1.2733 280 0.9832 - - - - -
1.3189 290 0.9445 - - - - -
1.3645 300 0.8845 - - - - -
1.4100 310 0.754 - - - - -
1.4556 320 0.7767 - - - - -
1.5011 330 0.6453 - - - - -
1.5467 340 0.6502 - - - - -
1.5923 350 0.6711 - - - - -
1.6378 360 0.6081 - - - - -
1.6834 370 0.5782 - - - - -
1.7289 380 0.793 - - - - -
1.7745 390 0.6978 - - - - -
1.8200 400 0.7294 - - - - -
1.8656 410 0.6582 - - - - -
1.9112 420 0.5806 - - - - -
1.9567 430 0.5558 - - - - -
2.0 440 0.5417 0.6831 0.6801 0.6744 0.6640 0.6246
2.0456 450 0.6179 - - - - -
2.0911 460 0.5952 - - - - -
2.1367 470 0.604 - - - - -
2.1822 480 0.4688 - - - - -
2.2278 490 0.4907 - - - - -
2.2733 500 0.5165 - - - - -
2.3189 510 0.4703 - - - - -
2.3645 520 0.4971 - - - - -
2.4100 530 0.4522 - - - - -
2.4556 540 0.4145 - - - - -
2.5011 550 0.344 - - - - -
2.5467 560 0.392 - - - - -
2.5923 570 0.3371 - - - - -
2.6378 580 0.3402 - - - - -
2.6834 590 0.3535 - - - - -
2.7289 600 0.4581 - - - - -
2.7745 610 0.3701 - - - - -
2.8200 620 0.4221 - - - - -
2.8656 630 0.3886 - - - - -
2.9112 640 0.3828 - - - - -
2.9567 650 0.3737 - - - - -
3.0 660 0.3318 0.6921 0.6887 0.6852 0.6699 0.6339
3.0456 670 0.4025 - - - - -
3.0911 680 0.4092 - - - - -
3.1367 690 0.3605 - - - - -
3.1822 700 0.3218 - - - - -
3.2278 710 0.3362 - - - - -
3.2733 720 0.3451 - - - - -
3.3189 730 0.3476 - - - - -
3.3645 740 0.3594 - - - - -
3.4100 750 0.3324 - - - - -
3.4556 760 0.3144 - - - - -
3.5011 770 0.2667 - - - - -
3.5467 780 0.3241 - - - - -
3.5923 790 0.253 - - - - -
3.6378 800 0.2916 - - - - -
3.6834 810 0.2632 - - - - -
3.7289 820 0.348 - - - - -
3.7745 830 0.2788 - - - - -
3.8200 840 0.3224 - - - - -
3.8656 850 0.3144 - - - - -
3.9112 860 0.2926 - - - - -
3.9567 870 0.3002 - - - - -
3.9841 876 - 0.6958 0.6919 0.6863 0.6734 0.6374
  • The final row (epoch 3.9841, step 876) corresponds to the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
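
To reproduce this environment, the versions above can be pinned at install time (a sketch; choose a PyTorch build matching your CUDA setup separately):

pip install sentence-transformers==3.3.1 transformers==4.47.1 datasets==3.2.0 accelerate==1.2.1 tokenizers==0.21.0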

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}