---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:5000
- loss:MultipleNegativesRankingLoss
base_model: lufercho/my-finetuned-bert-mlm
widget:
- source_sentence: "A Comprehensive Approach to Universal Piecewise Nonlinear Regression\n\
    \  Based on Trees"
  sentences:
  - "  In sparse recovery we are given a matrix $A$ (the dictionary) and a vector\
    \ of\nthe form $A X$ where $X$ is sparse, and the goal is to recover $X$. This\
    \ is a\ncentral notion in signal processing, statistics and machine learning.\
    \ But in\napplications such as sparse coding, edge detection, compression and\
    \ super\nresolution, the dictionary $A$ is unknown and has to be learned from\
    \ random\nexamples of the form $Y = AX$ where $X$ is drawn from an appropriate\n\
    distribution --- this is the dictionary learning problem. In most settings, $A$\n\
    is overcomplete: it has more columns than rows. This paper presents a\npolynomial-time\
    \ algorithm for learning overcomplete dictionaries; the only\npreviously known\
    \ algorithm with provable guarantees is the recent work of\nSpielman, Wang and\
    \ Wright who gave an algorithm for the full-rank case, which\nis rarely the case\
    \ in applications. Our algorithm applies to incoherent\ndictionaries which have\
    \ been a central object of study since they were\nintroduced in seminal work of\
    \ Donoho and Huo. In particular, a dictionary is\n$\\mu$-incoherent if each pair\
    \ of columns has inner product at most $\\mu /\n\\sqrt{n}$.\n  The algorithm makes\
    \ natural stochastic assumptions about the unknown sparse\nvector $X$, which can\
    \ contain $k \\leq c \\min(\\sqrt{n}/\\mu \\log n, m^{1/2\n-\\eta})$ non-zero\
    \ entries (for any $\\eta > 0$). This is close to the best $k$\nallowable by the\
    \ best sparse recovery algorithms even if one knows the\ndictionary $A$ exactly.\
    \ Moreover, both the running time and sample complexity\ndepend on $\\log 1/\\\
    epsilon$, where $\\epsilon$ is the target accuracy, and so\nour algorithms converge\
    \ very quickly to the true dictionary. Our algorithm can\nalso tolerate substantial\
    \ amounts of noise provided it is incoherent with\nrespect to the dictionary (e.g.,\
    \ Gaussian). In the noisy setting, our running\ntime and sample complexity depend\
    \ polynomially on $1/\\epsilon$, and this is\nnecessary.\n"
  - '  In this paper, we investigate adaptive nonlinear regression and introduce

    tree based piecewise linear regression algorithms that are highly efficient and

    provide significantly improved performance with guaranteed upper bounds in an

    individual sequence manner. We use a tree notion in order to partition the

    space of regressors in a nested structure. The introduced algorithms adapt not

    only their regression functions but also the complete tree structure while

    achieving the performance of the "best" linear mixture of a doubly exponential

    number of partitions, with a computational complexity only polynomial in the

    number of nodes of the tree. While constructing these algorithms, we also avoid

    using any artificial "weighting" of models (with highly data dependent

    parameters) and, instead, directly minimize the final regression error, which

    is the ultimate performance goal. The introduced methods are generic such that

    they can readily incorporate different tree construction methods such as random

    trees in their framework and can use different regressor or partitioning

    functions as demonstrated in the paper.

    '
  - '  In this paper we propose a multi-task linear classifier learning problem

    called D-SVM (Dictionary SVM). D-SVM uses a dictionary of parameter covariance

    shared by all tasks to do multi-task knowledge transfer among different tasks.

    We formally define the learning problem of D-SVM and show two interpretations

    of this problem, from both the probabilistic and kernel perspectives. From the

    probabilistic perspective, we show that our learning formulation is actually a

    MAP estimation on all optimization variables. We also show its equivalence to
    a

    multiple kernel learning problem in which one is trying to find a re-weighting

    kernel for features from a dictionary of basis (despite the fact that only

    linear classifiers are learned). Finally, we describe an alternative

    optimization scheme to minimize the objective function and present empirical

    studies to valid our algorithm.

    '
- source_sentence: "A Game-theoretic Machine Learning Approach for Revenue Maximization\
    \ in\n  Sponsored Search"
  sentences:
  - '  A learning algorithm based on primary school teaching and learning is

    presented. The methodology is to continuously evaluate a student and to give

    them training on the examples for which they repeatedly fail, until, they can

    correctly answer all types of questions. This incremental learning procedure

    produces better learning curves by demanding the student to optimally dedicate

    their learning time on the failed examples. When used in machine learning, the

    algorithm is found to train a machine on a data with maximum variance in the

    feature space so that the generalization ability of the network improves. The

    algorithm has interesting applications in data mining, model evaluations and

    rare objects discovery.

    '
  - '  In this paper we extend temporal difference policy evaluation algorithms to

    performance criteria that include the variance of the cumulative reward. Such

    criteria are useful for risk management, and are important in domains such as

    finance and process control. We propose both TD(0) and LSTD(lambda) variants

    with linear function approximation, prove their convergence, and demonstrate

    their utility in a 4-dimensional continuous state space problem.

    '
  - '  Sponsored search is an important monetization channel for search engines, in

    which an auction mechanism is used to select the ads shown to users and

    determine the prices charged from advertisers. There have been several pieces

    of work in the literature that investigate how to design an auction mechanism

    in order to optimize the revenue of the search engine. However, due to some

    unrealistic assumptions used, the practical values of these studies are not

    very clear. In this paper, we propose a novel \emph{game-theoretic machine

    learning} approach, which naturally combines machine learning and game theory,

    and learns the auction mechanism using a bilevel optimization framework. In

    particular, we first learn a Markov model from historical data to describe how

    advertisers change their bids in response to an auction mechanism, and then for

    any given auction mechanism, we use the learnt model to predict its

    corresponding future bid sequences. Next we learn the auction mechanism through

    empirical revenue maximization on the predicted bid sequences. We show that the

    empirical revenue will converge when the prediction period approaches infinity,

    and a Genetic Programming algorithm can effectively optimize this empirical

    revenue. Our experiments indicate that the proposed approach is able to produce

    a much more effective auction mechanism than several baselines.

    '
- source_sentence: Normalized Online Learning
  sentences:
  - "  The Frank-Wolfe method (a.k.a. conditional gradient algorithm) for smooth\n\
    optimization has regained much interest in recent years in the context of large\n\
    scale optimization and machine learning. A key advantage of the method is that\n\
    it avoids projections - the computational bottleneck in many applications -\n\
    replacing it by a linear optimization step. Despite this advantage, the known\n\
    convergence rates of the FW method fall behind standard first order methods for\n\
    most settings of interest. It is an active line of research to derive faster\n\
    linear optimization-based algorithms for various settings of convex\noptimization.\n\
    \  In this paper we consider the special case of optimization over strongly\n\
    convex sets, for which we prove that the vanila FW method converges at a rate\n\
    of $\\frac{1}{t^2}$. This gives a quadratic improvement in convergence rate\n\
    compared to the general case, in which convergence is of the order\n$\\frac{1}{t}$,\
    \ and known to be tight. We show that various balls induced by\n$\\ell_p$ norms,\
    \ Schatten norms and group norms are strongly convex on one hand\nand on the other\
    \ hand, linear optimization over these sets is straightforward\nand admits a closed-form\
    \ solution. We further show how several previous\nfast-rate results for the FW\
    \ method follow easily from our analysis.\n"
  - '  We introduce online learning algorithms which are independent of feature

    scales, proving regret bounds dependent on the ratio of scales existent in the

    data rather than the absolute scale. This has several useful effects: there is

    no need to pre-normalize data, the test-time and test-space complexity are

    reduced, and the algorithms are more robust.

    '
  - '  In order to achieve high efficiency of classification in intrusion detection,

    a compressed model is proposed in this paper which combines horizontal

    compression with vertical compression. OneR is utilized as horizontal

    com-pression for attribute reduction, and affinity propagation is employed as

    vertical compression to select small representative exemplars from large

    training data. As to be able to computationally compress the larger volume of

    training data with scalability, MapReduce based parallelization approach is

    then implemented and evaluated for each step of the model compression process

    abovementioned, on which common but efficient classification methods can be

    directly used. Experimental application study on two publicly available

    datasets of intrusion detection, KDD99 and CMDC2012, demonstrates that the

    classification using the compressed model proposed can effectively speed up the

    detection procedure at up to 184 times, most importantly at the cost of a

    minimal accuracy difference with less than 1% on average.

    '
- source_sentence: Bounds on the Bethe Free Energy for Gaussian Networks
  sentences:
  - '  We extend the Bayesian Information Criterion (BIC), an asymptotic

    approximation for the marginal likelihood, to Bayesian networks with hidden

    variables. This approximation can be used to select models given large samples

    of data. The standard BIC as well as our extension punishes the complexity of
    a

    model according to the dimension of its parameters. We argue that the dimension

    of a Bayesian network with hidden variables is the rank of the Jacobian matrix

    of the transformation between the parameters of the network and the parameters

    of the observable variables. We compute the dimensions of several networks

    including the naive Bayes model with a hidden root node.

    '
  - '  Complex networks refer to large-scale graphs with nontrivial connection

    patterns. The salient and interesting features that the complex network study

    offer in comparison to graph theory are the emphasis on the dynamical

    properties of the networks and the ability of inherently uncovering pattern

    formation of the vertices. In this paper, we present a hybrid data

    classification technique combining a low level and a high level classifier. The

    low level term can be equipped with any traditional classification techniques,

    which realize the classification task considering only physical features (e.g.,

    geometrical or statistical features) of the input data. On the other hand, the

    high level term has the ability of detecting data patterns with semantic

    meanings. In this way, the classification is realized by means of the

    extraction of the underlying network''s features constructed from the input

    data. As a result, the high level classification process measures the

    compliance of the test instances with the pattern formation of the training

    data. Out of various high level perspectives that can be utilized to capture

    semantic meaning, we utilize the dynamical features that are generated from a

    tourist walker in a networked environment. Specifically, a weighted combination

    of transient and cycle lengths generated by the tourist walk is employed for

    that end. Interestingly, our study shows that the proposed technique is able to

    further improve the already optimized performance of traditional classification

    techniques.

    '
  - '  We address the problem of computing approximate marginals in Gaussian

    probabilistic models by using mean field and fractional Bethe approximations.

    As an extension of Welling and Teh (2001), we define the Gaussian fractional

    Bethe free energy in terms of the moment parameters of the approximate

    marginals and derive an upper and lower bound for it. We give necessary

    conditions for the Gaussian fractional Bethe free energies to be bounded from

    below. It turns out that the bounding condition is the same as the pairwise

    normalizability condition derived by Malioutov et al. (2006) as a sufficient

    condition for the convergence of the message passing algorithm. By giving a

    counterexample, we disprove the conjecture in Welling and Teh (2001): even when

    the Bethe free energy is not bounded from below, it can possess a local minimum

    to which the minimization algorithms can converge.

    '
- source_sentence: Multi-Armed Bandits in Metric Spaces
  sentences:
  - '  The paper presents a FrameNet-based information extraction and knowledge

    representation framework, called FrameNet-CNL. The framework is used on natural

    language documents and represents the extracted knowledge in a tailor-made

    Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be

    generated automatically in multiple languages. This approach brings together

    the fields of information extraction and CNL, because a source text can be

    considered belonging to FrameNet-CNL, if information extraction parser produces

    the correct knowledge representation as a result. We describe a

    state-of-the-art information extraction parser used by a national news agency

    and speculate that FrameNet-CNL eventually could shape the natural language

    subset used for writing the newswire articles.

    '
  - '  Applications such as face recognition that deal with high-dimensional data

    need a mapping technique that introduces representation of low-dimensional

    features with enhanced discriminatory power and a proper classifier, able to

    classify those complex features. Most of traditional Linear Discriminant

    Analysis suffer from the disadvantage that their optimality criteria are not

    directly related to the classification ability of the obtained feature

    representation. Moreover, their classification accuracy is affected by the

    "small sample size" problem which is often encountered in FR tasks. In this

    short paper, we combine nonlinear kernel based mapping of data called KDDA with

    Support Vector machine classifier to deal with both of the shortcomings in an

    efficient and cost effective manner. The proposed here method is compared, in

    terms of classification accuracy, to other commonly used FR methods on UMIST

    face database. Results indicate that the performance of the proposed method is

    overall superior to those of traditional FR approaches, such as the Eigenfaces,

    Fisherfaces, and D-LDA methods and traditional linear classifiers.

    '
  - '  In a multi-armed bandit problem, an online algorithm chooses from a set of

    strategies in a sequence of trials so as to maximize the total payoff of the

    chosen strategies. While the performance of bandit algorithms with a small

    finite strategy set is quite well understood, bandit problems with large

    strategy sets are still a topic of very active investigation, motivated by

    practical applications such as online auctions and web advertisement. The goal

    of such research is to identify broad and natural classes of strategy sets and

    payoff functions which enable the design of efficient solutions. In this work

    we study a very general setting for the multi-armed bandit problem in which the

    strategies form a metric space, and the payoff function satisfies a Lipschitz

    condition with respect to the metric. We refer to this problem as the

    "Lipschitz MAB problem". We present a complete solution for the multi-armed

    problem in this setting. That is, for every metric space (L,X) we define an

    isometry invariant which bounds from below the performance of Lipschitz MAB

    algorithms for X, and we present an algorithm which comes arbitrarily close to

    meeting this bound. Furthermore, our technique gives even better results for

    benign payoff functions.

    '
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on lufercho/my-finetuned-bert-mlm

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm) <!-- at revision 8cf44893fd607477d06b067f1788b495abac1b2c -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
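
The `Pooling` module above enables only `pooling_mode_mean_tokens`, so a sentence embedding is the average of its token embeddings. The following is a minimal, dependency-free sketch of masked mean pooling, not the library's implementation. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions; the real model works on 768-dimensional token vectors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy input (hypothetical values): three token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded position is excluded from both the sum and the count, so its values do not affect the result.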

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer")
# Run inference
sentences = [
    'Multi-Armed Bandits in Metric Spaces',
    '  In a multi-armed bandit problem, an online algorithm chooses from a set of\nstrategies in a sequence of trials so as to maximize the total payoff of the\nchosen strategies. While the performance of bandit algorithms with a small\nfinite strategy set is quite well understood, bandit problems with large\nstrategy sets are still a topic of very active investigation, motivated by\npractical applications such as online auctions and web advertisement. The goal\nof such research is to identify broad and natural classes of strategy sets and\npayoff functions which enable the design of efficient solutions. In this work\nwe study a very general setting for the multi-armed bandit problem in which the\nstrategies form a metric space, and the payoff function satisfies a Lipschitz\ncondition with respect to the metric. We refer to this problem as the\n"Lipschitz MAB problem". We present a complete solution for the multi-armed\nproblem in this setting. That is, for every metric space (L,X) we define an\nisometry invariant which bounds from below the performance of Lipschitz MAB\nalgorithms for X, and we present an algorithm which comes arbitrarily close to\nmeeting this bound. Furthermore, our technique gives even better results for\nbenign payoff functions.\n',
    '  Applications such as face recognition that deal with high-dimensional data\nneed a mapping technique that introduces representation of low-dimensional\nfeatures with enhanced discriminatory power and a proper classifier, able to\nclassify those complex features. Most of traditional Linear Discriminant\nAnalysis suffer from the disadvantage that their optimality criteria are not\ndirectly related to the classification ability of the obtained feature\nrepresentation. Moreover, their classification accuracy is affected by the\n"small sample size" problem which is often encountered in FR tasks. In this\nshort paper, we combine nonlinear kernel based mapping of data called KDDA with\nSupport Vector machine classifier to deal with both of the shortcomings in an\nefficient and cost effective manner. The proposed here method is compared, in\nterms of classification accuracy, to other commonly used FR methods on UMIST\nface database. Results indicate that the performance of the proposed method is\noverall superior to those of traditional FR approaches, such as the Eigenfaces,\nFisherfaces, and D-LDA methods and traditional linear classifiers.\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
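
Since this card lists cosine similarity as the similarity function, the scores can be read as cosine similarities between embeddings. The sketch below shows, without any dependencies, how such scores could be used to rank candidate abstracts for a title. The helper names `cosine_similarity` and `rank_candidates` and the 3-dimensional vectors are hypothetical stand-ins; in practice the inputs would be the 768-dimensional outputs of `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for one title embedding and two abstract embeddings.
title_vec = [1.0, 0.0, 1.0]
abstract_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0]]
ranking = rank_candidates(title_vec, abstract_vecs)
print(ranking[0][0])  # index of the closest abstract
```

Cosine similarity ignores vector length, which is why the second candidate (a scaled copy of the query) ranks first with a score of about 1.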

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset


* Size: 5,000 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 1,000 samples:
  |         | sentence_0                                                                        | sentence_1                                                                           |
  |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                               |
  | details | <ul><li>min: 4 tokens</li><li>mean: 13.29 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 26 tokens</li><li>mean: 202.49 tokens</li><li>max: 506 tokens</li></ul> |
* Samples:
  | sentence_0                                                               | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
  |:-------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>Validation of nonlinear PCA</code>                                 | <code>  Linear principal component analysis (PCA) can be extended to a nonlinear PCA<br>by using artificial neural networks. But the benefit of curved components<br>requires a careful control of the model complexity. Moreover, standard<br>techniques for model selection, including cross-validation and more generally<br>the use of an independent test set, fail when applied to nonlinear PCA because<br>of its inherent unsupervised characteristics. This paper presents a new<br>approach for validating the complexity of nonlinear PCA models by using the<br>error in missing data estimation as a criterion for model selection. It is<br>motivated by the idea that only the model of optimal complexity is able to<br>predict missing values with the highest accuracy. While standard test set<br>validation usually favours over-fitted nonlinear PCA models, the proposed model<br>validation approach correctly selects the optimal model complexity.<br></code>                                                                                                       |
  | <code>Learning Attitudes and Attributes from Multi-Aspect Reviews</code> | <code>  The majority of online reviews consist of plain-text feedback together with a<br>single numeric score. However, there are multiple dimensions to products and<br>opinions, and understanding the `aspects' that contribute to users' ratings may<br>help us to better understand their individual preferences. For example, a<br>user's impression of an audiobook presumably depends on aspects such as the<br>story and the narrator, and knowing their opinions on these aspects may help us<br>to recommend better products. In this paper, we build models for rating systems<br>in which such dimensions are explicit, in the sense that users leave separate<br>ratings for each aspect of a product. By introducing new corpora consisting of<br>five million reviews, rated with between three and six aspects, we evaluate our<br>models on three prediction tasks: First, we use our model to uncover which<br>parts of a review discuss which of the rated aspects. Second, we use our model<br>to summarize reviews, which for us means finding the sentences...</code> |
  | <code>Bayesian Differential Privacy through Posterior Sampling</code>    | <code>  Differential privacy formalises privacy-preserving mechanisms that provide<br>access to a database. We pose the question of whether Bayesian inference itself<br>can be used directly to provide private access to data, with no modification.<br>The answer is affirmative: under certain conditions on the prior, sampling from<br>the posterior distribution can be used to achieve a desired level of privacy<br>and utility. To do so, we generalise differential privacy to arbitrary dataset<br>metrics, outcome spaces and distribution families. This allows us to also deal<br>with non-i.i.d or non-tabular datasets. We prove bounds on the sensitivity of<br>the posterior to the data, which gives a measure of robustness. We also show<br>how to use posterior sampling to provide differentially private responses to<br>queries, within a decision-theoretic framework. Finally, we provide bounds on<br>the utility and on the distinguishability of datasets. The latter are<br>complemented by a novel use of Le Cam's method to obtain lower bounds....</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
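
To make the two parameters concrete, here is a minimal sketch of the idea behind this loss, under the assumption that each `sentence_0` is paired with its own `sentence_1` and every other `sentence_1` in the batch acts as a negative. Similarities are multiplied by `scale` and fed to a cross-entropy over the batch. The function names `cos_sim` and `mnr_loss` and the toy 2-dimensional vectors are illustrative; this is not the library's code.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all in-batch candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy "embeddings" (hypothetical values) for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(mnr_loss(anchors, aligned))  # close to 0: each true pair is ranked first
print(mnr_loss(anchors, swapped))  # close to 20: each true pair is ranked last
```

A larger `scale` sharpens the softmax, so a fixed gap in cosine similarity between the true pair and the in-batch negatives produces a larger change in loss.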

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `num_train_epochs`: 2
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 2
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>

### Training Logs
| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 1.5974 | 500  | 0.3039        |
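
The logged row can be cross-checked against the hyperparameters listed above. The sketch below is a worked calculation, not output from the training run: it assumes a single training device and that the linear schedule decays to zero with no warmup, which is how `lr_scheduler_type: linear` with `warmup_steps: 0` is commonly understood. The variable names are illustrative.

```python
import math

# Values copied from this card; a single training device is an assumption.
num_samples = 5000
batch_size = 16      # per_device_train_batch_size
num_epochs = 2       # num_train_epochs
base_lr = 5e-05      # learning_rate
logged_step = 500

# dataloader_drop_last is False, so the final partial batch counts as a step.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
epoch_at_log = logged_step / steps_per_epoch

# Linear decay to zero with no warmup (assumed schedule shape).
lr_at_log = base_lr * (total_steps - logged_step) / total_steps

print(steps_per_epoch, total_steps)  # 313 626
print(round(epoch_at_log, 4))        # 1.5974
print(lr_at_log)
```

The computed epoch value matches the `Epoch` column of the table, which suggests the run took 626 optimizer steps in total and that the single logged loss was recorded about 80% of the way through training.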


### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.3.1
- Transformers: 4.46.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
