metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:5000
  - loss:MultipleNegativesRankingLoss
base_model: lufercho/my-finetuned-bert-mlm
widget:
  - source_sentence: |-
      A Comprehensive Approach to Universal Piecewise Nonlinear Regression
        Based on Trees
    sentences:
      - >2
          In sparse recovery we are given a matrix $A$ (the dictionary) and a vector of
        the form $A X$ where $X$ is sparse, and the goal is to recover $X$. This
        is a

        central notion in signal processing, statistics and machine learning.
        But in

        applications such as sparse coding, edge detection, compression and
        super

        resolution, the dictionary $A$ is unknown and has to be learned from
        random

        examples of the form $Y = AX$ where $X$ is drawn from an appropriate

        distribution --- this is the dictionary learning problem. In most
        settings, $A$

        is overcomplete: it has more columns than rows. This paper presents a

        polynomial-time algorithm for learning overcomplete dictionaries; the
        only

        previously known algorithm with provable guarantees is the recent work
        of

        Spielman, Wang and Wright who gave an algorithm for the full-rank case,
        which

        is rarely the case in applications. Our algorithm applies to incoherent

        dictionaries which have been a central object of study since they were

        introduced in seminal work of Donoho and Huo. In particular, a
        dictionary is

        $\mu$-incoherent if each pair of columns has inner product at most $\mu
        /

        \sqrt{n}$.
          The algorithm makes natural stochastic assumptions about the unknown sparse
        vector $X$, which can contain $k \leq c \min(\sqrt{n}/\mu \log n, m^{1/2

        -\eta})$ non-zero entries (for any $\eta > 0$). This is close to the
        best $k$

        allowable by the best sparse recovery algorithms even if one knows the

        dictionary $A$ exactly. Moreover, both the running time and sample
        complexity

        depend on $\log 1/\epsilon$, where $\epsilon$ is the target accuracy,
        and so

        our algorithms converge very quickly to the true dictionary. Our
        algorithm can

        also tolerate substantial amounts of noise provided it is incoherent
        with

        respect to the dictionary (e.g., Gaussian). In the noisy setting, our
        running

        time and sample complexity depend polynomially on $1/\epsilon$, and this
        is

        necessary.
      - >2
          In this paper, we investigate adaptive nonlinear regression and introduce
        tree based piecewise linear regression algorithms that are highly
        efficient and

        provide significantly improved performance with guaranteed upper bounds
        in an

        individual sequence manner. We use a tree notion in order to partition
        the

        space of regressors in a nested structure. The introduced algorithms
        adapt not

        only their regression functions but also the complete tree structure
        while

        achieving the performance of the "best" linear mixture of a doubly
        exponential

        number of partitions, with a computational complexity only polynomial in
        the

        number of nodes of the tree. While constructing these algorithms, we
        also avoid

        using any artificial "weighting" of models (with highly data dependent

        parameters) and, instead, directly minimize the final regression error,
        which

        is the ultimate performance goal. The introduced methods are generic
        such that

        they can readily incorporate different tree construction methods such as
        random

        trees in their framework and can use different regressor or partitioning

        functions as demonstrated in the paper.
      - >2
          In this paper we propose a multi-task linear classifier learning problem
        called D-SVM (Dictionary SVM). D-SVM uses a dictionary of parameter
        covariance

        shared by all tasks to do multi-task knowledge transfer among different
        tasks.

        We formally define the learning problem of D-SVM and show two
        interpretations

        of this problem, from both the probabilistic and kernel perspectives.
        From the

        probabilistic perspective, we show that our learning formulation is
        actually a

        MAP estimation on all optimization variables. We also show its
        equivalence to a

        multiple kernel learning problem in which one is trying to find a
        re-weighting

        kernel for features from a dictionary of basis (despite the fact that
        only

        linear classifiers are learned). Finally, we describe an alternative

        optimization scheme to minimize the objective function and present
        empirical

        studies to valid our algorithm.
  - source_sentence: |-
      A Game-theoretic Machine Learning Approach for Revenue Maximization in
        Sponsored Search
    sentences:
      - >2
          A learning algorithm based on primary school teaching and learning is
        presented. The methodology is to continuously evaluate a student and to
        give

        them training on the examples for which they repeatedly fail, until,
        they can

        correctly answer all types of questions. This incremental learning
        procedure

        produces better learning curves by demanding the student to optimally
        dedicate

        their learning time on the failed examples. When used in machine
        learning, the

        algorithm is found to train a machine on a data with maximum variance in
        the

        feature space so that the generalization ability of the network
        improves. The

        algorithm has interesting applications in data mining, model evaluations
        and

        rare objects discovery.
      - >2
          In this paper we extend temporal difference policy evaluation algorithms to
        performance criteria that include the variance of the cumulative reward.
        Such

        criteria are useful for risk management, and are important in domains
        such as

        finance and process control. We propose both TD(0) and LSTD(lambda)
        variants

        with linear function approximation, prove their convergence, and
        demonstrate

        their utility in a 4-dimensional continuous state space problem.
      - >2
          Sponsored search is an important monetization channel for search engines, in
        which an auction mechanism is used to select the ads shown to users and

        determine the prices charged from advertisers. There have been several
        pieces

        of work in the literature that investigate how to design an auction
        mechanism

        in order to optimize the revenue of the search engine. However, due to
        some

        unrealistic assumptions used, the practical values of these studies are
        not

        very clear. In this paper, we propose a novel \emph{game-theoretic
        machine

        learning} approach, which naturally combines machine learning and game
        theory,

        and learns the auction mechanism using a bilevel optimization framework.
        In

        particular, we first learn a Markov model from historical data to
        describe how

        advertisers change their bids in response to an auction mechanism, and
        then for

        any given auction mechanism, we use the learnt model to predict its

        corresponding future bid sequences. Next we learn the auction mechanism
        through

        empirical revenue maximization on the predicted bid sequences. We show
        that the

        empirical revenue will converge when the prediction period approaches
        infinity,

        and a Genetic Programming algorithm can effectively optimize this
        empirical

        revenue. Our experiments indicate that the proposed approach is able to
        produce

        a much more effective auction mechanism than several baselines.
  - source_sentence: Normalized Online Learning
    sentences:
      - >2
          The Frank-Wolfe method (a.k.a. conditional gradient algorithm) for smooth
        optimization has regained much interest in recent years in the context
        of large

        scale optimization and machine learning. A key advantage of the method
        is that

        it avoids projections - the computational bottleneck in many
        applications -

        replacing it by a linear optimization step. Despite this advantage, the
        known

        convergence rates of the FW method fall behind standard first order
        methods for

        most settings of interest. It is an active line of research to derive
        faster

        linear optimization-based algorithms for various settings of convex

        optimization.
          In this paper we consider the special case of optimization over strongly
        convex sets, for which we prove that the vanila FW method converges at a
        rate

        of $\frac{1}{t^2}$. This gives a quadratic improvement in convergence
        rate

        compared to the general case, in which convergence is of the order

        $\frac{1}{t}$, and known to be tight. We show that various balls induced
        by

        $\ell_p$ norms, Schatten norms and group norms are strongly convex on
        one hand

        and on the other hand, linear optimization over these sets is
        straightforward

        and admits a closed-form solution. We further show how several previous

        fast-rate results for the FW method follow easily from our analysis.
      - >2
          We introduce online learning algorithms which are independent of feature
        scales, proving regret bounds dependent on the ratio of scales existent
        in the

        data rather than the absolute scale. This has several useful effects:
        there is

        no need to pre-normalize data, the test-time and test-space complexity
        are

        reduced, and the algorithms are more robust.
      - >2
          In order to achieve high efficiency of classification in intrusion detection,
        a compressed model is proposed in this paper which combines horizontal

        compression with vertical compression. OneR is utilized as horizontal

        com-pression for attribute reduction, and affinity propagation is
        employed as

        vertical compression to select small representative exemplars from large

        training data. As to be able to computationally compress the larger
        volume of

        training data with scalability, MapReduce based parallelization approach
        is

        then implemented and evaluated for each step of the model compression
        process

        abovementioned, on which common but efficient classification methods can
        be

        directly used. Experimental application study on two publicly available

        datasets of intrusion detection, KDD99 and CMDC2012, demonstrates that
        the

        classification using the compressed model proposed can effectively speed
        up the

        detection procedure at up to 184 times, most importantly at the cost of
        a

        minimal accuracy difference with less than 1% on average.
  - source_sentence: Bounds on the Bethe Free Energy for Gaussian Networks
    sentences:
      - >2
          We extend the Bayesian Information Criterion (BIC), an asymptotic
        approximation for the marginal likelihood, to Bayesian networks with
        hidden

        variables. This approximation can be used to select models given large
        samples

        of data. The standard BIC as well as our extension punishes the
        complexity of a

        model according to the dimension of its parameters. We argue that the
        dimension

        of a Bayesian network with hidden variables is the rank of the Jacobian
        matrix

        of the transformation between the parameters of the network and the
        parameters

        of the observable variables. We compute the dimensions of several
        networks

        including the naive Bayes model with a hidden root node.
      - >2
          Complex networks refer to large-scale graphs with nontrivial connection
        patterns. The salient and interesting features that the complex network
        study

        offer in comparison to graph theory are the emphasis on the dynamical

        properties of the networks and the ability of inherently uncovering
        pattern

        formation of the vertices. In this paper, we present a hybrid data

        classification technique combining a low level and a high level
        classifier. The

        low level term can be equipped with any traditional classification
        techniques,

        which realize the classification task considering only physical features
        (e.g.,

        geometrical or statistical features) of the input data. On the other
        hand, the

        high level term has the ability of detecting data patterns with semantic

        meanings. In this way, the classification is realized by means of the

        extraction of the underlying network's features constructed from the
        input

        data. As a result, the high level classification process measures the

        compliance of the test instances with the pattern formation of the
        training

        data. Out of various high level perspectives that can be utilized to
        capture

        semantic meaning, we utilize the dynamical features that are generated
        from a

        tourist walker in a networked environment. Specifically, a weighted
        combination

        of transient and cycle lengths generated by the tourist walk is employed
        for

        that end. Interestingly, our study shows that the proposed technique is
        able to

        further improve the already optimized performance of traditional
        classification

        techniques.
      - >2
          We address the problem of computing approximate marginals in Gaussian
        probabilistic models by using mean field and fractional Bethe
        approximations.

        As an extension of Welling and Teh (2001), we define the Gaussian
        fractional

        Bethe free energy in terms of the moment parameters of the approximate

        marginals and derive an upper and lower bound for it. We give necessary

        conditions for the Gaussian fractional Bethe free energies to be bounded
        from

        below. It turns out that the bounding condition is the same as the
        pairwise

        normalizability condition derived by Malioutov et al. (2006) as a
        sufficient

        condition for the convergence of the message passing algorithm. By
        giving a

        counterexample, we disprove the conjecture in Welling and Teh (2001):
        even when

        the Bethe free energy is not bounded from below, it can possess a local
        minimum

        to which the minimization algorithms can converge.
  - source_sentence: Multi-Armed Bandits in Metric Spaces
    sentences:
      - >2
          The paper presents a FrameNet-based information extraction and knowledge
        representation framework, called FrameNet-CNL. The framework is used on
        natural

        language documents and represents the extracted knowledge in a
        tailor-made

        Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can
        be

        generated automatically in multiple languages. This approach brings
        together

        the fields of information extraction and CNL, because a source text can
        be

        considered belonging to FrameNet-CNL, if information extraction parser
        produces

        the correct knowledge representation as a result. We describe a

        state-of-the-art information extraction parser used by a national news
        agency

        and speculate that FrameNet-CNL eventually could shape the natural
        language

        subset used for writing the newswire articles.
      - >2
          Applications such as face recognition that deal with high-dimensional data
        need a mapping technique that introduces representation of
        low-dimensional

        features with enhanced discriminatory power and a proper classifier,
        able to

        classify those complex features. Most of traditional Linear Discriminant

        Analysis suffer from the disadvantage that their optimality criteria are
        not

        directly related to the classification ability of the obtained feature

        representation. Moreover, their classification accuracy is affected by
        the

        "small sample size" problem which is often encountered in FR tasks. In
        this

        short paper, we combine nonlinear kernel based mapping of data called
        KDDA with

        Support Vector machine classifier to deal with both of the shortcomings
        in an

        efficient and cost effective manner. The proposed here method is
        compared, in

        terms of classification accuracy, to other commonly used FR methods on
        UMIST

        face database. Results indicate that the performance of the proposed
        method is

        overall superior to those of traditional FR approaches, such as the
        Eigenfaces,

        Fisherfaces, and D-LDA methods and traditional linear classifiers.
      - >2
          In a multi-armed bandit problem, an online algorithm chooses from a set of
        strategies in a sequence of trials so as to maximize the total payoff of
        the

        chosen strategies. While the performance of bandit algorithms with a
        small

        finite strategy set is quite well understood, bandit problems with large

        strategy sets are still a topic of very active investigation, motivated
        by

        practical applications such as online auctions and web advertisement.
        The goal

        of such research is to identify broad and natural classes of strategy
        sets and

        payoff functions which enable the design of efficient solutions. In this
        work

        we study a very general setting for the multi-armed bandit problem in
        which the

        strategies form a metric space, and the payoff function satisfies a
        Lipschitz

        condition with respect to the metric. We refer to this problem as the

        "Lipschitz MAB problem". We present a complete solution for the
        multi-armed

        problem in this setting. That is, for every metric space (L,X) we define
        an

        isometry invariant which bounds from below the performance of Lipschitz
        MAB

        algorithms for X, and we present an algorithm which comes arbitrarily
        close to

        meeting this bound. Furthermore, our technique gives even better results
        for

        benign payoff functions.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on lufercho/my-finetuned-bert-mlm

This is a sentence-transformers model finetuned from lufercho/my-finetuned-bert-mlm. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: lufercho/my-finetuned-bert-mlm
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
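
The stack above is a standard BERT encoder followed by mean pooling over token embeddings. As a rough illustration of what that pooling step computes, the sketch below reproduces the embedding with the transformers library directly. The mean_pooling helper is written here for illustration and is not part of this repository, and loading the repository with AutoModel assumes the transformer weights sit at the repository root, as sentence-transformers normally stores them.

import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padded positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("lufercho/AxvBert-Sentente-Transformer")
model = AutoModel.from_pretrained("lufercho/AxvBert-Sentente-Transformer")

sentences = ["Multi-Armed Bandits in Metric Spaces", "Normalized Online Learning"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)
embeddings = mean_pooling(output, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([2, 768])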

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer")
# Run inference
sentences = [
    'Multi-Armed Bandits in Metric Spaces',
    '  In a multi-armed bandit problem, an online algorithm chooses from a set of\nstrategies in a sequence of trials so as to maximize the total payoff of the\nchosen strategies. While the performance of bandit algorithms with a small\nfinite strategy set is quite well understood, bandit problems with large\nstrategy sets are still a topic of very active investigation, motivated by\npractical applications such as online auctions and web advertisement. The goal\nof such research is to identify broad and natural classes of strategy sets and\npayoff functions which enable the design of efficient solutions. In this work\nwe study a very general setting for the multi-armed bandit problem in which the\nstrategies form a metric space, and the payoff function satisfies a Lipschitz\ncondition with respect to the metric. We refer to this problem as the\n"Lipschitz MAB problem". We present a complete solution for the multi-armed\nproblem in this setting. That is, for every metric space (L,X) we define an\nisometry invariant which bounds from below the performance of Lipschitz MAB\nalgorithms for X, and we present an algorithm which comes arbitrarily close to\nmeeting this bound. Furthermore, our technique gives even better results for\nbenign payoff functions.\n',
    '  Applications such as face recognition that deal with high-dimensional data\nneed a mapping technique that introduces representation of low-dimensional\nfeatures with enhanced discriminatory power and a proper classifier, able to\nclassify those complex features. Most of traditional Linear Discriminant\nAnalysis suffer from the disadvantage that their optimality criteria are not\ndirectly related to the classification ability of the obtained feature\nrepresentation. Moreover, their classification accuracy is affected by the\n"small sample size" problem which is often encountered in FR tasks. In this\nshort paper, we combine nonlinear kernel based mapping of data called KDDA with\nSupport Vector machine classifier to deal with both of the shortcomings in an\nefficient and cost effective manner. The proposed here method is compared, in\nterms of classification accuracy, to other commonly used FR methods on UMIST\nface database. Results indicate that the performance of the proposed method is\noverall superior to those of traditional FR approaches, such as the Eigenfaces,\nFisherfaces, and D-LDA methods and traditional linear classifiers.\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
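
Since the model exposes cosine similarity directly, a small semantic-search loop follows naturally from the snippet above. The corpus and query strings below are illustrative only and are not part of the training data.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer")

# Illustrative corpus of paper titles
corpus = [
    "Multi-Armed Bandits in Metric Spaces",
    "Normalized Online Learning",
    "Bounds on the Bethe Free Energy for Gaussian Networks",
]
query = "regret bounds for bandit algorithms"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# similarity() returns a (1, len(corpus)) tensor of cosine scores
scores = model.similarity(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))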

Training Details

Training Dataset

Unnamed Dataset

  • Size: 5,000 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    sentence_0: string; min: 4 tokens, mean: 13.29 tokens, max: 56 tokens
    sentence_1: string; min: 26 tokens, mean: 202.49 tokens, max: 506 tokens
  • Samples:
    sentence_0: Validation of nonlinear PCA
    sentence_1: Linear principal component analysis (PCA) can be extended to a nonlinear PCA
      by using artificial neural networks. But the benefit of curved components requires a
      careful control of the model complexity. Moreover, standard techniques for model
      selection, including cross-validation and more generally the use of an independent test
      set, fail when applied to nonlinear PCA because of its inherent unsupervised
      characteristics. This paper presents a new approach for validating the complexity of
      nonlinear PCA models by using the error in missing data estimation as a criterion for
      model selection. It is motivated by the idea that only the model of optimal complexity
      is able to predict missing values with the highest accuracy. While standard test set
      validation usually favours over-fitted nonlinear PCA models, the proposed model
      validation approach correctly selects the optimal model complexity.

    sentence_0: Learning Attitudes and Attributes from Multi-Aspect Reviews
    sentence_1: The majority of online reviews consist of plain-text feedback together with a
      single numeric score. However, there are multiple dimensions to products and opinions,
      and understanding the `aspects' that contribute to users' ratings may help us to better
      understand their individual preferences. For example, a user's impression of an
      audiobook presumably depends on aspects such as the story and the narrator, and knowing
      their opinions on these aspects may help us to recommend better products. In this paper,
      we build models for rating systems in which such dimensions are explicit, in the sense
      that users leave separate ratings for each aspect of a product. By introducing new
      corpora consisting of five million reviews, rated with between three and six aspects, we
      evaluate our models on three prediction tasks: First, we use our model to uncover which
      parts of a review discuss which of the rated aspects. Second, we use our model to
      summarize reviews, which for us means finding the sentences...

    sentence_0: Bayesian Differential Privacy through Posterior Sampling
    sentence_1: Differential privacy formalises privacy-preserving mechanisms that provide
      access to a database. We pose the question of whether Bayesian inference itself can be
      used directly to provide private access to data, with no modification. The answer is
      affirmative: under certain conditions on the prior, sampling from the posterior
      distribution can be used to achieve a desired level of privacy and utility. To do so, we
      generalise differential privacy to arbitrary dataset metrics, outcome spaces and
      distribution families. This allows us to also deal with non-i.i.d or non-tabular
      datasets. We prove bounds on the sensitivity of the posterior to the data, which gives a
      measure of robustness. We also show how to use posterior sampling to provide
      differentially private responses to queries, within a decision-theoretic framework.
      Finally, we provide bounds on the utility and on the distinguishability of datasets. The
      latter are complemented by a novel use of Le Cam's method to obtain lower bounds....
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 2
  • multi_dataset_batch_sampler: round_robin
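
For reference, these non-default values together with the MultipleNegativesRankingLoss settings listed above correspond roughly to the training setup sketched below. This is a hedged reconstruction, not the original training script: the one-row dataset is a placeholder standing in for the 5,000 (sentence_0, sentence_1) pairs, and the output directory name is an assumption.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from the base model named in the Model Description
model = SentenceTransformer("lufercho/my-finetuned-bert-mlm")

# Placeholder for the real 5,000-pair dataset of title/abstract-style columns
train_dataset = Dataset.from_dict({
    "sentence_0": ["Multi-Armed Bandits in Metric Spaces"],
    "sentence_1": ["In a multi-armed bandit problem, an online algorithm chooses from a set of strategies ..."],
})

loss = MultipleNegativesRankingLoss(model)  # defaults: scale=20.0, similarity_fct=cos_sim

args = SentenceTransformerTrainingArguments(
    output_dir="AxvBert-Sentente-Transformer",  # assumed output directory name
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()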

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
1.5974 500 0.3039

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.46.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}