|
--- |
|
base_model: google/pegasus-x-base |
|
tags: |
|
- generated_from_trainer |
|
datasets: |
|
- arxiv-summarization |
|
|
|
widget: |
|
- text: >- |
|
|
|
[Abstract] The dominant sequence transduction models are based on complex |
|
recurrent or convolutional neural networks in an encoder-decoder |
|
configuration. The best performing models also connect the encoder and |
|
decoder through an attention mechanism. We propose a new simple network |
|
architecture, the Transformer, based solely on attention mechanisms, |
|
dispensing with recurrence and convolutions entirely. Experiments on two |
|
machine translation tasks show these models to be superior in quality while |
|
being more parallelizable and requiring significantly less time to train. |
|
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation |
|
task, improving over the existing best results, including ensembles by over |
|
2 BLEU. On the WMT 2014 English-to-French translation task, our model |
|
establishes a new single-model state-of-the-art BLEU score of 41.8 after |
|
training for 3.5 days on eight GPUs, a small fraction of the training costs |
|
of the best models from the literature. We show that the Transformer |
|
generalizes well to other tasks by applying it successfully to English |
|
constituency parsing both with large and limited training data. |
|
[Introduction] Recurrent neural networks, long short-term memory [13] and |
|
gated recurrent [7] neural networks in particular, have been firmly |
|
established as state of the art approaches in sequence modeling and |
|
transduction problems such as language modeling and machine translation [35, |
|
2, 5]. Numerous efforts have since continued to push the boundaries of |
|
recurrent language models and encoder-decoder architectures [38, 24, 15]. |
|
Recurrent models typically factor computation along the symbol positions of |
|
the input and output sequences. Aligning the positions to steps in |
|
computation time, they generate a sequence of hidden states ht, as a |
|
function of the previous hidden state ht−1 and the input for position t. |
|
This inherently sequential nature precludes parallelization within training |
|
examples, which becomes critical at longer sequence lengths, as memory |
|
constraints limit batching across examples. Recent work has achieved |
|
significant improvements in computational efficiency through factorization |
|
tricks [21] and conditional computation [32], while also improving model |
|
performance in case of the latter. The fundamental constraint of sequential |
|
computation, however, remains. Attention mechanisms have become an integral |
|
part of compelling sequence modeling and transduction models in various |
|
tasks, allowing modeling of dependencies without regard to their distance in |
|
the input or output sequences [2, 19]. In all but a few cases [27], however, |
|
such attention mechanisms are used in conjunction with a recurrent network. |
|
In this work we propose the Transformer, a model architecture eschewing |
|
recurrence and instead relying entirely on an attention mechanism to draw |
|
global dependencies between input and output. The Transformer allows for |
|
significantly more parallelization and can reach a new state of the art in |
|
translation quality after being trained for as little as twelve hours on |
|
eight P100 GPUs. |
|
example_title: Attention Is All You Need |
|
- text: >- |
|
[Abstract] In this work, we explore prompt tuning, a simple yet effective |
|
mechanism for learning soft prompts to condition frozen language models to |
|
perform specific downstream tasks. Unlike the discrete text prompts used by |
|
GPT-3, soft prompts are learned through backpropagation and can be tuned to |
|
incorporate signal from any number of labeled examples. Our end-to-end |
|
learned approach outperforms GPT-3's few-shot learning by a large margin. |
|
More remarkably, through ablations on model size using T5, we show that |
|
prompt tuning becomes more competitive with scale: as models exceed billions |
|
of parameters, our method closes the gap and matches the strong performance |
|
of model tuning (where all model weights are tuned). This finding is |
|
especially relevant in that large models are costly to share and serve, and |
|
the ability to reuse one frozen model for multiple downstream tasks can ease |
|
this burden. Our method can be seen as a simplification of the recently |
|
proposed prefix tuning of Li and Liang (2021), and we provide a comparison |
|
to this and other similar approaches. Finally, we show that conditioning a |
|
frozen model with soft prompts confers benefits in robustness to domain |
|
transfer, as compared to full model tuning. [Introduction] With the wide |
|
success of pre-trained large language models, a range of techniques has |
|
arisen to adapt these general-purpose models to downstream tasks. ELMo |
|
(Peters et al., 2018) proposed freezing the pre-trained model and learning a |
|
task-specific weighting of its per-layer representations. However, since GPT |
|
(Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant |
|
adaptation technique has been model tuning (or fine-tuning), where all model |
|
parameters are tuned during adaptation, as proposed by Howard and Ruder |
|
(2018). More recently, Brown et al. (2020) showed that prompt design (or |
|
priming) is surprisingly effective at modulating a frozen GPT-3 model’s |
|
behavior through text prompts. Prompts are typically composed of a task |
|
description and/or several canonical examples. This return to freezing |
|
pre-trained models is appealing, especially as model size continues to |
|
increase. Rather than requiring a separate copy of the model for each |
|
downstream task, a single generalist model can simultaneously serve many |
|
different tasks. Unfortunately, prompt-based adaptation has several key |
|
drawbacks. Task description is error-prone and requires human involvement, |
|
and the effectiveness of a prompt is limited by how much conditioning text |
|
can fit into the model’s input. As a result, downstream task quality still |
|
lags far behind that of tuned models. For instance, GPT-3 175B few-shot |
|
performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et |
|
al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several |
|
efforts to automate prompt design have been recently proposed. Shin et al. |
|
(2020) propose a search algorithm over the discrete space of words, guided |
|
by the downstream application training data. While this technique |
|
outperforms manual prompt design, there is still a gap relative to model |
|
tuning. Li and Liang (2021) propose prefix tuning and show strong results on |
|
generative tasks. This method freezes the model parameters and |
|
backpropagates the error during tuning to prefix activations prepended to |
|
each layer in the encoder stack, including the input layer. Hambardzumyan et |
|
al. (2021) simplify this recipe by restricting the trainable parameters to |
|
the input and output subnetworks of a masked language model, and show |
|
reasonable results on classification tasks. In this paper, we propose |
|
prompt tuning as a further simplification for adapting language models. We |
|
freeze the entire pre-trained model and only allow an additional k tunable |
|
tokens per downstream task to be prepended to the input text. This soft |
|
prompt is trained end-to-end and can condense the signal from a full labeled |
|
dataset, allowing our method to outperform few-shot prompts and close the |
|
quality gap with model tuning (Figure 1). At the same time, since a single |
|
pre-trained model is recycled for all downstream tasks, we retain the |
|
efficient serving benefits of frozen models (Figure 2). While we developed |
|
our method concurrently with Li and Liang (2021) and Hambardzumyan et al. |
|
(2021), we are the first to show that prompt tuning alone (with no |
|
intermediate-layer prefixes or task-specific output layers) is sufficient to |
|
be competitive with model tuning. Through detailed experiments in sections |
|
2–3, we demonstrate that language model capacity is a key ingredient for |
|
these approaches to succeed. As Figure 1 shows, prompt tuning becomes more |
|
competitive with scale. We compare with similar approaches in Section 4. |
|
Explicitly separating task-specific parameters from the generalist |
|
parameters needed for general language-understanding has a range of |
|
additional benefits. We show in Section 5 that by capturing the task |
|
definition in the prompt while keeping the generalist parameters fixed, we |
|
are able to achieve better resilience to domain shifts. In Section 6, we |
|
show that prompt ensembling, learning multiple prompts for the same task, |
|
can boost quality and is more efficient than classic model ensembling. |
|
Finally, in Section 7, we investigate the interpretability of our learned |
|
soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning |
|
and showing its competitiveness with model tuning in the regime of large |
|
language models. 2. Ablating many design choices, and showing quality and |
|
robustness improve with scale. 3. Showing prompt tuning outperforms model |
|
tuning on domain shift problems. 4. Proposing prompt ensembling and showing |
|
its effectiveness. |
|
example_title: PEFT (2104.08691) |
|
- text: >- |
|
[Abstract] For the first time in the world, we succeeded in synthesizing the |
|
room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient |
|
pressure with a modified lead-apatite (LK-99) structure. The |
|
superconductivity of LK-99 is proved with the Critical temperature (Tc), |
|
Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and |
|
the Meissner effect. The superconductivity of LK-99 originates from minute |
|
structural distortion by a slight volume shrinkage (0.48 %), not by external |
|
factors such as temperature and pressure. The shrinkage is caused by Cu2+ |
|
substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate |
|
and it generates the stress. It concurrently transfers to Pb(1) of the |
|
cylindrical column resulting in distortion of the cylindrical column |
|
interface, which creates superconducting quantum wells (SQWs) in the |
|
interface. The heat capacity results indicated that the new model is |
|
suitable for explaining the superconductivity of LK-99. The unique structure |
|
of LK-99 that allows the minute distorted structure to be maintained in the |
|
interfaces is the most important factor that LK-99 maintains and exhibits |
|
superconductivity at room temperatures and ambient pressure. [Introduction] |
|
Since the discovery of the first superconductor(1), many efforts to search |
|
for new room-temperature superconductors have been carried out worldwide(2, |
|
3) through their experimental clarity or/and theoretical perspectives(4-8). |
|
The recent success of developing room-temperature superconductors with |
|
hydrogen sulfide(9) and yttrium super-hydride(10) has great attention |
|
worldwide, which is expected by strong electron-phonon coupling theory with |
|
high-frequency hydrogen phonon modes(11, 12). However, it is difficult to |
|
apply them to actual application devices in daily life because of the |
|
tremendously high pressure, and more efforts are being made to overcome the |
|
high-pressure problem(13). For the first time in the world, we report the |
|
success in synthesizing a room-temperature and ambient-pressure |
|
superconductor with a chemical approach to solve the temperature and |
|
pressure problem. We named the first room temperature and ambient pressure |
|
superconductor LK-99. The superconductivity of LK-99 proved with the |
|
Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical |
|
magnetic field (Hc), and Meissner effect(14, 15). Several data were |
|
collected and analyzed in detail to figure out the puzzle of |
|
superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron |
|
spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat |
|
Capacity, and Superconducting quantum interference device (SQUID) data. |
|
Henceforth in this paper, we will report and discuss our new findings |
|
including superconducting quantum wells associated with the |
|
superconductivity of LK-99. |
|
example_title: LK-99 (Not NLP) |
|
- text: >- |
|
[Abstract] Evaluation practices in natural language generation |
|
(NLG) have many known flaws, but improved evaluation approaches are rarely |
|
widely adopted. This issue has become more urgent, since neural NLG models |
|
have improved to the point where they can often no longer be distinguished |
|
based on the surface-level features that older metrics rely on. This paper |
|
surveys the issues with human and automatic model evaluations and with |
|
commonly used datasets in NLG that have been pointed out over the past 20 |
|
years. We summarize, categorize, and discuss how researchers have been |
|
addressing these issues and what their findings mean for the current state |
|
of model evaluations. Building on those insights, we lay out a long-term |
|
vision for NLG evaluation and propose concrete steps for researchers to |
|
improve their evaluation processes. Finally, we analyze 66 NLG papers from |
|
recent NLP conferences in how well they already follow these suggestions and |
|
identify which areas require more drastic changes to the status quo. |
|
[Introduction] There are many issues with the evaluation of models that |
|
generate natural language. For example, datasets are often constructed in a |
|
way that prevents measuring tail effects of robustness, and they almost |
|
exclusively cover English. Most automated metrics measure only similarity |
|
between model output and references instead of fine-grained quality aspects |
|
(and even that poorly). Human evaluations have a high variance and, due to |
|
insufficient documentation, rarely produce replicable results. These issues |
|
have become more urgent as the nature of models that generate language has |
|
changed without significant changes to how they are being evaluated. While |
|
evaluation methods can capture surface-level improvements in text generated |
|
by state-of-the-art models (such as increased fluency) to some extent, they |
|
are ill-suited to detect issues with the content of model outputs, for |
|
example if they are not attributable to input information. These ineffective |
|
evaluations lead to overestimates of model capabilities. Deeper analyses |
|
uncover that popular models fail even at simple tasks by taking shortcuts, |
|
overfitting, hallucinating, and not being in accordance with their |
|
communicative goals. Identifying these shortcomings, many recent papers |
|
critique evaluation techniques or propose new ones. But almost none of the |
|
suggestions are followed or new techniques used. There is an incentive |
|
mismatch between conducting high-quality evaluations and publishing new |
|
models or modeling techniques. While general-purpose evaluation techniques |
|
could lower the barrier of entry for incorporating evaluation advances into |
|
model development, their development requires resources that are hard to |
|
come by, including model outputs on validation and test sets or large |
|
quantities of human assessments of such outputs. Moreover, some issues, like |
|
the refinement of datasets, require iterative processes where many |
|
researchers collaborate. All this leads to a circular dependency where |
|
evaluations of generation models can be improved only if generation models |
|
use better evaluations. We find that there is a systemic difference between |
|
selecting the best model and characterizing how good this model really is. |
|
Current evaluation techniques focus on the first, while the second is |
|
required to detect crucial issues. More emphasis needs to be put on |
|
measuring and reporting model limitations, rather than focusing on producing |
|
the highest performance numbers. To that end, this paper surveys analyses |
|
and critiques of evaluation approaches (sections 3 and 4) and of commonly |
|
used NLG datasets (section 5). Drawing on their insights, we describe how |
|
researchers developing modeling techniques can help to improve and |
|
subsequently benefit from better evaluations with methods available today |
|
(section 6). Expanding on existing work on model documentation and formal |
|
evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we |
|
propose releasing evaluation reports which focus on demonstrating NLG model |
|
shortcomings using evaluation suites. These reports should apply a |
|
complementary set of automatic metrics, include rigorous human evaluations, |
|
and be accompanied by data releases that allow for re-analysis with improved |
|
metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29 |
|
dimensions related to our suggestions (section 7), we find that the first |
|
steps toward an improved evaluation are already frequently taken at an |
|
average rate of 27%. The analysis uncovers the dimensions that require more |
|
drastic changes in the NLG community. For example, 84% of papers already |
|
report results on multiple datasets and more than 28% point out issues in |
|
them, but we found only a single paper that contributed to the dataset |
|
documentation, leaving future researchers to re-identify those issues. We |
|
further highlight typical unsupported claims and a need for more consistent |
|
data release practices. Following the suggestions and results, we discuss |
|
how incorporating the suggestions can improve evaluation research, how the |
|
suggestions differ from similar ones made for NLU, and how better metrics |
|
can benefit model development itself (section 8). |
|
example_title: NLG-Eval (2202.06935) |
|
model-index: |
|
- name: Long-paper-summarization-pegasus-x-b |
|
results: |
|
- task: |
|
name: Summarization |
|
type: summarization |
|
dataset: |
|
name: ccdv/arxiv-summarization |
|
type: ccdv/arxiv-summarization |
|
config: section |
|
split: test |
|
args: section |
|
metrics: |
|
- name: ROUGE-1 |
|
type: rouge |
|
value: 35.6639 |
|
- name: ROUGE-2 |
|
type: rouge |
|
value: 9.81362 |
|
- name: ROUGE-L |
|
type: rouge |
|
value: 19.9013 |
|
- name: ROUGE-LSum |
|
type: rouge |
|
value: 28.1444 |
|
|
|
license: mit |
|
language: |
|
- en |
|
metrics: |
|
- rouge |
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
|
# Long-paper-summarization-pegasus-x-b |
|
|
|
This model is a fine-tuned version of [google/pegasus-x-base](https://huggingface.co/google/pegasus-x-base) on the arxiv-summarization dataset. |
|
It achieves the following results on the evaluation set (values after the final training step, 3510):

- Loss: 2.7262
|
|
|
|
|
## Model Description / Training and evaluation data |
|
|
|
|
|
**Base Model**: [Pegasus-x-base (State-of-the-art for Long Context Summarization)](https://huggingface.co/google/pegasus-x-base) |
|
|
|
**Fine-tuning dataset**:

- We used the `train[25000:100000]` slice (75,000 examples) of the **ArXiv dataset (Cohan et al., NAACL-HLT 2018)** [[PDF]](https://arxiv.org/abs/1804.05685).
- The full training split has more than 200,000 examples; we will upload a model trained on the full split soon.
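The slice above can be requested directly through the split-slicing syntax of the `datasets` library. The following is a minimal sketch, not the authors' training script: the helper names are hypothetical, and the dataset ID `ccdv/arxiv-summarization` with the `section` config is taken from this card's metadata on the assumption that training used the same configuration.

```python
def split_spec(start: int, stop: int, split: str = "train") -> str:
    """Build a `datasets` slice string such as 'train[25000:100000]'."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return f"{split}[{start}:{stop}]"


def load_finetuning_subset(start: int = 25_000, stop: int = 100_000):
    """Load the training slice (downloads the dataset on first use)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: dataset ID and "section" config as listed in the card metadata.
    return load_dataset(
        "ccdv/arxiv-summarization", "section", split=split_spec(start, stop)
    )


SPEC = split_spec(25_000, 100_000)   # the slice string passed to load_dataset
NUM_EXAMPLES = 100_000 - 25_000      # number of examples in that slice
```

Depending on the installed `datasets` version, loading this dataset may need extra arguments; the call is shown only to illustrate the slice.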
|
|
|
**GPU**: 1 × RTX A6000

**Training time**: about 24 hours for 3 epochs

**Test time**: about 8 hours for the test set
|
|
|
|
|
## Intended uses & limitations |
|
|
|
- **Research Paper Summarization** |
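The widget examples for this model format their input as `[Abstract] ... [Introduction] ...`. A minimal sketch of preparing text that way and passing it to the `transformers` summarization pipeline is given here. The helper names and the `max_new_tokens` value are assumptions for illustration, and the model ID must be supplied by the caller because the Hub namespace is not stated in this card.

```python
def build_input(abstract: str, introduction: str) -> str:
    """Join the two sections using the tags seen in the widget examples."""
    return f"[Abstract] {abstract.strip()} [Introduction] {introduction.strip()}"


def summarize(text: str, model_id: str, max_new_tokens: int = 256) -> str:
    """Summarize `text` with a Hub checkpoint (downloads weights on first use)."""
    # Imported lazily so build_input works without `transformers` installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_id)
    # truncation=True clips inputs that exceed the model's maximum length.
    result = summarizer(text, max_new_tokens=max_new_tokens, truncation=True)
    return result[0]["summary_text"]


example = build_input("We propose a new method.", "Prior work is limited.")
```

A call would look like `summarize(example, "<namespace>/Long-paper-summarization-pegasus-x-b")`, where `<namespace>` is a placeholder for the actual repository owner.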
|
|
|
|
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 1e-05 |
|
- train_batch_size: 1 |
|
- eval_batch_size: 1 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 64 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 390 |
|
- **num_epochs: 3 (takes about 24 hours)** |
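The relationship between these values can be checked with a little arithmetic. The sketch below assumes the training slice holds 75,000 examples and that optimizer steps per epoch are counted as batches floor-divided by the accumulation steps; the class and property names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    num_examples: int = 75_000           # assumption: size of train[25000:100000]
    train_batch_size: int = 1
    gradient_accumulation_steps: int = 64
    num_epochs: int = 3
    warmup_steps: int = 390

    @property
    def total_train_batch_size(self) -> int:
        # Examples contributing to one optimizer update.
        return self.train_batch_size * self.gradient_accumulation_steps

    @property
    def steps_per_epoch(self) -> int:
        batches = math.ceil(self.num_examples / self.train_batch_size)
        return max(batches // self.gradient_accumulation_steps, 1)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.num_epochs


setup = TrainSetup()
print(setup.total_train_batch_size)  # 64
print(setup.steps_per_epoch)         # 1171
print(setup.total_steps)             # 3513
print(round(setup.warmup_steps / setup.steps_per_epoch, 2))  # 0.33
```

Under these assumptions the 390 warmup steps cover roughly the first third of the first epoch, which matches the epoch value logged at step 390 in the training results.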
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 3.401         | 0.33  | 390  | 2.3985          |
| 2.5444        | 0.67  | 780  | 2.2461          |
| 2.4849        | 1.0   | 1170 | 2.2690          |
| 2.5735        | 1.33  | 1560 | 2.3334          |
| 2.7045        | 1.66  | 1950 | 2.4330          |
| 2.8939        | 2.0   | 2340 | 2.5461          |
| 3.0773        | 2.33  | 2730 | 2.6502          |
| 3.2149        | 2.66  | 3120 | 2.7039          |
| 3.2844        | 3.0   | 3510 | 2.7262          |
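In the logged values, validation loss is lowest at step 780 and rises afterwards, while the reported evaluation loss (2.7262) is that of the final step. A small sketch for locating the lowest-validation-loss row from the logged numbers follows; the variable names are hypothetical, and whether a checkpoint from that step was saved is not stated in this card.

```python
# (step, training_loss, validation_loss) rows copied from the training results.
LOG = [
    (390, 3.401, 2.3985),
    (780, 2.5444, 2.2461),
    (1170, 2.4849, 2.2690),
    (1560, 2.5735, 2.3334),
    (1950, 2.7045, 2.4330),
    (2340, 2.8939, 2.5461),
    (2730, 3.0773, 2.6502),
    (3120, 3.2149, 2.7039),
    (3510, 3.2844, 2.7262),
]

# Row with the smallest validation loss.
best_step, _, best_val_loss = min(LOG, key=lambda row: row[2])

# Validation loss at the last logged step, and how far it sits above the best.
final_val_loss = LOG[-1][2]
gap = final_val_loss - best_val_loss

print(best_step, best_val_loss)  # 780 2.2461
print(round(gap, 4))             # 0.4801
```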
|
|
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.32.1
- PyTorch 2.0.1
- Datasets 2.12.0
- Tokenizers 0.13.2