kdvisdjf rkeogkw committed
Commit 477fc5c · 1 Parent(s): 412cefe
Update README.md
README.md CHANGED
@@ -34,8 +34,7 @@ This is the <a href="https://huggingface.co/facebook/bart-base">bart-base</a> (<
* Krapivin (<a href = "http://eprints.biblio.unitn.it/1671/1/disi09055%2Dkrapivin%2Dautayeu%2Dmarchese.pdf">Krapivin et al., 2009</a>)
* Inspec (<a href = "https://aclanthology.org/W03-1028.pdf">Hulth, 2003</a>)

-Inspired by <a href = "https://aclanthology.org/2020.findings-emnlp.428.pdf">(Cachola et al., 2020)</a>, we applied control codes to fine-tune BART in a multi-task manner. First, we create a training set containing comma-separated lists of keyphrases and titles as text generation targets. For this purpose, we form text-title and text-keyphrases pairs based on the original text corpus. Second, we append each source text in the training set with control codes <|TITLE|> and <|KEYPHRASES|> respectively. After that, the training set is shuffled in random order. Finally, the preprocessed training set is utilized to fine-tune the pre-trained BART model
-
+Inspired by <a href = "https://aclanthology.org/2020.findings-emnlp.428.pdf">(Cachola et al., 2020)</a>, we applied control codes to fine-tune BART in a multi-task manner. First, we create a training set containing comma-separated lists of keyphrases and titles as text generation targets. For this purpose, we form text-title and text-keyphrases pairs based on the original text corpus. Second, we append each source text in the training set with control codes <|TITLE|> and <|KEYPHRASES|> respectively. After that, the training set is shuffled in random order. Finally, the preprocessed training set is utilized to fine-tune the pre-trained BART model.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

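The paragraph edited in the hunk above describes the multi-task preprocessing with control codes. A minimal sketch of that setup is given below; the corpus format, field names, and the `make_multitask_examples` helper are illustrative assumptions, not code from this repository:

```python
# Minimal sketch of the multi-task training-set construction described above.
# The corpus format, field names, and make_multitask_examples helper are
# illustrative assumptions, not code from this repository.
import random

def make_multitask_examples(corpus):
    """Build (source, target) pairs marked with <|TITLE|> / <|KEYPHRASES|> control codes."""
    examples = []
    for doc in corpus:
        # text -> title pair, marked with the <|TITLE|> control code
        examples.append(("<|TITLE|> " + doc["text"], doc["title"]))
        # text -> keyphrases pair; the target is a comma-separated keyphrase list
        examples.append(("<|KEYPHRASES|> " + doc["text"], ", ".join(doc["keyphrases"])))
    random.shuffle(examples)  # shuffle the combined training set
    return examples

# Toy corpus entry, made up for illustration only.
corpus = [{
    "text": "In this paper, we investigate cross-domain limitations of keyphrase generation models.",
    "title": "Cross-Domain Limitations of Keyphrase Generation",
    "keyphrases": ["keyphrase generation", "transfer learning", "cross-domain"],
}]
train_pairs = make_multitask_examples(corpus)
```

Placing the control code at the start of the source text mirrors the usage snippet in the hunk below.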
@@ -48,7 +47,7 @@ text = "In this paper, we investigate cross-domain limitations of keyphrase gene
namely scientific texts from computer science and biomedical domains and news texts. \
We explore the role of transfer learning between different domains to improve the model performance on small text corpora."

-#generating keyphrases
+#generating \n-separated keyphrases
tokenized_text = tokenizer.prepare_seq2seq_batch(["<|KEYPHRASES|> " + text], return_tensors='pt')
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
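The updated snippet decodes a single generated string. A short usage sketch for post-processing it and for generating a title with the <|TITLE|> control code follows; it assumes the `tokenizer`, `model`, and `text` objects loaded earlier in the README, and the newline delimiter is an assumption based on the updated "\n-separated" comment:

```python
# Continues from the README snippet above; tokenizer, model, text and
# translated_text are assumed to be already defined there.
# The newline delimiter follows the "\n-separated" comment and is an assumption.
keyphrases = [kp.strip() for kp in translated_text.split("\n") if kp.strip()]
print(keyphrases)

# The same model can be switched to title generation by swapping the control code,
# per the <|TITLE|> code described in the updated paragraph.
title_batch = tokenizer.prepare_seq2seq_batch(["<|TITLE|> " + text], return_tensors='pt')
title_ids = model.generate(**title_batch)
title = tokenizer.batch_decode(title_ids, skip_special_tokens=True)[0]
print(title)
```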