kdvisdjf rkeogkw committed on
Commit
1ef3e9a
·
1 Parent(s): 032c20a

Create README.md

  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ datasets:
+ - midas/krapivin
+ - midas/inspec
+ language:
+ - en
+
+ widget:
+ - text: "<|KEYPHRASES|> In this paper, we investigate cross-domain limitations of keyphrase generation using the models for abstractive text summarization. We present an evaluation of BART fine-tuned for keyphrase generation across three types of texts, namely scientific texts from computer science and biomedical domains and news texts. We explore the role of transfer learning between different domains to improve the model performance on small text corpora."
+ - text: "<|TITLE|> In this paper, we investigate cross-domain limitations of keyphrase generation using the models for abstractive text summarization. We present an evaluation of BART fine-tuned for keyphrase generation across three types of texts, namely scientific texts from computer science and biomedical domains and news texts. We explore the role of transfer learning between different domains to improve the model performance on small text corpora."
+ - text: "<|KEYPHRASES|> Relevance has traditionally been linked with feature subset selection, but formalization of this link has not been attempted. In this paper, we propose two axioms for feature subset selection (sufficiency axiom and necessity axiom), based on which this link is formalized: The expected feature subset is the one which maximizes relevance. Finding the expected feature subset turns out to be NP-hard. We then devise a heuristic algorithm to find the expected subset which has a polynomial time complexity. The experimental results show that the algorithm finds good enough subset of features which, when presented to C4.5, results in better prediction accuracy."
+ - text: "<|TITLE|> Relevance has traditionally been linked with feature subset selection, but formalization of this link has not been attempted. In this paper, we propose two axioms for feature subset selection (sufficiency axiom and necessity axiom), based on which this link is formalized: The expected feature subset is the one which maximizes relevance. Finding the expected feature subset turns out to be NP-hard. We then devise a heuristic algorithm to find the expected subset which has a polynomial time complexity. The experimental results show that the algorithm finds good enough subset of features which, when presented to C4.5, results in better prediction accuracy."
+ ---
+
+ # BART fine-tuned for keyphrase generation
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ This is the <a href="https://huggingface.co/facebook/bart-base">bart-base</a> (<a href="https://arxiv.org/abs/1910.13461">Lewis et al., 2019</a>) model <a href="https://arxiv.org/abs/2209.03791">fine-tuned</a> on the following corpora to generate titles and keyphrases for scientific texts:
+
+ * Krapivin (<a href="http://eprints.biblio.unitn.it/1671/1/disi09055%2Dkrapivin%2Dautayeu%2Dmarchese.pdf">Krapivin et al., 2009</a>)
+ * Inspec (<a href="https://aclanthology.org/W03-1028.pdf">Hulth, 2003</a>)
+
+ Inspired by <a href="https://aclanthology.org/2020.findings-emnlp.428.pdf">Cachola et al. (2020)</a>, we used control codes to fine-tune BART in a multi-task manner. First, we created a training set whose text generation targets are titles and comma-separated lists of keyphrases. For this purpose, we formed text-title and text-keyphrases pairs from the original text corpus. Second, we prepended the control code `<|TITLE|>` or `<|KEYPHRASES|>` to each source text, according to its target. The training set was then shuffled in random order. Finally, the preprocessed training set was used to fine-tune the pre-trained BART model.
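
The data preparation described in this paragraph can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' preprocessing code: the record field names (`text`, `title`, `keyphrases`), the `", "` separator, the fixed seed, and the helper name `build_multitask_examples` are all hypothetical. The control code is placed before the text with a single space, matching the usage example in the README.

```python
import random

TITLE_CODE = "<|TITLE|>"
KEYPHRASES_CODE = "<|KEYPHRASES|>"


def build_multitask_examples(records, seed=0):
    """Turn each record into two (source, target) pairs, one per task.

    Assumption: each record is a dict with "text", "title" and
    "keyphrases" (a list of strings); these field names are hypothetical.
    """
    examples = []
    for record in records:
        text = record["text"]
        # Title task: control code + text -> title
        examples.append((f"{TITLE_CODE} {text}", record["title"]))
        # Keyphrase task: control code + text -> comma-separated keyphrases
        examples.append(
            (f"{KEYPHRASES_CODE} {text}", ", ".join(record["keyphrases"]))
        )
    # Shuffle so that the two tasks are interleaved during fine-tuning
    random.Random(seed).shuffle(examples)
    return examples


# Made-up records, for illustration only
records = [
    {
        "text": "Relevance has traditionally been linked with feature subset selection.",
        "title": "Relevance and feature subset selection",
        "keyphrases": ["feature selection", "relevance"],
    },
    {
        "text": "We evaluate BART fine-tuned for keyphrase generation.",
        "title": "BART for keyphrase generation",
        "keyphrases": ["keyphrase generation", "BART"],
    },
]

examples = build_multitask_examples(records)
for source, target in examples:
    print(source, "->", target)
```

Each source text appears twice in the resulting set, once per control code, so a single model learns both tasks from the same inputs.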
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ model_name = "beogradjanka/bart_multitask_finetuned_for_title_and_keyphrase_generation"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+
+ # Adjacent string literals are concatenated; each ends with a space
+ # so that sentences do not run together.
+ text = (
+     "In this paper, we investigate cross-domain limitations of keyphrase generation using the models for abstractive text summarization. "
+     "We present an evaluation of BART fine-tuned for keyphrase generation across three types of texts, "
+     "namely scientific texts from computer science and biomedical domains and news texts. "
+     "We explore the role of transfer learning between different domains to improve the model performance on small text corpora."
+ )
+
+ # Generate keyphrases: prefix the text with the <|KEYPHRASES|> control code.
+ # Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch.
+ inputs = tokenizer(["<|KEYPHRASES|> " + text], return_tensors="pt", truncation=True)
+ output_ids = model.generate(**inputs)
+ keyphrases = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
+ print(keyphrases)
+
+ # Generate a title: prefix the same text with the <|TITLE|> control code.
+ inputs = tokenizer(["<|TITLE|> " + text], return_tensors="pt", truncation=True)
+ output_ids = model.generate(**inputs)
+ title = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
+ print(title)
+ ```
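
Because the keyphrase targets are comma-separated lists, the string printed by the keyphrase call above can be split into individual phrases. The following is a minimal sketch, assuming the model output follows that format; the helper `parse_keyphrases` and the sample string are hypothetical and not part of the model repository.

```python
def parse_keyphrases(generated):
    """Split a comma-separated keyphrase string into a list.

    Empty items are dropped and case-insensitive duplicates are removed,
    keeping the first occurrence and the original order.
    """
    seen = set()
    phrases = []
    for chunk in generated.split(","):
        phrase = chunk.strip()
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            phrases.append(phrase)
    return phrases


# Made-up output string, for illustration only
sample_output = "keyphrase generation, transfer learning, BART, keyphrase generation"
print(parse_keyphrases(sample_output))
# → ['keyphrase generation', 'transfer learning', 'BART']
```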
+
+ #### Training Hyperparameters
+
+ The following hyperparameters were used during training:
+
+ * learning_rate: 4e-5
+ * train_batch_size: 8
+ * optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
+ * num_epochs: 3
+
+ **BibTeX:**
+
+ ```
+ @article{glazkova2022applying,
+   title={Applying transformer-based text summarization for keyphrase generation},
+   author={Glazkova, Anna and Morozov, Dmitry},
+   journal={arXiv preprint arXiv:2209.03791},
+   year={2022}
+ }
+ ```
+