---
license: cc-by-4.0
datasets:
- abisee/cnn_dailymail
language:
- en
tags:
- NLP
- Text-Summarization
- CNN
metrics:
- rouge
pipeline_tag: summarization
library_name: keras
---
# Seq2Seq Model with Attention for Text Summarization
This repository contains a Sequence-to-Sequence (Seq2Seq) model for abstractive text summarization, trained on the **CNN/DailyMail** dataset. The model is built with Keras and uses pre-trained GloVe embeddings for its word representations. It is an encoder-decoder architecture built from LSTM layers; the base code shown here does not yet include an attention layer, which is described as an extension under Model Architecture and Future Work.
## Model Architecture
The model follows the classic encoder-decoder structure, with attention as an optional extension for long input sequences:
- **Embedding Layer**: Uses pre-trained GloVe embeddings (100-dimensional) for both the input (article) and target (summary) texts.
- **Encoder**: A bidirectional LSTM to encode the input sequence. The forward and backward hidden states are concatenated.
- **Decoder**: An LSTM initialized with the encoder's hidden and cell states to generate the target sequence (summary).
- **Attention Mechanism**: The base code does not implement attention. An attention layer can be added so that the decoder focuses on the relevant parts of the input sequence at each decoding step.
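To make the attention bullet concrete, here is a minimal NumPy sketch of Luong-style dot-product attention for a single decoder step. It is an illustration rather than part of this repository's code: the function name, shapes, and random inputs are assumptions. Note also that the encoder shown in this card returns only its final states, so it would need `return_sequences=True` to expose the per-token outputs that attention consumes.

```python
import numpy as np

def dot_product_attention(dec_state, enc_outputs):
    """Luong-style dot-product attention for one decoder step (illustrative).

    dec_state:   (hidden,)            current decoder hidden state
    enc_outputs: (timesteps, hidden)  one encoder output row per input token
    """
    scores = enc_outputs @ dec_state        # (timesteps,) alignment scores
    scores = scores - scores.max()          # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ enc_outputs         # (hidden,) weighted sum of encoder outputs
    return context, weights

# Hypothetical inputs: 5 input tokens, hidden size 8
rng = np.random.default_rng(0)
enc_outputs = rng.normal(size=(5, 8))
dec_state = rng.normal(size=8)
context, weights = dot_product_attention(dec_state, enc_outputs)
```

In a full model, the context vector would typically be concatenated with the decoder output before the final softmax layer.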
### Embeddings
We use 100-dimensional GloVe embeddings pre-trained on a large text corpus. Separate embedding matrices are built for the input (article) and output (summary) vocabularies; words that have no GloVe vector keep an all-zero row.
```python
import numpy as np

# Load the pre-trained GloVe vectors into a word -> vector lookup
embedding_index = {}
embed_dim = 100
with open('../input/glove6b100dtxt/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Embedding matrix for the input (articles); words missing from GloVe stay all-zero
t_embed = np.zeros((t_max_features, embed_dim))
for word, i in t_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < t_max_features and vec is not None:
        t_embed[i] = vec

# Embedding matrix for the output (summaries)
s_embed = np.zeros((s_max_features, embed_dim))
for word, i in s_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < s_max_features and vec is not None:
        s_embed[i] = vec
```
## Encoder
A bidirectional LSTM is used for encoding the input text. The forward and backward hidden and cell states are concatenated to pass as the initial states to the decoder.
```python
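# Assumes the TensorFlow-bundled Keras; adjust the import if you use standalone Keras
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Concatenate, TimeDistributed, Dense)

latent_dim = 128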
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, input_length=maxlen_text, weights=[t_embed], trainable=False)(enc_input)
enc_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
enc_output, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)
# Concatenate the forward and backward states
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])
```
## Decoder
The decoder is an LSTM that takes the encoder's final states as the initial states to generate the output summary sequence.
```python
dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dec_input)
dec_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True, dropout=0.3, recurrent_dropout=0.2)
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])
# Dense layer with softmax activation for final output
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_outputs)
```
## Model Summary
The full Seq2Seq model is compiled with sparse categorical cross-entropy loss and the RMSprop optimizer.
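The compile call itself is not listed in this card. The following is a minimal, self-contained sketch of how the encoder and decoder could be wired into one model and compiled with the stated loss and optimizer. The layer sizes are toy placeholders, not the trained configuration; the TensorFlow-bundled Keras import is an assumption; and `Dense` is applied directly to the 3D decoder output, where it acts on each time step in the same way as the `TimeDistributed(Dense(...))` wrapper used in the decoder section.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy sizes (assumptions for illustration only)
t_max_features, s_max_features = 50, 40
embed_dim, latent_dim, maxlen_text = 8, 4, 12

# Encoder: bidirectional LSTM; forward and backward states are concatenated
enc_input = keras.Input(shape=(maxlen_text,))
enc_embed = layers.Embedding(t_max_features, embed_dim)(enc_input)
_, fh, fc, bh, bc = layers.Bidirectional(
    layers.LSTM(latent_dim, return_state=True))(enc_embed)
enc_h = layers.Concatenate()([fh, bh])
enc_c = layers.Concatenate()([fc, bc])

# Decoder: LSTM initialized with the concatenated encoder states
dec_input = keras.Input(shape=(None,))
dec_embed = layers.Embedding(s_max_features, embed_dim)(dec_input)
dec_seq, _, _ = layers.LSTM(
    latent_dim * 2, return_sequences=True, return_state=True)(
        dec_embed, initial_state=[enc_h, enc_c])
dec_output = layers.Dense(s_max_features, activation="softmax")(dec_seq)

# Two inputs (article, shifted summary), one output (next-token distribution)
model = keras.Model([enc_input, dec_input], dec_output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
```

Sparse categorical cross-entropy lets the targets stay as integer token ids instead of one-hot vectors, which keeps memory use manageable for a large summary vocabulary.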
### Model Visualization
A diagram of the model is generated using Keras' `plot_model` function:
![Seq2Seq Encoder-Decoder Model Architecture](./seq2seq_encoder_decoder.png)
## Training
The model is trained with a batch size of 128 for at most 10 epochs. Early stopping monitors the validation loss and halts training after 2 consecutive epochs without improvement, which limits overfitting.
```python
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
model.fit([train_x, train_y[:, :-1]],
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:],
          epochs=10,
          callbacks=[early_stop],
          batch_size=128,
          verbose=2,
          validation_data=([val_x, val_y[:, :-1]],
                           val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))
```
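The slicing in the `fit` call implements teacher forcing: the decoder receives the summary without its last position and is trained to predict the same summary shifted one step to the left. The small example below illustrates that shift with made-up token ids; the start, end, and padding ids are assumptions for illustration, not values taken from this repository's tokenizer.

```python
import numpy as np

# Hypothetical token ids: 1 = start token, 2 = end token, 0 = padding
batch_y = np.array([[1, 7, 9, 2, 0],
                    [1, 4, 2, 0, 0]])

# Decoder input: every position except the last
dec_in = batch_y[:, :-1]

# Target: every position except the first, with a trailing axis of size 1
dec_target = batch_y.reshape(batch_y.shape[0], batch_y.shape[1], 1)[:, 1:]

print(dec_in[0])            # [1 7 9 2]
print(dec_target[0, :, 0])  # [7 9 2 0]
```

At each time step the decoder sees the true previous token and is scored on the next one, so input and target always have the same length.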
## Dataset
The CNN/DailyMail dataset is used for training and validation. It contains news articles and their corresponding summaries, which makes it suitable for the text summarization task.
- Train set: Used to train the model on article-summary pairs.
- Validation set: Used for model performance evaluation and to apply early stopping.
## Requirements
- Python 3.x
- Keras
- TensorFlow
- NumPy
- GloVe Embeddings
## How to Run
1. Download the CNN/DailyMail dataset and pre-trained GloVe embeddings.
2. Preprocess the dataset and prepare the embedding matrices.
3. Train the model using the provided code.
4. Evaluate the model on a validation set and generate summaries for new text inputs.
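For the evaluation step, this card lists ROUGE as its metric. The sketch below computes a simplified ROUGE-1 F1 score from unigram overlap on lowercased, whitespace-split tokens. It is meant to show what the metric measures and is not the official scorer; the function name and example sentences are illustrative.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

For reported results, an established ROUGE implementation with stemming and proper tokenization is preferable to this simplified version.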
## Results
The model generates abstractive summaries of news articles. Tuning the latent dimension or the embedding size, or adding attention, may improve performance.
## Future Work
- **Attention Mechanism**: Implement Bahdanau or Luong attention for better results.
- **Beam Search**: Incorporate beam search for improved summary generation.
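As a rough illustration of the beam search item, the sketch below keeps the `beam_width` best partial sequences at each step instead of only the single most likely one. It is not part of this repository: `step_fn` is a hypothetical stand-in for one decoder step that returns next-token probabilities, and the toy probability table exists only to make the example runnable.

```python
import math

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=10):
    """step_fn(prefix) -> {token_id: probability} for the next token (hypothetical)."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((seq + [tok], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that emit the end token leave the beam
            (finished if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]

def toy_step(prefix):
    # Toy next-token distribution keyed by the last token (1 = start, 2 = end)
    table = {
        1: {3: 0.6, 4: 0.4},
        3: {2: 0.55, 5: 0.45},
        4: {2: 0.9, 5: 0.1},
        5: {2: 1.0},
    }
    return table[prefix[-1]]

print(beam_search(toy_step, start_id=1, end_id=2, beam_width=1))  # [1, 3, 2]
print(beam_search(toy_step, start_id=1, end_id=2, beam_width=2))  # [1, 4, 2]
```

With a beam width of 1 the search is greedy and commits to token 3 early; a width of 2 recovers the sequence with higher overall probability (0.36 versus 0.33). This sketch compares raw log-probabilities, which favors shorter outputs, so practical implementations often add length normalization.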
## Resources
- [Keras Documentation](https://keras.io/)
- [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/)