# Gupshup
GupShup: Summarizing Open-Domain Code-Switched Conversations (EMNLP 2021)

Paper: [https://aclanthology.org/2021.emnlp-main.499.pdf](https://aclanthology.org/2021.emnlp-main.499.pdf)

GitHub: [https://github.com/midas-research/gupshup](https://github.com/midas-research/gupshup)

## Dataset
Please request the GupShup data using [this Google form](https://docs.google.com/forms/d/1zvUk7WcldVF3RCoHdWzQPzPprtSJClrnHoIOYbzaJEI/edit?ts=61381ec0).

The dataset is available for two tasks: `Hinglish Dialogues to English Summarization` (h2e) and `English Dialogues to English Summarization` (e2e). For each task, the dialogues are stored in files with the `.source` extension (for example, `train.source`) and the summaries in files with the `.target` extension (for example, `train.target`). Pass the `.source` file to the `input_path` argument and the `.target` file to the `reference_path` argument of the scripts.
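
As a quick sanity check on a downloaded split, the sketch below pairs each dialogue with its summary. It assumes that both files hold one example per line and are aligned by line number; this layout is an assumption, not something stated above. The helper name `load_pairs` is hypothetical and not part of the repository.

```python
def load_pairs(source_path, target_path):
    """Return a list of (dialogue, summary) tuples from a .source/.target pair.

    Assumption: one example per line, with line i of the .source file
    corresponding to line i of the .target file.
    """
    with open(source_path, encoding="utf-8") as src:
        dialogues = [line.rstrip("\n") for line in src]
    with open(target_path, encoding="utf-8") as tgt:
        summaries = [line.rstrip("\n") for line in tgt]

    # A length mismatch usually means the files come from different splits.
    if len(dialogues) != len(summaries):
        raise ValueError(
            f"{source_path} has {len(dialogues)} lines but "
            f"{target_path} has {len(summaries)} lines"
        )
    return list(zip(dialogues, summaries))
```

For example, `load_pairs("data/h2e/train.source", "data/h2e/train.target")` would return the paired training examples if the files follow that layout.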


## Models
All model weights are available on the Huggingface model hub. You can either download the weights locally and pass the local path to the `model_name` argument of the scripts, or pass one of the aliases listed below to `model_name`, in which case the scripts download the weights automatically.

Model aliases follow the pattern `gupshup_TASK_MODEL`, where `TASK` is `h2e` or `e2e` and `MODEL` is `mbart`, `pegasus`, etc., as listed below.
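
The naming pattern can be sketched as a small helper. The function name `gupshup_alias` is hypothetical. The `midas/` prefix and the model suffixes are copied from the aliases listed in this README; nothing here checks that a given alias exists on the hub.

```python
# Task and model suffixes as they appear in the aliases listed in this README.
TASKS = ("h2e", "e2e")
MODELS = ("mbart", "pegasus", "t5_mtl", "t5", "bart", "gpt")


def gupshup_alias(task, model):
    """Build a hub alias of the form midas/gupshup_TASK_MODEL."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    return f"midas/gupshup_{task}_{model}"
```

The result, for instance `gupshup_alias("h2e", "mbart")`, is the string to pass as `model_name`.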

**1. Hinglish Dialogues to English Summary (h2e)**

| Model   | Huggingface Alias                                                             |
|---------|-------------------------------------------------------------------------------|
| mBART   | [midas/gupshup_h2e_mbart](https://huggingface.co/midas/gupshup_h2e_mbart)     |
| PEGASUS | [midas/gupshup_h2e_pegasus](https://huggingface.co/midas/gupshup_h2e_pegasus) |
| T5 MTL  | [midas/gupshup_h2e_t5_mtl](https://huggingface.co/midas/gupshup_h2e_t5_mtl)   |
| T5      | [midas/gupshup_h2e_t5](https://huggingface.co/midas/gupshup_h2e_t5)           |
| BART    | [midas/gupshup_h2e_bart](https://huggingface.co/midas/gupshup_h2e_bart)       |
| GPT-2   | [midas/gupshup_h2e_gpt](https://huggingface.co/midas/gupshup_h2e_gpt)         |


**2. English Dialogues to English Summary (e2e)**

| Model   | Huggingface Alias                                                             |
|---------|-------------------------------------------------------------------------------|
| mBART   | [midas/gupshup_e2e_mbart](https://huggingface.co/midas/gupshup_e2e_mbart)     |
| PEGASUS | [midas/gupshup_e2e_pegasus](https://huggingface.co/midas/gupshup_e2e_pegasus) |
| T5 MTL  | [midas/gupshup_e2e_t5_mtl](https://huggingface.co/midas/gupshup_e2e_t5_mtl)   |
| T5      | [midas/gupshup_e2e_t5](https://huggingface.co/midas/gupshup_e2e_t5)           |
| BART    | [midas/gupshup_e2e_bart](https://huggingface.co/midas/gupshup_e2e_bart)       |
| GPT-2   | [midas/gupshup_e2e_gpt](https://huggingface.co/midas/gupshup_e2e_gpt)         |

## Inference

### Using command line
1. Clone this repo and create a Python virtual environment (https://docs.python.org/3/library/venv.html). Install the required packages using
```
git clone https://github.com/midas-research/gupshup.git
cd gupshup
pip install -r requirements.txt
```

2. The `run_eval.py` script takes the following arguments.
*   **model_name** : Path to, or alias of, one of our models available on Huggingface, as listed above.
*   **input_path** : Path to the source file containing the conversations to be summarized.
*   **save_path** : File path where the summaries generated by the model are saved.
*   **reference_path** : Path to the target file containing the reference summaries, used to calculate metrics.
*   **score_path** : File path where the scores are saved.
*   **bs** : Batch size.
*   **device** : CUDA devices to use.

Please make sure you have downloaded the GupShup dataset using the Google form above and provide the correct paths to these files in the `input_path` and `reference_path` arguments. Alternatively, you can simply put `test.source` and `test.target` in the `data/h2e/` (Hinglish to English) or `data/e2e/` (English to English) folder. For example, to generate English summaries from Hinglish dialogues using the mBART model, run the following command:

```
python run_eval.py \
    --model_name midas/gupshup_h2e_mbart \
    --input_path  data/h2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/h2e/test.target \
    --score_path scores.txt \
    --bs 8

```
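
Before launching a command like the one above, it can help to confirm that the test files are where the command expects them. The sketch below only checks for `test.source` and `test.target` under `data/h2e/` or `data/e2e/`, as described in this README. The helper names and the `data_root` parameter are hypothetical.

```python
from pathlib import Path


def expected_test_files(task, data_root="data"):
    """Return the (source, target) paths used in the example commands."""
    if task not in ("h2e", "e2e"):
        raise ValueError(f"unknown task {task!r}; expected 'h2e' or 'e2e'")
    folder = Path(data_root) / task
    return folder / "test.source", folder / "test.target"


def missing_test_files(task, data_root="data"):
    """List the expected test files that are not present on disk."""
    return [p for p in expected_test_files(task, data_root) if not p.is_file()]
```

Run from the repository root, `missing_test_files("h2e")` returns an empty list when both files are in place.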

As another example, to generate English summaries from English dialogues using the PEGASUS model, run:
```
python run_eval.py \
    --model_name midas/gupshup_e2e_pegasus \
    --input_path  data/e2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/e2e/test.target \
    --score_path scores.txt \
    --bs 8

```
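
The two commands above differ only in the task and the model, so evaluating several models could be scripted. The following is a minimal sketch, assuming the flags behave exactly as in the examples above. The helper name `build_eval_command` and its default file names are illustrative. The `device` argument is left out because its expected value format is not documented here.

```python
import shlex


def build_eval_command(task, model, bs=8,
                       save_path="generated_summary.txt",
                       score_path="scores.txt"):
    """Assemble a run_eval.py command as an argument list.

    Only flags that appear in the example commands are used.
    """
    return [
        "python", "run_eval.py",
        "--model_name", f"midas/gupshup_{task}_{model}",
        "--input_path", f"data/{task}/test.source",
        "--save_path", save_path,
        "--reference_path", f"data/{task}/test.target",
        "--score_path", score_path,
        "--bs", str(bs),
    ]


def as_shell(cmd):
    """Render the argument list as a single shell-quoted line."""
    return shlex.join(cmd)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Giving each model its own `save_path` and `score_path` keeps one run from overwriting another.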


Please create an issue if you face any difficulties in replicating the results.

## References

Please cite [[1]](https://aclanthology.org/2021.emnlp-main.499.pdf) if you find the resources in this repository useful.


[1] Mehnaz, Laiba, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, and Rajiv Shah. [*GupShup: Summarizing Open-Domain Code-Switched Conversations*](https://aclanthology.org/2021.emnlp-main.499.pdf). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6177-6192, 2021.


```
@inproceedings{mehnaz2021gupshup,
  title={GupShup: Summarizing Open-Domain Code-Switched Conversations},
  author={Mehnaz, Laiba and Mahata, Debanjan and Gosangi, Rakesh and Gunturi, Uma Sushmitha and Jain, Riya and Gupta, Gauri and Kumar, Amardeep and Lee, Isabelle G and Acharya, Anish and Shah, Rajiv},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={6177--6192},
  year={2021}
}

```