---
license: mit
datasets:
- andythetechnerd03/Vietnamese-Poem-5words
language:
- vi
library_name: transformers
tags:
- art
- text-generation-inference
base_model: Viet-Mistral/Vistral-7B-Chat
---
# Vietnamese Text Summarization with Poem
Summarize a piece of text with a poem. Doesn't that sound fun? <br>
## Introduction
Jokes aside, this is a fun project by my team at FPT University: we fine-tune a Large Language Model (LLM) to summarize a long piece of Vietnamese text in the form of a **poem**. We call the model **VistralPoem5**. <br>
Here's a little example:
![image](/assets/example_data_transformed.png)
## HuggingFace 🤗
``` python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "andythetechnerd03/VistralPoem5"
# device_map only applies to the model; the tokenizer does not need it.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# System prompt in English: "You are a professional poet; your task is to turn
# this text into a five-word poem of about 1 to 3 stanzas."
inputs = [
    {"role": "system", "content": "Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ"},
    {"role": "user", "content": "nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau\nnên tự mình đứng dậy\nnhanh như có phép màu"}
]
# Apply the model's chat template and move the token ids to the model's device.
input_ids = tokenizer.apply_chat_template(inputs, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.95,
    top_k=20,
    temperature=0.1,
    repetition_penalty=1.05,
)
# Decode only the newly generated tokens, skipping the prompt.
output_str = tokenizer.batch_decode(outputs[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()
print(output_str)
```
## Fine-tuning
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/andythetechnerd03/Vietnamese-Text-Summarization-Poem/blob/main/notebooks/fine_tune_with_axolotl.ipynb)
This is not an easy task. The model we are using is a Vietnamese version of the popular [Mistral-7B](https://arxiv.org/abs/2310.06825) with 7 billion parameters. Fine-tuning a model of this size is computationally expensive, so we applied several optimization techniques:
- [Flash Attention](https://github.com/Dao-AILab/flash-attention): computes exact attention without materializing the full $n \times n$ score matrix. This reduces the memory footprint of attention from $O(n^2)$ to $O(n)$ in the sequence length and speeds it up by cutting GPU memory reads and writes. The compute cost itself remains $O(n^2)$.
- [QLoRA (Quantized Low-Rank Adaptation)](https://arxiv.org/abs/2305.14314): trains a small "adapter" made of low-rank weight matrices while the base model stays frozen, which requires far less computation. Furthermore, the base model is quantized to `4-bit`, which greatly reduces the memory needed to hold a large model.
- [Mixed Precision Training](https://arxiv.org/abs/1710.03740): here we combine the `float32` and `bfloat16` data types for faster training.
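To give a rough sense of why quantization matters here, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes exactly 7 billion parameters (the rounded figure quoted above) and ignores quantization constants, activations, gradients, and optimizer state, so treat the numbers as a lower bound rather than a measurement. The function name `weight_memory_gb` is made up for this illustration.

``` python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: the rounded "7 billion parameters" figure from the text.
N_PARAMS = 7e9

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, the 4-bit base model needs about a quarter of the weight memory of a `bfloat16` copy.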
To make training as seamless as possible, we used a popular open-source fine-tuning platform called [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). It lets you declare the parameters in a config file and start training quickly without much code.
### Code for fine-tuning model
To customize the configuration, you can modify the `create_file_config.py` file. After making your changes, run the script to generate a personalized configuration file. The following is an example of how to execute the model training:
``` bash
cd src
export PYTHONPATH="$PWD"
accelerate launch -m axolotl.cli.train config.yaml
```
## Data
This is not easy either. Data that pairs a long input text (a newspaper article, a story) with an output poem is very hard to find. So we created our own... by using *prompt engineering*.
- Collecting the poems is straightforward. Many repositories and prior works have already collected Vietnamese poems, and there are publicly available samples online. We collected ours from [FPT Software AI Lab](https://github.com/fsoft-ailab/Poem-Generator) and [HuggingFace](https://github.com/fsoft-ailab/Poem-Generator).
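- From each poem, we use prompt engineering to ask our base model to generate a story. The prompt has the form <br>
``` Bạn là một nhà kể chuyện phiếm, nhiệm vụ của bạn là hãy kể 1 câu chuyện đơn giản và ngắn gọn từ một bài thơ, câu chuyện nên là 1 bài liền mạch, thực tế\n\n{insert poem here}``` <br>
(In English: "You are a casual storyteller. Your task is to tell a simple, short story based on a poem; the story should be one coherent, realistic piece.")
- Speaking of prompt engineering, there is another prompt that generates a poem from a context. <br>
```Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ: \n {insert context here}``` <br>
(In English: "You are a professional poet. Your task is to turn this text into a five-word poem of about 1 to 3 stanzas.")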
- The pre-processing step is fairly simple. A bit of lowercasing here, punctuation removal there, plus reducing poems to 1-3 random stanzas, and we are done.
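The pre-processing step described in the last bullet could look roughly like the sketch below. This is an illustration and not the project's actual code: the function name `preprocess_poem`, the assumption that stanzas are separated by blank lines, and the choice to keep *consecutive* stanzas are all ours.

``` python
import random
import re
import unicodedata


def preprocess_poem(poem: str, rng: random.Random) -> str:
    """Lowercase, strip punctuation, and keep 1 to 3 consecutive stanzas."""
    # NFC keeps Vietnamese letters as single precomposed code points,
    # so the punctuation regex below does not strip their diacritics.
    text = unicodedata.normalize("NFC", poem).lower()
    # Assumption: stanzas are separated by blank lines.
    stanzas = [s for s in re.split(r"\n\s*\n", text) if s.strip()]
    cleaned = []
    for stanza in stanzas:
        lines = []
        for line in stanza.splitlines():
            line = re.sub(r"[^\w\s]", "", line)  # drop punctuation
            line = " ".join(line.split())        # collapse extra whitespace
            if line:
                lines.append(line)
        if lines:
            cleaned.append("\n".join(lines))
    if not cleaned:
        return ""
    # Pick how many stanzas to keep (1 to 3), then a random starting stanza.
    keep = rng.randint(1, min(3, len(cleaned)))
    start = rng.randint(0, len(cleaned) - keep)
    return "\n\n".join(cleaned[start:start + keep])


example = "Nhớ tới lời mẹ dặn,\nSợ mẹ buồn con đau!\n\nNên tự mình đứng dậy.\nNhanh như có phép màu..."
print(preprocess_poem(example, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the stanza selection reproducible across runs.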
In total, we have 72,101 samples, split with a test ratio of 0.05 (68,495 in the train set and 3,606 in the test set).
We published the dataset [here](https://huggingface.co/datasets/andythetechnerd03/Vietnamese-Poem-5words).
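As a quick sanity check on the split sizes, the sketch below recomputes them from the total and the test ratio. The reported numbers match a split that rounds the test set size up; whether the dataset was actually split this way is an assumption, and `split_sizes` is a hypothetical helper.

``` python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_ratio: float) -> Tuple[int, int]:
    """Return (n_train, n_test), rounding the test set size up."""
    n_test = math.ceil(n_samples * test_ratio)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(72_101, 0.05)
print(n_train, n_test)  # 68495 3606
```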
### Custom Evaluation Data
For the final benchmark evaluation, we gathered around 27 Vietnamese children's stories and divided them into 118 samples in total. The dataset can be found [here](/data/eval_set.json).
## Model
As mentioned earlier, we use [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat) as the base model and fine-tune it on the dataset curated above. Here are a few configurations:
- The model is based on the Transformer decoder-only architecture:
  - Number of Attention Heads: 32
  - Hidden Size: 4096
  - Vocab size: 38369
  - Data type: bfloat16
  - Number of Hidden Layers (Nx): 32
- Loss function: Cross-entropy
- Parameter-Efficient Fine-tuning: QLoRA
  - 4 bit
  - Alpha: 16
  - Rank: 32
  - Target: Linear
- Gradient accumulation: 4
- Learning Rate: 0.0002
- Warmup Steps: 10
- LR Scheduler: Cosine
- Max Steps: 400
- Batch size: 16
- Optimizer: AdamW (bitsandbytes 8-bit)
- Sequence Len: 1096
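To make the QLoRA settings more concrete, here is a small arithmetic sketch of what rank 32 and alpha 16 mean for a single linear layer. It uses a square 4096 x 4096 projection purely as an example; the real model also has non-square linear layers, so this is not a count of the full adapter. The helper `lora_param_count` is hypothetical.

``` python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # hidden size from the list above
RANK = 32      # LoRA rank from the list above
ALPHA = 16     # LoRA alpha from the list above

full = HIDDEN * HIDDEN                           # weights in the frozen layer
adapter = lora_param_count(HIDDEN, HIDDEN, RANK)  # weights actually trained
scaling = ALPHA / RANK                           # factor applied to the low-rank update

print(f"frozen layer: {full:,} parameters")
print(f"adapter:      {adapter:,} parameters")
print(f"scaling:      {scaling}")
```

For this example layer the adapter trains roughly 1.6% as many parameters as the layer it adapts, and the low-rank update is scaled by alpha / rank = 0.5.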
The weights can be found [here](https://huggingface.co/andythetechnerd03/VistralPoem5)
The notebook for training can be found at `notebook/Fine_tune_LLMs_with_Axolotl.ipynb`
## Benchmark
We used the custom evaluation dataset for benchmarking. Since popular metrics such as ROUGE are not applicable to the poem format, we chose a simpler approach: measuring the probability of 5-word poems in the results. <br>
Here's the result:
| Model                      | Number of Parameters | Hardware             | Probability of 5-word (higher is better) | Average inference time (lower is better) |
|----------------------------|----------------------|----------------------|------------------------------------------|------------------------------------------|
| Vistral-7B-Chat (baseline) | 7B                   | 1x Nvidia Tesla A100 | 4.15%                                    | 6.75s                                    |
| Google Gemini Pro*         | > 100B               | **Multi-TPU**        | 18.3%                                    | 3.4s                                     |
| **VistralPoem5 (Ours)**    | **7B**               | 1x Nvidia Tesla A100 | **61.4%**                                | **3.14s**                                |

\* Called through the API, so the inference time may be affected.
The benchmark code can be found at `notebook/infer_poem_model.ipynb` and `notebook/probability_5word.ipynb`
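For readers who want a feel for the metric, here is a minimal sketch of one way to compute a "probability of 5-word" score. It assumes an output counts as a 5-word poem only when every non-empty line has exactly five space-separated words; the notebooks may define it differently, and the function names below are ours.

``` python
from typing import List


def is_five_word_poem(text: str) -> bool:
    """True if every non-empty line has exactly five space-separated words."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(len(words) == 5 for words in lines)


def five_word_rate(outputs: List[str]) -> float:
    """Fraction of model outputs that are 5-word poems."""
    if not outputs:
        return 0.0
    return sum(is_five_word_poem(o) for o in outputs) / len(outputs)


outputs = [
    "nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau",  # every line has five words
    "hôm nay trời đẹp quá đi",                 # six words, so not a match
]
print(five_word_rate(outputs))  # 0.5
```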
## Deployment
We used Gradio for fast deployment on Google Colab. The deployment code is also in `notebook/infer_poem_model.ipynb`.
![Screenshot 2024-03-09 185803](https://github.com/andythetechnerd03/Vietnamese-Poem-Summarization/assets/101492362/8bd94ed1-bb67-48fb-924e-17ad320e3005)
Docker Image, coming soon...
## Future Work
- [ ] Make a custom loss function to align rhythm and tones.
- [ ] Use a better metric for evaluating poems (rhythm and content summarization)
- [ ] Use RLHF to align poems with human values.
- [ ] And more...
## Credits
- [Phan Phuc](https://github.com/pphuc25) for doing the fine-tuning.
- [Me](https://github.com/andythetechnerd03) for designing the pipeline and testing the model.
- [Truong Vo](https://github.com/justinvo277) for collecting the data.