# Vietnamese Text Summarization with Poem
Summarize a piece of text with a poem. Doesn't that sound fun? <br>

## Introduction

Jokes aside, this is a fun project by my team at FPT University on fine-tuning a Large Language Model (LLM) to summarize a long piece of Vietnamese text in the form of **poems**. We call the model **VistralPoem5**. <br>
Here's a little example:
![image](/assets/example_data_transformed.png)

## HuggingFace 🤗
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "pphuc25/VistralPoem5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Chat-style input: the system prompt asks for a 5-word poem of 1-3 stanzas,
# the user message is the text to summarize.
inputs = [
    {"role": "system", "content": "Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ"},
    {"role": "user", "content": "nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau\nnên tự mình đứng dậy\nnhanh như có phép màu"}
]

input_ids = tokenizer.apply_chat_template(inputs, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.95,
    top_k=20,
    temperature=0.1,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
output_str = tokenizer.batch_decode(outputs[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()
print(output_str)
```

## Fine-tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/andythetechnerd03/Vietnamese-Text-Summarization-Poem/blob/main/notebooks/fine_tune_with_axolotl.ipynb)

This is not an easy task. The model we are using is a Vietnamese version of the popular [Mistral-7B](https://arxiv.org/abs/2310.06825), with 7 billion parameters. Fine-tuning it is very computationally expensive, so we applied several state-of-the-art optimization techniques:
- [Flash Attention](https://github.com/Dao-AILab/flash-attention): an IO-aware implementation of exact attention that minimizes redundant reads/writes to GPU memory, reducing the memory footprint of attention from $O(n^2)$ to $O(n)$ and speeding up both training and inference.
- [QLoRA (Quantized Low-Rank Adaptation)](https://arxiv.org/abs/2305.14314): trains a small "adapter" made of low-rank weight matrices, so only a tiny fraction of parameters is updated. The base model is also quantized to `4-bit`, which makes storing large models much cheaper. A rough sketch of this setup is shown below.
- [Mixed Precision Training](https://arxiv.org/abs/1710.03740): combines `float32` with the `bfloat16` data type for faster training.

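To make the QLoRA idea concrete, here is a minimal sketch of how a 4-bit base model plus a LoRA adapter can be set up with `transformers` and `peft`, using the rank/alpha values listed in the Model section. This is only an illustration; the actual training run goes through Axolotl, described next.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit (NF4) and compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "Viet-Mistral/Vistral-7B-Chat",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach a small trainable LoRA adapter (rank 32, alpha 16) to the linear layers.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```
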
To train the LLM as seamlessly as possible, we used a popular open-source fine-tuning platform called [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). It lets you declare the parameters and config and train quickly without writing much code.

### Code for fine-tuning model
To customize the configuration, modify the `create_file_config.py` file. After making your changes, run the script to generate your configuration file. The following is an example of how to launch training:
```bash
cd src
export PYTHONPATH="$PWD"
accelerate launch -m axolotl.cli.train config.yaml
```
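
We don't reproduce `create_file_config.py` here; as a rough illustration only (the keys below are standard Axolotl config fields chosen to match the hyperparameters in the Model section, not the script's actual contents), such a script essentially dumps the training options into `config.yaml`:

```python
import yaml

# Illustrative only: an Axolotl-style config matching the hyperparameters
# reported in the Model section below.
config = {
    "base_model": "Viet-Mistral/Vistral-7B-Chat",
    "load_in_4bit": True,
    "adapter": "qlora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_target_linear": True,
    "sequence_len": 1096,
    "gradient_accumulation_steps": 4,
    "learning_rate": 0.0002,
    "lr_scheduler": "cosine",
    "warmup_steps": 10,
    "max_steps": 400,
    "optimizer": "adamw_bnb_8bit",
    "bf16": True,
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```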

## Data
This is not easy either. Data where the input is a long text (a newspaper article or story) and the output is a poem is very hard to find. So we created our own... using *prompt engineering*.

- Collecting poems is straightforward. There are many repositories and prior works that have gathered Vietnamese poems, as well as publicly available samples online. We collected from [FPT Software AI Lab](https://github.com/fsoft-ailab/Poem-Generator) and [HuggingFace](https://github.com/fsoft-ailab/Poem-Generator).
- From each poem, we use prompt engineering to ask our base model to generate a story. The prompt has the form <br>
``` Bạn là một nhà kể chuyện phiếm, nhiệm vụ của bạn là hãy kể 1 câu chuyện đơn giản và ngắn gọn từ một bài thơ, câu chuyện nên là 1 bài liền mạch, thực tế\n\n{insert poem here}```
- Speaking of prompt engineering, there is another prompt to generate a poem from a context. <br>
```Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ: \n {insert context here}```
- The pre-processing step is fairly simple: a bit of lowercasing here, punctuation removal there, plus trimming poems to 1-3 random stanzas, and we are done. A sketch of this pipeline is shown below.

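As a rough illustration of the data-construction step (the helper names here are ours, not the project's actual scripts), each poem is lightly cleaned and then sent through the story-generation prompt:

```python
import random
import re

STORY_PROMPT = (
    "Bạn là một nhà kể chuyện phiếm, nhiệm vụ của bạn là hãy kể 1 câu chuyện "
    "đơn giản và ngắn gọn từ một bài thơ, câu chuyện nên là 1 bài liền mạch, thực tế\n\n{poem}"
)

def preprocess_poem(poem: str, max_stanzas: int = 3) -> str:
    """Lowercase, strip punctuation, and keep 1-3 random consecutive stanzas."""
    poem = poem.lower()
    poem = re.sub(r"[^\w\s]", "", poem)  # drop punctuation
    stanzas = [s.strip() for s in poem.split("\n\n") if s.strip()]
    k = random.randint(1, min(max_stanzas, len(stanzas)))
    start = random.randint(0, len(stanzas) - k)
    return "\n\n".join(stanzas[start:start + k])

def build_pair(poem: str, generate_fn) -> dict:
    """Create one (story -> poem) training pair; generate_fn wraps the base LLM."""
    poem = preprocess_poem(poem)
    story = generate_fn(STORY_PROMPT.format(poem=poem))
    return {"input": story, "output": poem}
```

Here `generate_fn` stands in for a call to the base chat model, along the lines of the HuggingFace example above.
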

In the end, we have 72,101 samples, split with a test ratio of 0.05 (68,495 in the train set and 3,606 in the test set).

We published the dataset [here](https://huggingface.co/datasets/pphuc25/poem-5-words-vietnamese).

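The published dataset can be pulled straight from the Hub; a minimal sketch (the `train`/`test` split names are assumed):

```python
from datasets import load_dataset

# Load the published poem dataset from the HuggingFace Hub.
dataset = load_dataset("pphuc25/poem-5-words-vietnamese")
print(dataset)              # inspect the available splits and columns
print(dataset["train"][0])  # peek at one training sample
```
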
### Custom Evaluation Data
For the final benchmark, we gathered around 27 Vietnamese children's stories and divided them into 118 samples. The dataset can be found [here](/data/eval_set.json).

## Model
As mentioned earlier, we use [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat) as the base model and fine-tune it on the curated dataset above. Here are a few configurations:
- The model is based on the Transformer decoder-only architecture:
  - Number of Attention Heads: 32
  - Hidden Size: 4096
  - Vocab size: 38369
  - Data type: bfloat16
  - Number of Hidden Layers (Nx): 32
- Loss function: Cross-entropy
- Parameter-Efficient Fine-tuning: QLoRA
  - 4-bit quantization
  - Alpha: 16
  - Rank: 32
  - Target: Linear
- Gradient accumulation: 4
- Learning Rate: 0.0002
- Warmup Steps: 10
- LR Scheduler: Cosine
- Max Steps: 400
- Batch size: 16
- Optimizer: AdamW (bitsandbytes 8-bit)
- Sequence Len: 1096

The weights can be found [here](https://huggingface.co/pphuc25/poem-vistral).

The notebook for training can be found at `notebook/Fine_tune_LLMs_with_Axolotl.ipynb`.

## Benchmark
We used the custom evaluation dataset for the benchmark. Since popular metrics such as ROUGE are not applicable to the poem format, we chose a simpler approach: measuring how often the generated output is actually a 5-word poem. <br>
Here's the result:

| Model | Number of Parameters | Hardware | Probability of 5-word (higher is better) | Average inference time (lower is better) |
|----------------------------|----------------------|----------------------|------------------------------------------|------------------------------------------|
| Vistral-7B-Chat (baseline) | 7B | 1x Nvidia Tesla A100 | 4.15% | 6.75s |
| Google Gemini Pro* | > 100B | **Multi-TPU** | 18.3% | 3.4s |
| **VistralPoem5 (Ours)** | **7B** | 1x Nvidia Tesla A100 | **61.4%** | **3.14s** |

&ast; API call, so inference time may be affected by network latency

The benchmark code can be found at `notebook/infer_poem_model.ipynb` and `notebook/probability_5word.ipynb`. A minimal sketch of the 5-word check is shown below.
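
For illustration, here is a simplified version of such a check (our simplification, not the notebooks' exact code): a poem counts as 5-word if every non-empty line has exactly five words, and the score is the fraction of generated poems that pass.

```python
def is_five_word_poem(poem: str) -> bool:
    """True if every non-empty line of the poem has exactly five words."""
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    return bool(lines) and all(len(line.split()) == 5 for line in lines)

def five_word_score(generated_poems: list[str]) -> float:
    """Fraction of generated poems that are valid 5-word poems."""
    return sum(is_five_word_poem(p) for p in generated_poems) / len(generated_poems)

# Example:
# five_word_score(["nhớ tới lời mẹ dặn\nsợ mẹ buồn con đau"])  # -> 1.0
```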

## Deployment
We used Gradio for quick deployment on Google Colab; the demo code is in `notebook/infer_poem_model.ipynb` as well.
![Screenshot 2024-03-09 185803](https://github.com/andythetechnerd03/Vietnamese-Poem-Summarization/assets/101492362/8bd94ed1-bb67-48fb-924e-17ad320e3005)
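
A minimal sketch of what such a Gradio wrapper can look like (not the notebook's exact implementation; it simply reuses the generation recipe from the HuggingFace section above):

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "pphuc25/VistralPoem5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

SYSTEM_PROMPT = ("Bạn là một nhà thơ chuyên nghiệp, nhiệm vụ của bạn là chuyển bài văn này "
                 "thành 1 bài thơ 5 chữ từ khoảng 1 đến 3 khổ")

def summarize_to_poem(text: str) -> str:
    # Same generation settings as the HuggingFace example above.
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    outputs = model.generate(input_ids=input_ids, max_new_tokens=200, do_sample=True,
                             top_p=0.95, top_k=20, temperature=0.1, repetition_penalty=1.05)
    return tokenizer.batch_decode(outputs[:, input_ids.size(1):], skip_special_tokens=True)[0].strip()

demo = gr.Interface(
    fn=summarize_to_poem,
    inputs=gr.Textbox(lines=10, label="Vietnamese text"),
    outputs=gr.Textbox(label="5-word poem"),
    title="VistralPoem5",
)
demo.launch(share=True)  # share=True gives a public link from Colab
```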

Docker Image, coming soon...

## Future Work
- [ ] Make a custom loss function to align rhythm and tones.
- [ ] Use a better metric for evaluating poems (rhythm and content summarization).
- [ ] Use RLHF to align poems with human values.
- [ ] And more...

## Credits
- [Phan Phuc](https://github.com/pphuc25) for doing the fine-tuning.
- [Me](https://github.com/andythetechnerd03) for designing the pipeline and testing the model.
- [Truong Vo](https://github.com/justinvo277) for collecting the data.