# Arabic and English Translator with Next Token Prediction
This project implements a neural network-based language model for next token prediction in both English and Arabic. It explores natural language processing tasks using RNNs or LSTMs for text generation.
## Project Overview
- **Languages**: English and Arabic
- **Model Architecture**: Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs)
- **Features**:
  - Next token prediction
  - Text generation
  - Checkpoint saving
  - Perplexity score tracking
- **Dataset**: Based on the multilingual Alpaca dataset.
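Next token prediction means training the model to predict each token from the tokens that precede it. The following is a minimal sketch of how such (context, target) pairs can be built. Whitespace tokenization, the fixed context size, and the helper name `make_next_token_pairs` are assumptions for illustration and are not taken from the notebook.

```python
def make_next_token_pairs(text, context_size=3):
    """Split text on whitespace and build (context, next_token) pairs.

    Whitespace tokenization is an assumption for illustration; the
    notebook may use a different tokenizer.
    """
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        # The context is at most `context_size` tokens ending just before i.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


english = make_next_token_pairs("the cat sat on the mat")
arabic = make_next_token_pairs("القطة تجلس على السجادة")

for context, target in english + arabic:
    print(context, "->", target)
```

The same function works for both languages because it only relies on whitespace between words; a subword tokenizer would change the tokens but not the shifting logic.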
## Dataset
The dataset used for this project is derived from the [Multilingual Alpaca Dataset](https://huggingface.co/datasets/saillab/taco-datasets/tree/main/multilingual-instruction-tuning-dataset/multilingual-alpaca-52k-gpt-4), which contains multilingual instruction-tuning examples generated using GPT-4.
It includes high-quality, diverse text samples in multiple languages, including English and Arabic, making it ideal for next-token prediction tasks.
## Files
- `Q8_Evaluation_Arabic_and_English_Translator.ipynb`: Jupyter notebook containing the implementation.
- `README.md`: Project description and usage details.
- `models/`: Pre-trained models and checkpoints (if included).
- `data/`: Training and evaluation datasets.
## Installation
1. Clone this repository:
   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
## How to Use
1. Open the Jupyter notebook:
   ```bash
   jupyter notebook Q8_Evaluation_Arabic_and_English_Translator.ipynb
   ```
2. Follow the instructions in the notebook to:
   - Load the dataset
   - Train the model
   - Evaluate performance
   - Generate text in English and Arabic
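Text generation with a next-token model is usually autoregressive: the model predicts one token, the token is appended to the context, and the process repeats. The sketch below shows that loop with greedy decoding. The `next_token_probs` callable is a hypothetical stand-in for the trained RNN/LSTM, and the toy lookup table is made up for illustration; the notebook's own generation code may differ.

```python
def generate(next_token_probs, prompt, max_new_tokens=5, end_token="<eos>"):
    """Greedy autoregressive generation.

    next_token_probs: hypothetical callable mapping a list of tokens to a
    dict {token: probability}; it stands in for the trained model.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Greedy decoding: pick the most probable next token.
        next_token = max(probs, key=probs.get)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens


# Toy stand-in for a model: probabilities depend only on the last token.
TOY_TABLE = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}


def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})


print(" ".join(generate(toy_model, ["the"])))
```

Greedy decoding is the simplest choice; sampling from the predicted distribution instead of taking the maximum gives more varied text.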
## Evaluation Metrics
- **Perplexity**: Used to evaluate the model’s performance in predicting the next token; lower values mean better predictions.
- **Text Examples**: Generated samples are provided for both languages.
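Perplexity is the exponential of the average negative log-likelihood that the model assigns to the correct next tokens, that is, `exp(-(1/N) * sum(log p_i))`. The sketch below computes it from a list of per-token probabilities. The function name and the sample values are illustrative and are not results from this project.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each correct next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood over all predicted tokens.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that always gives the correct token probability 0.25 is as
# uncertain as a uniform choice among four tokens.
print(perplexity([0.25, 0.25, 0.25]))
```

In a training loop the same quantity is usually obtained by exponentiating the mean cross-entropy loss over an evaluation set.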
## Checkpoints
Model checkpoints are saved during training so that training can be resumed or the model re-evaluated later.
- Example: `checkpoints/model_epoch_{epoch}.pth`
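To resume training, the most recent checkpoint has to be located first. The sketch below builds paths that follow the `checkpoints/model_epoch_{epoch}.pth` pattern above and finds the highest-numbered file. The helper names `checkpoint_path` and `latest_checkpoint` are hypothetical and do not come from the notebook.

```python
import re
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # directory name taken from the example above
PATTERN = re.compile(r"model_epoch_(\d+)\.pth$")


def checkpoint_path(epoch, directory=CHECKPOINT_DIR):
    """Build the checkpoint path for a given epoch."""
    return Path(directory) / f"model_epoch_{epoch}.pth"


def latest_checkpoint(directory=CHECKPOINT_DIR):
    """Return (epoch, path) of the highest-numbered checkpoint, or None."""
    found = []
    for path in Path(directory).glob("model_epoch_*.pth"):
        match = PATTERN.search(path.name)
        if match:
            # Compare epochs numerically so that epoch 10 sorts after epoch 2.
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0], default=None)


print(checkpoint_path(3))
```

The `.pth` extension suggests PyTorch; if that assumption holds, the path would typically be passed to `torch.save(model.state_dict(), path)` when saving and to `torch.load(path)` when resuming.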
## Results
- **Generated Text**: Examples of text outputs in both languages.
- **Perplexity Scores**: Performance evaluation over training epochs.
## Hugging Face Integration
The trained model is hosted on Hugging Face. You can download and test it using the link:
Hugging Face Model Repository
## Contributing
Feel free to contribute by submitting pull requests or reporting issues.
## License
This project is licensed under the MIT License. See the LICENSE file for details.

    @@ -9,4 +9,5 @@ base_model:
     - microsoft/Phi-3.5-mini-instruct
     datasets:
     - sieu-n/alpaca_eval_multilingual
    +pipeline_tag: translation
     ---