# Simple RNN Language Model
This project implements a Recurrent Neural Network (RNN) for character-level language modeling and text generation using PyTorch. The model is designed to predict the next character in a sequence, enabling the generation of coherent text based on a given starting point.
## Requirements
- Python 3.x
- PyTorch
- pandas (for logging losses)
- matplotlib (for loss visualization)
- Hugging Face Datasets (for dataset loading)
## Dataset
The dataset for this project is built from the **Hugging Face TaCo Dataset**, which provides a variety of textual data for training. A character-level vocabulary and tokenizer are generated from the dataset to encode and decode text sequences.
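A minimal sketch of the vocabulary and tokenizer setup is shown below. The dataset identifier and the `"text"` column name are placeholders, not the project's actual values; substitute the real TaCo dataset path used in your setup.

```python
from datasets import load_dataset

# Placeholder dataset identifier and column name; replace with the actual
# TaCo dataset path and text field used in this project.
ds = load_dataset("username/taco-dataset", split="train")
text = "\n".join(row["text"] for row in ds)

# Build a character-level vocabulary from the corpus.
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s):
    """Encode a string into a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Decode a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)
```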
## Model Architecture
The model comprises the following layers (a minimal sketch follows the list):
1. **Embedding Layer**: Converts input characters into dense vectors.
2. **LSTM Layers**: Three Long Short-Term Memory (LSTM) layers are used to process the sequences of characters and retain long-term dependencies.
3. **Fully Connected Layers**: Multiple fully connected layers transform the LSTM outputs to predict the next character.
4. **Dropout Layer**: Used to prevent overfitting by randomly zeroing out some of the activations.
5. **Final Output Layer**: A final linear projection produces logits over the vocabulary; a softmax turns them into probabilities for the next character.
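The sketch below illustrates this architecture under the stated assumptions (three stacked LSTM layers, fully connected head, dropout). The class and argument names are illustrative, not the project's exact code.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Character-level LSTM language model (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Three stacked LSTM layers; dropout is applied between stacked layers.
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        # Fully connected layers project LSTM features to vocabulary logits.
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, vocab_size),
        )

    def forward(self, idx, hidden=None):
        x = self.embedding(idx)             # (B, T) -> (B, T, E)
        out, hidden = self.lstm(x, hidden)  # (B, T, H)
        out = self.dropout(out)
        logits = self.fc(out)               # (B, T, vocab_size)
        # Softmax over the vocabulary is applied implicitly by the loss during
        # training and explicitly during sampling.
        return logits, hidden
```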
## Hyperparameters
The following hyperparameters can be configured (an example configuration follows the list):
- **Batch size**: Number of samples processed per training step.
- **Block size**: Length of the input sequences.
- **Learning rate**: Step size for the optimizer.
- **Embedding dimension**: Dimensionality of the character embeddings.
- **Hidden layer sizes**: Sizes of the hidden layers in the LSTM network.
- **Dropout**: Dropout probability for regularization.
- **Maximum iterations**: Maximum number of training iterations.
- **Evaluation interval**: Number of steps between model evaluations.
- **Evaluation iterations**: Number of batches used to estimate the loss at each evaluation.
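An example configuration is shown below; the values are illustrative defaults, not the project's tuned settings.

```python
# Example hyperparameters; values are illustrative, not tuned.
config = {
    "batch_size": 64,        # samples per training step
    "block_size": 128,       # input sequence length in characters
    "learning_rate": 3e-4,   # AdamW step size
    "embed_dim": 128,        # character embedding dimensionality
    "hidden_size": 256,      # LSTM hidden state size
    "num_layers": 3,         # stacked LSTM layers
    "dropout": 0.2,          # dropout probability
    "max_iters": 10_000,     # maximum training iterations
    "eval_interval": 500,    # steps between evaluations
    "eval_iters": 100,       # batches averaged per evaluation
}
```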
## Training Process
The model is trained by splitting the dataset into training and validation sets. The training process involves the following steps:
- Fetching batches of training and validation data.
- Calculating the cross-entropy loss to measure the error between predictions and the actual next character in the sequence.
- Using the AdamW optimizer to minimize the loss.
- Periodically evaluating the model on the validation set to monitor performance.
Checkpoints are saved during training to allow for resuming from a saved state.
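A condensed training loop along these lines is sketched below. It reuses the `CharRNN` class, `config` dict, and `vocab_size` from the earlier sketches, and assumes `train_data` and `val_data` are 1-D long tensors of encoded characters; the batching helper is illustrative.

```python
import torch
import torch.nn.functional as F

def get_batch(data, block_size, batch_size):
    """Sample a random batch of input/target character sequences."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

model = CharRNN(vocab_size, config["embed_dim"], config["hidden_size"],
                config["num_layers"], config["dropout"])
optimizer = torch.optim.AdamW(model.parameters(), lr=config["learning_rate"])

for step in range(config["max_iters"]):
    xb, yb = get_batch(train_data, config["block_size"], config["batch_size"])
    logits, _ = model(xb)
    # Cross-entropy between predicted logits and the actual next characters.
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step % config["eval_interval"] == 0:
        # Periodically estimate validation loss and save a checkpoint
        # (see the Checkpoints section for the full saved contents).
        with torch.no_grad():
            vx, vy = get_batch(val_data, config["block_size"], config["batch_size"])
            v_logits, _ = model(vx)
            val_loss = F.cross_entropy(v_logits.view(-1, v_logits.size(-1)), vy.view(-1))
        print(f"step {step}: train {loss.item():.3f}, val {val_loss.item():.3f}")
```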
## Text Generation
After training, the model can generate text from a given starting string. It repeatedly predicts the next character and appends it to the sequence, producing text that follows the patterns it has learned from the dataset.
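Generation proceeds autoregressively, feeding each new character back into the model. The sketch below samples from the softmax distribution (greedy or temperature-scaled decoding are common variations) and reuses the `encode`/`decode` helpers and model from the earlier sketches.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, start, max_new_tokens=200):
    """Autoregressively generate text from a starting string (illustrative)."""
    model.eval()
    idx = torch.tensor([encode(start)], dtype=torch.long)
    hidden = None
    out = list(start)
    # Prime the hidden state on the starting string, then sample one character at a time.
    for _ in range(max_new_tokens):
        logits, hidden = model(idx, hidden)
        probs = F.softmax(logits[:, -1, :], dim=-1)    # distribution over the next char
        idx = torch.multinomial(probs, num_samples=1)  # sample the next character id
        out.append(decode([idx.item()]))
    return "".join(out)

print(generate(model, "Once upon a time"))
```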
## Perplexity Evaluation
Perplexity is a standard language-modeling metric that measures how well the model predicts the next character in a sequence; it equals the exponential of the mean cross-entropy loss on held-out data. Lower perplexity values indicate better performance.
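Because perplexity is the exponential of the mean cross-entropy, it can be estimated directly from validation batches. The sketch below reuses the `get_batch` helper and model from the training section.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, data, block_size, batch_size, n_batches=100):
    """Estimate perplexity as exp(mean cross-entropy) over sampled batches."""
    model.eval()
    losses = []
    for _ in range(n_batches):
        xb, yb = get_batch(data, block_size, batch_size)
        logits, _ = model(xb)
        losses.append(F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1)))
    return torch.exp(torch.stack(losses).mean()).item()
```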
## Visualization
Training and validation losses are logged and can be visualized using matplotlib. The loss curves help in understanding the training dynamics and detecting overfitting or underfitting.
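If the logged losses are written to a CSV via pandas, they can be plotted as below. The file name and column names (`step`, `train_loss`, `val_loss`) are assumptions about the logging format, not the project's actual schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes losses were logged with columns "step", "train_loss", "val_loss".
log = pd.read_csv("losses.csv")

plt.plot(log["step"], log["train_loss"], label="train")
plt.plot(log["step"], log["val_loss"], label="validation")
plt.xlabel("training step")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.title("Training and validation loss")
plt.show()
```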
## Checkpoints and Model Saving
The model checkpoints and optimizer states are saved at regular intervals during training. This allows you to resume training from a checkpoint or use the model for inference after training is completed. The saved model contains all necessary information, including the vocabulary, hyperparameters, and trained weights.
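A checkpoint along these lines bundles everything needed to resume training or run inference; the dictionary keys and file name are illustrative, not the project's exact format.

```python
import torch

# Saving: bundle weights, optimizer state, vocabulary, and hyperparameters.
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "stoi": stoi,
    "itos": itos,
    "config": config,
    "step": step,
}, "checkpoint.pt")

# Loading: rebuild the model from the saved hyperparameters, then restore weights.
ckpt = torch.load("checkpoint.pt", map_location="cpu")
cfg = ckpt["config"]
model = CharRNN(len(ckpt["stoi"]), cfg["embed_dim"], cfg["hidden_size"],
                cfg["num_layers"], cfg["dropout"])
model.load_state_dict(ckpt["model_state"])
```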
## Sample Outputs
After training, the model generates text sequences that follow the patterns learned from the dataset. Example outputs from the model include both coherent sentences and less meaningful sequences, depending on the complexity of the dataset.
## License
This project is open-source and can be freely modified and distributed.