Final Project in AI Engineering, DVAE26, ht24
by Marlene Kulowatz
Instructions
This project provides practical experience in machine learning, covering aspects from data quality and preprocessing to model development and deployment. The task is designed to highlight the importance of data quality in AI and also to engage students in the entire process of machine learning model development, including software engineering best practices and deployment on a modern platform like Hugging Face.
Objective: Develop a model for image recognition.
Key Considerations:
- Data Augmentation: Enhance the dataset's diversity and robustness through augmentation techniques.
- Model Architecture: Select or design a convolutional neural network (CNN) for image classification.
- Evaluation Metrics: Use appropriate metrics like accuracy, precision, and recall for image-related tasks.
- SE Best Practices: Follow software engineering (SE) best practices for code quality, including modularization and version control.
Dataset: MNIST dataset (Handwritten digits)
Model Development: Develop and evaluate the model, focusing on accuracy, efficiency, and interpretability while respecting SE best practices.
Deployment: Deploy the model on the Hugging Face platform and showcase its application.
Data Quality Report: Include a detailed analysis of data quality, challenges faced, and measures taken to ensure data integrity.
Report and Presentation: Include workflow pipeline, model development, data quality analysis, SE best practices, and key findings.
Final Deliverables:
Well-Documented Code Repository
Workflow Pipeline
Data Acquisition
Initially, I downloaded the whole dataset, added an input folder, and accessed the data locally. To make the data acquisition easy and reproducible, this was later replaced by the dataset provided by Keras.
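A minimal sketch of loading the data through Keras (using the tensorflow.keras API; the project's notebook may differ in detail):

```python
from tensorflow.keras.datasets import mnist

# Download (or load from the local Keras cache) the MNIST dataset.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```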
Data Preprocessing
The images are 28x28 pixels and were normalized. The data was then augmented using ImageDataGenerator from Keras (see the sketch after this list). The following considerations were made:
- flipping the images horizontally does not make sense, because 6 and 9 become too similar
- for the same reason, the rotation range cannot be too large
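A sketch of the normalization and augmentation described above; the concrete parameter values (rotation range, shifts, zoom) are illustrative assumptions, not necessarily the ones used in the project:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Reshape to (n, 28, 28, 1) and scale pixel values from [0, 255] to [0, 1].
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# No flips at all, and only a small rotation, so that digits such as
# 6 and 9 are not turned into each other.
datagen = ImageDataGenerator(
    rotation_range=10,       # assumption: small rotation, in degrees
    width_shift_range=0.1,   # assumption: slight horizontal translation
    height_shift_range=0.1,  # assumption: slight vertical translation
    zoom_range=0.1,          # assumption: mild zoom
)
```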
Model Development
For the non-augmented dataset, a simple CNN model with only a few layers already achieves great results. For the augmented dataset, more layers were added to handle the added complexity. The following layers were chosen (a sketch of the resulting model follows the list):
Conv2D Layers
- 32 filters with a 3x3 kernel -> captures basic features like edges and textures in 28x28 pixel images from MNIST (1st layer)
- uses ReLU activation to introduce non-linearity, which helps the model learn patterns.
- input_shape=(28, 28, 1) tells the model to expect grayscale images of size 28x28.
- 64 filters to capture more complex features (2nd layer)
MaxPooling2D Layers
- reduces the spatial size of the feature maps and lowers computational cost.
- helps the model handle small changes in image position, which is useful for augmented data (rotations, translations, zooms)
Flatten Layer
- turns the 2D feature maps into a 1D vector to feed into the fully connected layers.
Dense Layer
- helps the model learn complex patterns from the features extracted earlier
Dropout Layer
- randomly sets half of the units to zero during training
- prevents overfitting, especially with augmented data
- helps the model generalize better
Output Layer
- gives the probabilities for each of the 10 classes (digits 0-9 in MNIST)
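A sketch of the architecture described above, wrapped in a reusable creation function in line with the modular-code practice mentioned later; the Dense layer width and the optimizer are assumptions, since the report does not fix them:

```python
from tensorflow.keras import layers, models

def build_model():
    """Create the CNN described above (some sizes are assumed)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      input_shape=(28, 28, 1)),      # basic edges/textures
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"), # more complex features
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                             # 2D feature maps -> 1D vector
        layers.Dense(128, activation="relu"),         # assumption: 128 units
        layers.Dropout(0.5),                          # "half of the units", as stated
        layers.Dense(10, activation="softmax"),       # probabilities for digits 0-9
    ])
    model.compile(optimizer="adam",                   # assumption: Adam optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```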
Evaluation
The model was evaluated on the following metric:
- accuracy
Accuracy is a suitable primary metric here because the MNIST classes are well balanced, so it directly reflects how many digits the model classifies correctly and makes different CNN architectures easy to compare. For imbalanced datasets, precision and recall would carry more weight.
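A sketch of training on the augmented data and reporting test accuracy; the epoch count and batch size are assumptions:

```python
model = build_model()

# Train on batches produced by the augmentation generator defined earlier.
model.fit(datagen.flow(x_train, y_train, batch_size=32),  # assumption: batch size
          epochs=5,                                       # assumption: epoch count
          validation_data=(x_test, y_test))

# model.evaluate returns the loss plus the metrics passed to compile().
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.4f}")
```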
Deployment
Finally, the code was deployed on the platform Hugging Face. It is publicly accessible and can be found by searching for "maykulo" and "final_project_ai_engineering". The repository can also be cloned by running: git clone https://huggingface.co/maykulo/final_project_ai_engineering
Data Quality Analysis
Analysis of Data Quality
The MNIST dataset is a well-known dataset that is often used for benchmarking. It consists of 70,000 handwritten digits. Each image is a 28x28 pixel grayscale square. The data is already divided into a training set (60,000 images) and a test set (10,000 images), and each image is associated with a corresponding label from 0 to 9.
Challenges Faced
The data has to be transformed before it can be used: the grayscale pixel values have to be rescaled to the [0, 1] range, and it is important to always make sure the input is correctly formatted for the designed model.
Measures Taken to Ensure Data Integrity
The data quality was evaluated throughout the whole process. Visual verifications were added at every step, visualizing parts of the original dataset, the rescaled data, and the augmented dataset. The format of the data was also checked using unit tests.
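A hypothetical example of the kind of format check used; the project's actual unit tests may differ:

```python
import numpy as np

def test_input_format(x, y):
    """Check that images and labels match what the model expects."""
    assert x.shape[1:] == (28, 28, 1), "images must be 28x28 with 1 channel"
    assert x.dtype == np.float32, "pixels must be floats after normalization"
    assert 0.0 <= x.min() and x.max() <= 1.0, "pixels must lie in [0, 1]"
    assert set(np.unique(y)) <= set(range(10)), "labels must be digits 0-9"

test_input_format(x_train, y_train)
test_input_format(x_test, y_test)
```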
SE Best Practices
- Modular Code: Reusable functions for data preprocessing and model creation were created.
- Documentation: The Code was well documented and the project maintains a README file including detailed project objectives and results.
- Hyperparameter Tuning: Different configurations for learning rate, optimizer, and architecture were tested (see the sketch after this list).
- Testing: Unit tests were implemented, along with extensive logging.
- Version Control: The Jupyter notebooks were uploaded to a repository and continuously updated, with descriptive commit messages.
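A hypothetical sketch of how such configurations could be compared; the candidate optimizers and learning rates below are assumptions, since the actual grid is not documented in this report:

```python
from tensorflow.keras.optimizers import Adam, SGD

# Try a few optimizer/learning-rate combinations and keep the best test accuracy.
results = {}
for opt_cls in (Adam, SGD):                 # assumption: candidate optimizers
    for lr in (1e-2, 1e-3):                 # assumption: candidate learning rates
        model = build_model()               # model-creation function from above
        model.compile(optimizer=opt_cls(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=3, batch_size=32, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        results[(opt_cls.__name__, lr)] = acc

print(max(results, key=results.get))        # best (optimizer, learning rate) pair
```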
Key Outcomes
Results
Applying a simple CNN to the original, unaugmented dataset already achieves good results. As soon as augmentation comes into play, however, the same model that performed so well before suddenly performs worse. When the augmented dataset is fed to a more complex model, it achieves relatively high results again, very similar to the unaugmented set in combination with the simple CNN.
Interpretation
The results indicate that the simple CNN is too basic to fully benefit from data augmentation, while a more complex CNN is better suited for augmented data. When data augmentation is used, the dataset becomes more diverse. Simpler models may excel in situations where the data is already simple.
- Trade-Off Between Simplicity and Generalization: There is a trade-off between model simplicity (which works well for non-augmented data) and complexity (which handles augmented data better).
General
On a personal level, this was the first time I applied and implemented a CNN. Watching how much longer the calculations take as soon as a single layer is added helped deepen my understanding. It was also impressive to experience how small changes in the model architecture can have a big impact on the outcomes. Finally, I learned that the field is very large and there is much more to learn.