AI Guard Vision Model Card
Overview
This model, AI Guard Vision, is a Vision Transformer (ViT)-based architecture designed for image classification tasks. Its primary objective is to accurately distinguish between real and AI-generated synthetic images. The model addresses the growing challenge of detecting manipulated or fake visual content to preserve trust and integrity in digital media.
Model Summary
- Model Type: Vision Transformer (ViT), vit-base-patch16-224
- Objective: Real vs. AI-generated image classification
- License: Apache 2.0
- Fine-tuned From: google/vit-base-patch16-224
- Training Dataset: CIFake Dataset
- Developer: Aashish Kumar, IIIT Manipur
Applications & Use Cases
- Content Moderation: Identifying AI-generated images across media platforms.
- Digital Forensics: Verifying the authenticity of visual content for investigative purposes.
- Trust Preservation: Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images.
How to Use the Model
from transformers import AutoImageProcessor, ViTForImageClassification
import torch
from PIL import Image
from pillow_heif import register_heif_opener, register_avif_opener

# Let PIL open HEIF and AVIF files in addition to its built-in formats.
# register_avif_opener may be missing in newer pillow_heif releases;
# remove it if the import fails.
register_heif_opener()
register_avif_opener()

# Load the processor and model once, not on every call.
image_processor = AutoImageProcessor.from_pretrained("AashishKumar/AIvisionGuard-v2")
model = ViTForImageClassification.from_pretrained("AashishKumar/AIvisionGuard-v2")
model.eval()

def get_prediction(img):
    """Return both class labels with their probabilities, most likely first."""
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to probabilities so scores are comparable across images.
    probs = logits.softmax(dim=-1).squeeze(0)
    top2 = probs.topk(2)
    return [
        {"label": model.config.id2label[label], "score": score}
        for label, score in zip(top2.indices.tolist(), top2.values.tolist())
    ]
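The list returned by `get_prediction` can be reduced to a single verdict. The following is a minimal sketch, assuming the scores are probabilities in [0, 1] (apply a softmax first if they are raw logits). The `summarize` helper, the 0.8 threshold, and the `REAL`/`FAKE` label names in the example are illustrative assumptions; the actual label names come from `model.config.id2label`.

```python
def summarize(response, threshold=0.8):
    """Pick the top label, or 'uncertain' when its score is below threshold.

    `response` is a list of {"label": str, "score": float} dicts, as returned
    by get_prediction. Scores are assumed to be probabilities in [0, 1].
    """
    if not response:
        raise ValueError("response is empty")
    best = max(response, key=lambda item: item["score"])
    verdict = best["label"] if best["score"] >= threshold else "uncertain"
    return {"verdict": verdict, "score": best["score"]}


# Hypothetical output; real label names come from model.config.id2label.
example = [{"label": "FAKE", "score": 0.93}, {"label": "REAL", "score": 0.07}]
print(summarize(example))
```

Returning "uncertain" below the threshold leaves borderline images for human review, which may suit moderation workflows better than a forced binary answer.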
Dataset Information
The model was fine-tuned on the CIFake dataset, which contains both real and AI-generated synthetic images:
- Real Images: Collected from the CIFAR-10 dataset.
- Fake Images: Generated using Stable Diffusion 1.4.
- Training Data: 100,000 images (50,000 per class).
- Testing Data: 20,000 images (10,000 per class).
Model Architecture
- Transformer Encoder Layers: Use self-attention across image patches.
- Positional Encodings: Help the model retain where each patch sits in the image.
- Pretrained Weights: The base checkpoint was pretrained on ImageNet-21k and fine-tuned on ImageNet 2012.
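The checkpoint name `vit-base-patch16-224` encodes a 16-pixel patch size and a 224-pixel input resolution. As a rough sketch of what that implies for the encoder's input sequence, the hypothetical helper below (not part of the model code) computes the patch grid, assuming square RGB inputs and one extra classification token:

```python
def vit_token_layout(image_size=224, patch_size=16, channels=3):
    """Describe how a square image is split into patches for a ViT encoder."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    num_patches = per_side ** 2
    return {
        "patches_per_side": per_side,
        "num_patches": num_patches,
        # One classification token is prepended to the patch tokens.
        "sequence_length": num_patches + 1,
        # Number of raw pixel values flattened from each patch.
        "patch_dim": patch_size * patch_size * channels,
    }


print(vit_token_layout())
```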
Why Vision Transformer?
- Scalability and Performance: Self-attention relates every image patch to every other patch, which suits high-level global feature extraction.
- Accuracy: With large-scale pretraining, ViTs can match or outperform traditional CNN models on image classification.
Training Details
- Learning Rate: 0.0000001 (1e-7)
- Batch Size: 64
- Epochs: 100
- Training Time: 1 hr 36 min
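These settings can be turned into a step count. The sketch below is only arithmetic on the figures listed here and the 100,000-image training set; it assumes every epoch covers the full training set, that the last partial batch is kept, and that the reported time covers all 100 epochs. The `training_schedule` helper is hypothetical.

```python
import math


def training_schedule(num_examples, batch_size, epochs, total_seconds):
    """Derive step counts and average epoch duration from training settings."""
    # Assumes the final, smaller batch is kept rather than dropped.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        "seconds_per_epoch": total_seconds / epochs,
    }


# 1 hr 36 min = 96 minutes.
schedule = training_schedule(100_000, 64, 100, total_seconds=96 * 60)
print(schedule)
```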
Evaluation Metrics
The model was evaluated using the CIFake test dataset, with the following metrics:
- Accuracy: 92%
- F1 Score: 0.89
- Precision: 0.85
- Recall: 0.88
| Model | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|
| Baseline | 85% | 0.82 | 0.78 | 0.80 |
| Augmented | 88% | 0.85 | 0.83 | 0.84 |
| Fine-tuned ViT | 92% | 0.89 | 0.85 | 0.88 |
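For reference, the four metrics above can be derived from confusion-matrix counts. This is a generic sketch, not the evaluation script used for this model: the `binary_metrics` helper is hypothetical, treating AI-generated images as the positive class is an assumption, and the counts in the example are made up for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no examples")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only, not this model's results.
print(binary_metrics(tp=90, fp=10, fn=10, tn=90))
```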
System Workflow
- Frontend: ReactJS
- Backend: Python Flask
- Database: PostgreSQL (Supabase)
- Model: Deployed via PyTorch and TensorFlow frameworks
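The backend code is not shown in this card. As one possible shape for the classification step, the sketch below is a framework-agnostic handler that could be called from a Flask route: it takes the uploaded bytes and a prediction function, and returns a status code with a JSON body. The `handle_classify` name, the status codes, and the response fields are assumptions, and `fake_predict` is a stand-in for a real predictor that accepts a file-like object.

```python
import io
import json


def handle_classify(image_bytes, predict):
    """Run `predict` on uploaded bytes and return (status_code, json_body)."""
    if not image_bytes:
        return 400, json.dumps({"error": "empty upload"})
    try:
        # Wrap the bytes so the predictor can read them like a file.
        predictions = predict(io.BytesIO(image_bytes))
    except Exception as exc:  # Broad on purpose: any failure becomes a 422.
        return 422, json.dumps({"error": f"could not classify image: {exc}"})
    return 200, json.dumps({"predictions": predictions})


def fake_predict(file_obj):
    """Stand-in predictor for demonstration; returns a fixed answer."""
    return [{"label": "FAKE", "score": 0.9}]


status, body = handle_classify(b"\x89PNG", fake_predict)
print(status, body)
```

Keeping the handler free of framework imports makes it easy to unit-test without starting a server.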
Strengths and Limitations
Strengths:
- High Accuracy: Reaches 92% accuracy on the CIFake test set, ahead of the baseline (85%) and augmented (88%) models.
- Pretrained on ImageNet-21k: Allows for efficient transfer learning and robust generalization.
Limitations:
- Synthetic Image Diversity: The model may underperform on synthetic images that differ significantly from the training data, for example images from generators other than Stable Diffusion 1.4.
- Data Bias: Like all machine learning models, its predictions may reflect biases present in the training data.
Conclusion and Future Work
This model provides a highly effective tool for detecting AI-generated synthetic images and has promising applications in content moderation, digital forensics, and trust preservation. Future improvements may include:
- Hybrid Architectures: Combining transformers with convolutional layers for improved performance.
- Multimodal Detection: Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection.