AI Guard Vision Model Card

License: Apache 2.0

Overview

AI Guard Vision is a Vision Transformer (ViT) model fine-tuned for binary image classification. Its primary objective is to distinguish real images from AI-generated synthetic ones. The model addresses the growing challenge of detecting manipulated or fake visual content to preserve trust and integrity in digital media.

Model Summary

  • Model Type: Vision Transformer (ViT) – vit-base-patch16-224
  • Objective: Real vs. AI-generated image classification
  • License: Apache 2.0
  • Fine-tuned From: google/vit-base-patch16-224
  • Training Dataset: CIFake Dataset
  • Developer: Aashish Kumar, IIIT Manipur

Applications & Use Cases

  • Content Moderation: Identifying AI-generated images across media platforms.
  • Digital Forensics: Verifying the authenticity of visual content for investigative purposes.
  • Trust Preservation: Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images.

How to Use the Model

from transformers import AutoImageProcessor, ViTForImageClassification
import torch
from PIL import Image
from pillow_heif import register_heif_opener

# Let Pillow open HEIF/HEIC files.
register_heif_opener()

# AVIF registration may not be available in every pillow_heif release, so treat it as optional.
try:
    from pillow_heif import register_avif_opener
    register_avif_opener()
except ImportError:
    pass

MODEL_ID = "AashishKumar/AIvisionGuard-v2"

# Load the processor and model once, rather than on every call.
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = ViTForImageClassification.from_pretrained(MODEL_ID)

def get_prediction(img):
    """Return the top-2 labels with their raw logit scores for an image path or file object."""
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # A single top-k call yields both the scores and the class indices.
    top2 = logits.topk(2)
    top2_scores = top2.values.squeeze(0).tolist()
    top2_labels = top2.indices.squeeze(0).tolist()

    return [
        {"label": model.config.id2label[label], "score": score}
        for label, score in zip(top2_labels, top2_scores)
    ]
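
The score values returned by get_prediction are raw logits rather than probabilities. Assuming the classifier head has exactly two outputs (real vs. AI-generated), the top-2 logits cover the whole output, so a softmax over them gives class probabilities. The sketch below uses only the standard library. The helper name scores_to_probabilities and the REAL/FAKE labels in the example are hypothetical, since this card does not list the model's label names.

```python
import math

def scores_to_probabilities(response):
    """Convert a list of {"label", "score"} dicts holding raw logits
    into {"label", "probability"} dicts via a numerically stable softmax."""
    logits = [item["score"] for item in response]
    max_logit = max(logits)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [
        {"label": item["label"], "probability": e / total}
        for item, e in zip(response, exps)
    ]

# Hypothetical output of get_prediction(); the labels and logits are made up.
example_response = [
    {"label": "REAL", "score": 2.0},
    {"label": "FAKE", "score": -1.0},
]
probabilities = scores_to_probabilities(example_response)
```

With two classes, the resulting probabilities sum to 1, which makes them easier to compare against a decision threshold than the raw logits.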

Dataset Information

The model was fine-tuned on the CIFake dataset, which contains both real and AI-generated synthetic images:

  • Real Images: Collected from the CIFAR-10 dataset.
  • Fake Images: Generated using Stable Diffusion 1.4.
  • Training Data: 100,000 images (50,000 per class).
  • Testing Data: 20,000 images (10,000 per class).
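
As a quick sanity check of the split sizes listed above, the images per class can be counted after downloading the dataset. The sketch below assumes a hypothetical on-disk layout of one subdirectory per class inside each split directory (for example cifake/train/REAL and cifake/train/FAKE). The directory names and the count_images helper are illustrative and are not specified by this model card.

```python
from collections import Counter
from pathlib import Path

# Extensions treated as images; adjust if the local copy uses other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images(split_dir):
    """Count image files in each class subdirectory of split_dir."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return dict(counts)
```

For a training split matching the figures above, each of the two class folders would be expected to contain 50,000 images.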

Model Architecture

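  • Transformer Encoder Layers: Use self-attention to relate image patches to one another.
  • Positional Encodings: Added to the patch embeddings so the model retains the spatial arrangement of the patches.
  • Pretrained Weights: The base checkpoint (google/vit-base-patch16-224) was pretrained on ImageNet-21k and fine-tuned on ImageNet 2012 before being fine-tuned on CIFake.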
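
The patch16-224 suffix in the checkpoint name describes how an image becomes a token sequence: a 224×224 input is cut into non-overlapping 16×16 patches. The arithmetic below is a worked sketch of that step. The extra classification token follows the standard ViT design, and the function name is illustrative rather than part of any library.

```python
def vit_sequence_shape(image_size=224, patch_size=16, channels=3):
    """Return the patch grid and token counts for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        # One extra token: the classification token prepended to the patches.
        "sequence_length": num_patches + 1,
        # Raw pixel values in one flattened patch, before the linear projection.
        "values_per_patch": patch_size * patch_size * channels,
    }

shape = vit_sequence_shape()
print(shape)
# {'patches_per_side': 14, 'num_patches': 196, 'sequence_length': 197, 'values_per_patch': 768}
```

Since CIFAR-10 images are 32×32 pixels, inputs derived from them need to be resized to 224×224 before patching, which the image processor typically handles.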

Why Vision Transformer?

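  • Scalability and Performance: Self-attention lets every patch attend to every other patch, which supports high-level global feature extraction.
  • Competitive Accuracy: When pretrained on large datasets, ViT models have matched or exceeded comparable CNNs on standard image classification benchmarks.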

Training Details

  • Learning Rate: 1e-7 (0.0000001)
  • Batch Size: 64
  • Epochs: 100
  • Training Time: 1 hr 36 min
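
To put these settings in perspective, the number of optimizer steps can be estimated from the figures above. This is a back-of-the-envelope sketch that assumes every epoch passes over all 100,000 training images, that the final partial batch is kept, and that there is no gradient accumulation. None of these details are stated in this card.

```python
import math

train_images = 100_000   # from the dataset section
batch_size = 64
epochs = 100
training_minutes = 96    # 1 hr 36 min

# Round up so the final, smaller batch counts as a step.
steps_per_epoch = math.ceil(train_images / batch_size)
total_steps = steps_per_epoch * epochs
seconds_per_step = training_minutes * 60 / total_steps

print(steps_per_epoch)             # 1563
print(total_steps)                 # 156300
print(round(seconds_per_step, 3))  # 0.037
```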

Evaluation Metrics

The model was evaluated using the CIFake test dataset, with the following metrics:

  • Accuracy: 92%
  • F1 Score: 0.89
  • Precision: 0.85
  • Recall: 0.88

Model            Accuracy   F1 Score   Precision   Recall
Baseline         85%        0.82       0.78        0.80
Augmented        88%        0.85       0.83        0.84
Fine-tuned ViT   92%        0.89       0.85        0.88
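
For reference, these four metrics are conventionally derived from the counts in a binary confusion matrix. The sketch below uses made-up counts purely for illustration. They are not this model's results, and treating the AI-generated class as the positive class is an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, with "AI-generated" as the positive class.
metrics = binary_metrics(tp=90, fp=10, fn=20, tn=80)
```

With these example counts, accuracy is 0.85, precision is 0.90, recall is about 0.82, and F1 is about 0.86.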

Evaluation Fig: [image not included]

System Workflow

  • Frontend: ReactJS
  • Backend: Python Flask
  • Database: PostgreSQL (Supabase)
  • Model: Deployed via the PyTorch and TensorFlow frameworks

Strengths and Limitations

Strengths:

  • High Accuracy: Achieves 92% accuracy on the CIFake test set when distinguishing real from synthetic images.
  • Pretrained on ImageNet-21k: Allows for efficient transfer learning and robust generalization.

Limitations:

  • Synthetic Image Diversity: The model may underperform on synthetic images that differ significantly from the training data, for example images produced by generators other than Stable Diffusion 1.4.
  • Data Bias: Like all machine learning models, its predictions may reflect biases present in the training data.

Conclusion and Future Work

This model provides a tool for detecting AI-generated synthetic images, with potential applications in content moderation, digital forensics, and trust preservation. Future improvements may include:

  • Hybrid Architectures: Combining transformers with convolutional layers for improved performance.
  • Multimodal Detection: Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection.