---
language: es
tags:
- sagemaker
- vit
- ImageClassification
- generated_from_trainer
license: apache-2.0
datasets:
- cifar100
metrics:
- accuracy
model-index:
- name: vit_base-224-in21k-ft-cifar100
  results:
  - task:
      name: Image Classification
      type: image-classification
    dataset:
      name: "Cifar100"
      type: cifar100
    metrics:
      - name: Accuracy
        type: accuracy
        value: 0.9148
---

# Model vit_base-224-in21k-ft-cifar100

## **A fine-tuned model for image classification**

This model was trained using Amazon SageMaker and the Hugging Face Deep Learning Container.

The base model is the **Vision Transformer (base-sized model)**, a transformer encoder (BERT-like) pretrained in a supervised fashion on ImageNet-21k, a large collection of images, at a resolution of 224x224 pixels. [Link to base model](https://huggingface.co/google/vit-base-patch16-224-in21k)
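
As a rough illustration of what the 224x224 input resolution means for the encoder, the sketch below counts the tokens the transformer sees per image. The 16x16 patch size is read off the base checkpoint name (`vit-base-patch16-224-in21k`); the extra `[CLS]` token and the helper name `vit_sequence_length` are assumptions based on the standard ViT design, not details stated in this card.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Tokens per image: one per non-overlapping patch plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# 224 / 16 = 14 patches per side, 14 * 14 = 196 patches, plus [CLS]
print(vit_sequence_length())  # 197
```

CIFAR-100 images are only 32x32, so they have to be upscaled to 224x224 before they reach the model; the feature extractor is expected to handle that step.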

## Base model citation

### BibTeX entry and citation info

```bibtex
@misc{wu2020visual,
      title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
      author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
      year={2020},
      eprint={2006.03677},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Dataset

[Link to dataset description](http://www.cs.toronto.edu/~kriz/cifar.html)

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

This dataset, CIFAR-100, is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

Sizes of datasets:

- Train dataset: 50,000
- Test dataset: 10,000
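
The per-class counts quoted above are consistent with the split sizes, which the short sketch below checks. The constant names are made up for illustration, and the figure of 5 classes per superclass is derived as 100 / 20 on the assumption that the superclasses are equally sized.

```python
NUM_CLASSES = 100        # "fine" labels
NUM_SUPERCLASSES = 20    # "coarse" labels
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 100

train_size = NUM_CLASSES * TRAIN_PER_CLASS
test_size = NUM_CLASSES * TEST_PER_CLASS
# Assumes every superclass groups the same number of fine classes
classes_per_superclass = NUM_CLASSES // NUM_SUPERCLASSES

print(train_size, test_size, classes_per_superclass)  # 50000 10000 5
```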

## Intended uses & limitations

This model is intended for image classification.

## Hyperparameters

    {
        "epochs": "5",
        "train_batch_size": "32",
        "eval_batch_size": "8",
        "fp16": "true",
        "learning_rate": "1e-05"
    }
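
The values above are strings, which is how SageMaker hands hyperparameters to a training script (as command-line arguments). The sketch below shows one way such a script might parse them into typed values and what they imply for the number of optimizer steps. The function name `parse_hyperparameters` is hypothetical, and the step count assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these are confirmed by this card.

```python
import argparse
import math


def parse_hyperparameters(argv):
    """Parse string hyperparameters into typed values (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    # Booleans arrive as the strings "true" / "false"
    parser.add_argument("--fp16", type=lambda s: s.lower() == "true", default=True)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    return parser.parse_args(argv)


args = parse_hyperparameters([
    "--epochs", "5",
    "--train_batch_size", "32",
    "--eval_batch_size", "8",
    "--fp16", "true",
    "--learning_rate", "1e-05",
])

TRAIN_SIZE = 50_000  # CIFAR-100 training split
# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(TRAIN_SIZE / args.train_batch_size)
total_steps = steps_per_epoch * args.epochs
print(steps_per_epoch, total_steps)  # 1563 7815
```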

## Test results

- Accuracy = 0.9148
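
For reference, accuracy here is the fraction of images whose predicted class matches the label. The sketch below is a minimal plain-Python version of that metric, not the evaluation code used for this model. The figure of 9,148 correct images is only implied if the score was computed on the full 10,000-image test split, which is an assumption.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: 3 of 4 predictions match
print(accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # 0.75

# Assumption: the reported 0.9148 was measured on all 10,000 test images
implied_correct = round(0.9148 * 10_000)
print(implied_correct)  # 9148
```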

## Model in action

### Usage for Image Classification

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

# Download a sample image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess with the base model's feature extractor
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
# Load the fine-tuned checkpoint together with its classification head
model = ViTForImageClassification.from_pretrained('edumunozsala/vit_base-224-in21k-ft-cifar100')
inputs = feature_extractor(images=image, return_tensors="pt")

outputs = model(**inputs)
# The logits hold one score per class; the highest score is the prediction
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
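
When more than the single best guess is useful, per-class scores can be turned into probabilities and ranked. The sketch below does this in plain Python for a list of logits; it assumes the model is loaded with a classification head that returns one logit per class, and the helper name `top_k` is made up for illustration.

```python
import math


def top_k(logits, k=5):
    """Return the k (class_index, probability) pairs with the highest softmax score."""
    m = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits for a 3-class problem: class 1 ranks first, then class 2
for class_idx, prob in top_k([1.0, 3.0, 2.0], k=2):
    print(class_idx, round(prob, 3))
```

With a PyTorch output, a list such as `outputs.logits[0].tolist()` could be passed in, and the indices mapped to names through the model's `id2label` configuration.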

Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)