STEM-AI-mtl committed on
Update README.md
README.md CHANGED
@@ -1,15 +1,17 @@
 ---
 license: other
+license_name: stem.ai.mtl
+license_link: LICENSE
 tags:
 - vision
 - image-classification
 - STEM-AI-mtl/City_map
 - Google
 - ViT
+- STEM-AI-mtl
 datasets:
 - STEM-AI-mtl/City_map
 
-license_link: LICENSE
 widget:
 - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
   example_title: Tiger
@@ -19,11 +21,9 @@ widget:
   example_title: Palace
 ---
 
-#
-
-Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
+# The fine-tuned ViT model that beats [Google's base model](https://huggingface.co/google/vit-base-patch16-224)
 
-
+Image-classification model that identifies which city map is illustrated in an input image.
 
 ## Model description
 
@@ -33,10 +33,6 @@ Images are presented to the model as a sequence of fixed-size patches (resolutio
 
 By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
 
-## Intended uses & limitations
-
-You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=google/vit) to look for
-fine-tuned versions on a task that interests you.
 
 ### How to use
 
@@ -47,16 +43,16 @@ from transformers import ViTImageProcessor, ViTForImageClassification
 from PIL import Image
 import requests
 
-url = '
+url = 'https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg'
 image = Image.open(requests.get(url, stream=True).raw)
 
-processor = ViTImageProcessor.from_pretrained('
-model = ViTForImageClassification.from_pretrained('
+processor = ViTImageProcessor.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')
+model = ViTForImageClassification.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')
 
 inputs = processor(images=image, return_tensors="pt")
 outputs = model(**inputs)
 logits = outputs.logits
 
 predicted_class_idx = logits.argmax(-1).item()
 print("Predicted class:", model.config.id2label[predicted_class_idx])
 ```
@@ -65,7 +61,7 @@ For more code examples, we refer to the [documentation](https://huggingface.co/t
 
 ## Training data
 
-This ViT model was fine-tuned on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), contaning
+This model was fine-tuned from [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), which contains over 600 images of 45 different maps of cities around the world.
 
 ## Training procedure
 
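The model description above explains that a classifier is trained by placing a linear layer on top of the [CLS] token of the pre-trained encoder. A minimal sketch of that idea only, where the ImageNet-21k backbone checkpoint, the 45-class head size, and the example image are illustrative assumptions rather than anything taken from this commit:

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Illustrative choices: an ImageNet-21k pre-trained backbone and a 45-class head (one per city map).
backbone = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
classifier_head = torch.nn.Linear(backbone.config.hidden_size, 45)

url = 'https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
outputs = backbone(**inputs)
cls_state = outputs.last_hidden_state[:, 0]  # last hidden state of the [CLS] token
logits = classifier_head(cls_state)          # one logit per class; this head is what gets trained on labeled images
```

In practice, `ViTForImageClassification` bundles exactly this linear head, which is what the "How to use" snippet in the diff relies on.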
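The "How to use" snippet in the diff drives the processor and model by hand. The same checkpoint can also be called through the `transformers` pipeline API; a brief sketch, assuming the checkpoint id from the snippet above, with the `top_k` value and the reused example image URL only as illustrations:

```python
from transformers import pipeline

# Wrap the fine-tuned checkpoint referenced in the card in an image-classification pipeline.
classifier = pipeline("image-classification", model="STEM-AI-mtl/City_map-vit-base-patch16-224")

# The pipeline accepts a URL, a local path, or a PIL image; this reuses the card's example image.
predictions = classifier(
    "https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg",
    top_k=3,
)
for p in predictions:
    print(p["label"], round(p["score"], 3))
```

`top_k` only caps how many labels are returned; the predicted class is the same as in the manual snippet.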
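The training-procedure section itself is not shown in this excerpt. Purely as an illustration of how a ViT fine-tune on an image-classification dataset such as City_map is commonly set up with `transformers` and `datasets`, where the column names `image` and `label`, the ClassLabel assumption, and every hyperparameter below are hypothetical and not taken from this repository:

```python
import torch
from datasets import load_dataset
from transformers import (ViTImageProcessor, ViTForImageClassification,
                          TrainingArguments, Trainer)

# Assumption: the dataset exposes 'image' (PIL) and 'label' (ClassLabel) columns.
dataset = load_dataset("STEM-AI-mtl/City_map")
labels = dataset["train"].features["label"].names

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # replace the 1,000-class ImageNet head with a fresh one
)

def transform(batch):
    # Turn PIL images into pixel values; keep the integer class labels.
    inputs = processor(images=[img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate(examples):
    # Stack per-example tensors into a training batch.
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

encoded = dataset.with_transform(transform)

# Hypothetical hyperparameters, for illustration only.
args = TrainingArguments(output_dir="city-map-vit", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-4, remove_unused_columns=False)
trainer = Trainer(model=model, args=args, data_collator=collate,
                  train_dataset=encoded["train"])
trainer.train()
```

The actual recipe for this checkpoint is whatever the card's "Training procedure" section documents; the sketch above only illustrates the standard Trainer setup for this kind of fine-tune.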