---
license: other
license_name: stem.ai.mtl
license_link: LICENSE
tags:
- vision
- image-classification
- STEM-AI-mtl/City_map
- Google
- ViT
- STEM-AI-mtl
datasets:
- STEM-AI-mtl/City_map
---
# The fine-tuned ViT model that beats [Google's state-of-the-art model](https://huggingface.co/google/vit-base-patch16-224) and OpenAI's famous GPT4 for maps of cities around the world
A fine-tuned image-classification model that identifies which city is shown in an input image of a map.
The Vision Transformer (ViT) base model is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.
- **Developed by:** STEM.AI
- **Model type:** Image classification of maps of cities
- **Finetuned from model:** [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224)
### How to use:
[Inference script](https://github.com/STEM-ai/Vision/blob/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/ViT_inference.py)
For more code examples, see the [ViT documentation](https://huggingface.co/transformers/model_doc/vit.html#).
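As a minimal sketch of local inference (not the linked inference script itself), the snippet below loads a ViT checkpoint with the Transformers library and returns the most probable city labels for one image. The function names `top_k_labels` and `classify_city_map` and the `model_id` and `image_path` arguments are hypothetical; the checkpoint is assumed to store its class names in `config.id2label`.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Turn raw logits into the k most probable (label, probability) pairs."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify_city_map(image_path, model_id, k=3):
    """Classify one map image. `model_id` is a placeholder for the repo id or
    local path of the fine-tuned checkpoint (an assumption, not given here)."""
    # Heavy dependencies are imported lazily so the helper above stays standalone.
    import torch
    from PIL import Image
    from transformers import ViTForImageClassification, ViTImageProcessor

    processor = ViTImageProcessor.from_pretrained(model_id)
    model = ViTForImageClassification.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")  # resize + normalize
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k_labels(logits, model.config.id2label, k=k)
```

Calling `classify_city_map("map.png", model_id)` with this model's repository id would return a list of `(city, probability)` pairs, highest probability first.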
## Training data
This model, a version of [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) adapted for city identification, was fine-tuned on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), containing over 600 images of 45 different maps of cities around the world.
## Training procedure
[google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) was fine-tuned on an Nvidia GTX 1650 GPU with 4 GB of memory.
[Training notebook](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/Trainer_ViT.ipynb)
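The sketch below shows one plausible way to set up such a fine-tuning run with the Transformers library; it is not taken from the training notebook. It swaps the 1,000-class ImageNet head of the base checkpoint for one sized to the city labels and builds training arguments with the 1e-3 learning rate reported in this card. The helper names, the `output_dir` value, and the use of `TrainingArguments` are assumptions.

```python
def build_label_maps(class_names):
    """Build id <-> label mappings from a list of city names (duplicates allowed)."""
    names = sorted(set(class_names))
    id2label = {i: name for i, name in enumerate(names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_model_for_finetuning(class_names, base="google/vit-base-patch16-224"):
    """Load the base ViT with a fresh classification head for the city labels."""
    from transformers import ViTForImageClassification

    id2label, label2id = build_label_maps(class_names)
    return ViTForImageClassification.from_pretrained(
        base,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # discard the 1,000-class ImageNet head
    )


def make_training_args(output_dir="vit-city-map"):
    """Training arguments; only the learning rate comes from this card."""
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,         # hypothetical output folder
        learning_rate=1e-3,            # value reported as most accurate
        remove_unused_columns=False,   # keep the image column for the collator
    )
```

Batch size, number of epochs, and data preprocessing are deliberately left out because this card does not state them; the linked notebook is the reference for those.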
## Training evaluation results
The most accurate model was obtained with a learning rate of 1e-3. Training quality was evaluated on the training dataset and resulted in the following metrics:
{'eval_loss': 1.3691096305847168,\
'eval_accuracy': 0.6666666666666666,\
'eval_runtime': 13.0277,\
'eval_samples_per_second': 4.606,\
'eval_steps_per_second': 0.154,\
'epoch': 2.82}
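
To help read these numbers, the short calculation below recovers the approximate size of the evaluation run from the reported throughput and compares the accuracy with a uniform-guess baseline over the 45 city classes. The results are estimates: they rely on rounded throughput figures and on the assumption that classes are roughly balanced.

```python
# Metrics copied from the evaluation output above.
metrics = {
    "eval_accuracy": 0.6666666666666666,
    "eval_runtime": 13.0277,
    "eval_samples_per_second": 4.606,
    "eval_steps_per_second": 0.154,
}
num_classes = 45  # number of city maps in the dataset

# Samples and steps are throughput multiplied by runtime, rounded to integers.
n_samples = round(metrics["eval_runtime"] * metrics["eval_samples_per_second"])
n_steps = round(metrics["eval_runtime"] * metrics["eval_steps_per_second"])
n_correct = round(metrics["eval_accuracy"] * n_samples)

# Accuracy expected from guessing uniformly at random among the classes.
chance_accuracy = 1 / num_classes

print(f"evaluated samples: {n_samples}, steps: {n_steps}, correct: {n_correct}")
print(f"chance baseline: {chance_accuracy:.3f} vs reported: {metrics['eval_accuracy']:.3f}")
```

Under these assumptions the evaluation covered about 60 images in 2 steps, of which about 40 were classified correctly, against a chance level of roughly 2%.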
## Model Card Authors
STEM.AI: [email protected]\
[William Harbec](https://www.linkedin.com/in/william-harbec-56a262248/)