STEM-AI-mtl committed on
Update README.md
README.md CHANGED
@@ -1,15 +1,17 @@
 ---
 license: other
+license_name: stem.ai.mtl
+license_link: LICENSE
 tags:
 - vision
 - image-classification
 - STEM-AI-mtl/City_map
 - Google
 - ViT
+- STEM-AI-mtl
 datasets:
 - STEM-AI-mtl/City_map
 
-license_link: LICENSE
 widget:
 - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
   example_title: Tiger
@@ -19,11 +21,9 @@ widget:
   example_title: Palace
 ---
 
-#
-
-Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
+# The fine-tuned ViT model that beats [Google's base model](https://huggingface.co/google/vit-base-patch16-224)
 
-
+Image-classification model that identifies which city map is illustrated in an input image.
 
 ## Model description
 
@@ -33,10 +33,6 @@ Images are presented to the model as a sequence of fixed-size patches (resolutio
 
 By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
 
-## Intended uses & limitations
-
-You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=google/vit) to look for
-fine-tuned versions on a task that interests you.
 
 ### How to use
 
@@ -47,16 +43,16 @@ from transformers import ViTImageProcessor, ViTForImageClassification
 from PIL import Image
 import requests
 
-url = '
+url = 'https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg'
 image = Image.open(requests.get(url, stream=True).raw)
 
-processor = ViTImageProcessor.from_pretrained('
-model = ViTForImageClassification.from_pretrained('
+processor = ViTImageProcessor.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')
+model = ViTForImageClassification.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')
 
 inputs = processor(images=image, return_tensors="pt")
 outputs = model(**inputs)
 logits = outputs.logits
 
 predicted_class_idx = logits.argmax(-1).item()
 print("Predicted class:", model.config.id2label[predicted_class_idx])
 ```
@@ -65,7 +61,7 @@ For more code examples, we refer to the [documentation](https://huggingface.co/t
 
 ## Training data
 
-This ViT model was fine-tuned on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), contaning
+This model was fine-tuned from [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), which contains over 600 images of 45 different maps of cities around the world.
 
 ## Training procedure
 
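The model description above explains that a classifier is trained by placing a linear layer on top of the [CLS] token of the pre-trained encoder. A minimal sketch of that idea only, where the ImageNet-21k backbone checkpoint, the 45-class head size, and the example image are illustrative assumptions rather than anything taken from this commit:

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Illustrative choices: an ImageNet-21k pre-trained backbone and a 45-class head (one per city map).
backbone = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
classifier_head = torch.nn.Linear(backbone.config.hidden_size, 45)

url = 'https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
outputs = backbone(**inputs)
cls_state = outputs.last_hidden_state[:, 0]  # last hidden state of the [CLS] token
logits = classifier_head(cls_state)          # one logit per class; this head is what gets trained on labeled images
```

In practice, `ViTForImageClassification` bundles exactly this linear head, which is what the "How to use" snippet in the diff relies on.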
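The "How to use" snippet in the diff drives the processor and model by hand. The same checkpoint can also be called through the `transformers` pipeline API; a brief sketch, assuming the checkpoint id from the snippet above, with the `top_k` value and the reused example image URL only as illustrations:

```python
from transformers import pipeline

# Wrap the fine-tuned checkpoint referenced in the card in an image-classification pipeline.
classifier = pipeline("image-classification", model="STEM-AI-mtl/City_map-vit-base-patch16-224")

# The pipeline accepts a URL, a local path, or a PIL image; this reuses the card's example image.
predictions = classifier(
    "https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg",
    top_k=3,
)
for p in predictions:
    print(p["label"], round(p["score"], 3))
```

`top_k` only caps how many labels are returned; the predicted class is the same as in the manual snippet.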
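The training-procedure section itself is not shown in this excerpt. Purely as an illustration of how a ViT fine-tune on an image-classification dataset such as City_map is commonly set up with `transformers` and `datasets`, where the column names `image` and `label`, the ClassLabel assumption, and every hyperparameter below are hypothetical and not taken from this repository:

```python
import torch
from datasets import load_dataset
from transformers import (ViTImageProcessor, ViTForImageClassification,
                          TrainingArguments, Trainer)

# Assumption: the dataset exposes 'image' (PIL) and 'label' (ClassLabel) columns.
dataset = load_dataset("STEM-AI-mtl/City_map")
labels = dataset["train"].features["label"].names

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # replace the 1,000-class ImageNet head with a fresh one
)

def transform(batch):
    # Turn PIL images into pixel values; keep the integer class labels.
    inputs = processor(images=[img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate(examples):
    # Stack per-example tensors into a training batch.
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

encoded = dataset.with_transform(transform)

# Hypothetical hyperparameters, for illustration only.
args = TrainingArguments(output_dir="city-map-vit", per_device_train_batch_size=16,
                         num_train_epochs=3, learning_rate=2e-4, remove_unused_columns=False)
trainer = Trainer(model=model, args=args, data_collator=collate,
                  train_dataset=encoded["train"])
trainer.train()
```

The actual recipe for this checkpoint is whatever the card's "Training procedure" section documents; the sketch above only illustrates the standard Trainer setup for this kind of fine-tune.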