---
license: gpl-3.0
tags:
- ui-automation
- automation
- agents
- llm-agents
- vision
---

# Model card for PTA-Text - A *Text Only* Click Model


# Table of Contents

0. [TL;DR](#tldr)
1. [Installation](#installation)
2. [Running the model](#running-the-model)
3. [Contribution](#contribution)
4. [Citation](#citation)

# TL;DR

## Details for PTA-Text:
- __Input__: An image with a header containing the desired UI click command.

- __Output__: An [x, y] click coordinate, given as relative coordinates in the 0-1 range.

__PTA-Text__ is an image encoder based on Matcha, which is itself an extension of Pix2Struct.
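Because the output is relative, mapping a prediction onto a concrete screenshot only requires scaling by the image dimensions. A minimal sketch of that conversion (the `to_pixels` helper is illustrative, not part of any library):

```python
# Minimal sketch: convert the model's relative [x, y] output to pixel
# coordinates. `to_pixels` is an illustrative helper, not an
# askui-ml-helper API.
from PIL import Image

def to_pixels(rel_xy, image: Image.Image) -> tuple[int, int]:
    """Scale relative coordinates (0-1 range) to pixel coordinates."""
    x_rel, y_rel = rel_xy
    width, height = image.size
    return round(x_rel * width), round(y_rel * height)

# Example: on a 1920x1080 screenshot, [0.5, 0.5] maps to (960, 540).
```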

# Installation

```bash
pip install askui-ml-helper
```

Download the `.pt` checkpoint from the files in this model card, or fetch it from your terminal:
```bash
curl -L "https://huggingface.co/AskUI/pta-text-0.1/resolve/main/pta-text-v0.1.1.pt?download=true" -o pta-text-v0.1.1.pt
```
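If you prefer staying in Python, the same checkpoint can be fetched with `huggingface_hub` (an optional alternative to `curl`, assuming the package is installed):

```python
# Optional: download the checkpoint with huggingface_hub instead of curl.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="AskUI/pta-text-0.1",
    filename="pta-text-v0.1.1.pt",
)
print(checkpoint_path)  # local path to the cached .pt file
```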

## Running the model

### Get the annotated image

You can run the model in full precision on CPU:
```python
import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

# Load the model from the downloaded checkpoint
pta_text_inference = PtaTextInference("pta-text-v0.1.1.pt")

# Fetch a sample screenshot and phrase the click instruction
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

render_image = pta_text_inference.process_image_and_draw_circle(image, prompt, radius=15)
render_image.show()
# Opens the image with a red dot at the predicted click position
```

![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F5f993a63777efc07d7f1e2ce%2FZNwjdENJqn-1VpXDcm_Wg.png%3C%2Fspan%3E)
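In a headless environment where `Image.show()` cannot open a viewer, saving the annotated image is a simple alternative (plain Pillow usage, continuing the snippet above; the filename is illustrative):

```python
# Persist the annotated screenshot instead of opening a viewer window.
render_image.save("annotated_click.png")
```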

### Get the coordinates

```python
import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

pta_text_inference = PtaTextInference("pta-text-v0.1.1.pt")
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

coordinates = pta_text_inference.process_image(image, prompt)
print(coordinates)
# [0.3981265723705292, 0.13768285512924194]
```
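The relative output can drive an actual mouse click once scaled to the screen. A minimal sketch continuing the example above, assuming the screenshot spans the full screen and that the third-party `pyautogui` package is installed (it is not a dependency of this model):

```python
# Sketch: scale the relative [x, y] prediction to the screen and click.
# pyautogui is an assumed third-party dependency (pip install pyautogui).
import pyautogui

x_rel, y_rel = coordinates        # result of process_image(...) above
width, height = pyautogui.size()  # current screen resolution in pixels
pyautogui.click(round(x_rel * width), round(y_rel * height))
```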

# Contribution

This model is part of AskUI's open-source initiative. It was contributed and added to the Hugging Face ecosystem by [Murali Manohar @ AskUI](https://huggingface.co/gitlost-murali).

# Citation

TODO