zR committed on
Commit
0de2cad
·
1 Parent(s): d472c0c
Files changed (2)
  1. README.md +17 -105
  2. README_zh.md +19 -114
README.md CHANGED
@@ -43,124 +43,36 @@ Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific exa
43
  conversations but does support continuous execution history. Below are guidelines on how users should format their input
44
  for the model and interpret the formatted output.
45
 
46
- ### User Input
47
-
48
- 1. **`task` field**
49
- A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide
50
- the `CogAgent-9B-20241220` model to complete the task.
51
-
52
- 2. **`platform` field**
53
- `CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
54
- - **Windows**: Use the `WIN` field for Windows 10 or 11.
55
- - **Mac**: Use the `MAC` field for Mac 14 or 15.
56
- - **Mobile**: Use the `Mobile` field for Android 13, 14, 15, or similar Android-based UI versions.
57
-
58
- If using other systems, results may vary. Use the `Mobile` field for mobile devices, `WIN` for Windows, and `MAC` for
59
- Mac.
60
-
61
- 3. **`format` field**
62
- Specifies the desired format of the returned data. Options include:
63
- - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding
64
- operations, and sensitivity levels.
65
- - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
66
- - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and
67
- sensitivity levels.
68
- - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
69
- - `Answer in Action-Operation format.`: Returns actions and corresponding operations.
70
-
71
- 4. **`history` field**
72
- The input should be concatenated in the following order:
73
- ```
74
- query = f'{task}{history}{platform}{format}'
75
- ```
76
-
77
- ### Model Output
78
-
79
- 1. **Sensitive Operations**: Includes types like `<<Sensitive Operation>>` or `<<General Operation>>`, returned only
80
- when `Sensitive` is requested.
81
- 2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned based on the
82
- requested format.
83
- 3. **General Responses**: Summarizes the output before formatting.
84
- 4. **`Grounded Operation` field**: Describes the model's specific actions, such as coordinates, element types, and
85
- descriptions. Actions include:
86
- - `CLICK`: Simulates mouse clicks or touch gestures.
87
- - `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).
88
-
89
- ### Example
90
-
91
- If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the
92
- prompt should be:
93
 
94
- ```
95
- Task: Mark all emails as read
96
- (Platform: Mac)
97
- (Answer in Action-Operation-Sensitive format.)
98
- ```
99
-
100
- Below are examples of model responses based on different requested formats:
101
-
102
- <details>
103
- <summary>Answer in Action-Operation-Sensitive format</summary>
104
-
105
- ```
106
- Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
107
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
108
- <<General Operation>>
109
- ```
110
-
111
- </details>
112
-
113
- <details>
114
- <summary>Answer in Status-Plan-Action-Operation format</summary>
115
 
116
- ```
117
- Status: None
118
- Plan: None
119
- Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
120
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
121
- ```
122
 
123
- </details>
 
 
124
 
125
- <details>
126
- <summary>Answer in Status-Action-Operation-Sensitive format</summary>
127
 
128
  ```
129
- Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
130
- Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
131
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
132
- <<General Operation>>
133
- ```
134
-
135
- </details>
136
-
137
- <details>
138
- <summary>Answer in Status-Action-Operation format</summary>
139
 
 
140
  ```
141
- Status: None
142
- Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
143
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
144
  ```
145
-
146
- </details>
147
-
148
- <details>
149
- <summary>Answer in Action-Operation format</summary>
150
 
151
  ```
152
- Action: Right-click on the first email in the left-side list to open the action menu.
153
- Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
154
  ```
155
 
156
- </details>
157
-
158
- ### Notes
159
-
160
- 1. This model is not a conversational model and does not support continuous dialogue. Please send specific instructions
161
- and use the suggested concatenation method.
162
- 2. Images must be provided as input; textual prompts alone cannot execute GUI agent tasks.
163
- 3. The model outputs strictly formatted STR data and does not support JSON format.
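Because the output is plain text rather than JSON, a caller has to split it into fields. The sketch below is one possible way to do that, not the project's parser: `parse_response` and `FIELD_NAMES` are hypothetical names, the field labels are taken from the examples above, and each field is assumed to fit on a single line.

```python
from typing import Dict, List

# Field labels as they appear in the README examples (assumed to be complete).
FIELD_NAMES = ("Status", "Plan", "Action", "Grounded Operation")


def parse_response(text: str) -> Dict[str, str]:
    """Split a formatted model response into a dict of named fields."""
    result: Dict[str, str] = {}
    general: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Sensitivity markers look like <<General Operation>>.
        if line.startswith("<<") and line.endswith(">>"):
            result["Sensitive"] = line[2:-2]
            continue
        # Split on the first colon only; values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep and key in FIELD_NAMES:
            result[key] = value.strip()
        else:
            # Anything unlabeled is kept as the free-form general response.
            general.append(line)
    if general:
        result["General"] = "\n".join(general)
    return result


sample = (
    'Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.\n'
    "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')\n"
    "<<General Operation>>\n"
)
parsed = parse_response(sample)
print(parsed["Grounded Operation"])
```

Fields that were not requested are simply absent from the returned dict, so callers can check for a key with `in` before using it.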
164
 
165
  ## Previous Work
166
 
 
43
  conversations but does support continuous execution history. Below are guidelines on how users should format their input
44
  for the model and interpret the formatted output.
45
 
46
+ ## Run the Model
47
 
48
+ <p>Please visit our <a href="https://github.com/THUDM/CogAgent">GitHub</a> repository for runnable examples and for the section on prompt concatenation <strong style="color: red;">(this directly affects whether the model runs correctly)</strong>.</p>
49
 
50
+ In particular, pay attention to the prompt concatenation process.
51
+ See [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115) for how the user input prompt is concatenated.
52
+ ``` python
53
+ current_platform = identify_os()  # "Mac", "WIN", or "Mobile" (case-sensitive)
54
+ platform_str = f"(Platform: {current_platform})\n"
55
+ format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # Any other supported format can replace "Action-Operation-Sensitive"
56
 
57
+ history_str = "\nHistory steps: "
58
+ for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
59
+     history_str += f"\n{index}. {grounded_op_func}\t{action}"  # Step numbering starts at 0
60
 
61
+ query = f"Task: {task}{history_str}\n{platform_str}{format_str}" # Be careful about the \n
 
62
 
63
  ```
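The concatenation above can be wrapped in a small helper. The following is a minimal sketch under stated assumptions, not the repository's implementation: `build_query`, `ALLOWED_PLATFORMS`, and `ALLOWED_FORMATS` are hypothetical names, and the accepted platform and format values are taken from the README text shown in this diff.

```python
from typing import Sequence

# Values taken from the README text; treat both sets as assumptions.
ALLOWED_PLATFORMS = {"Mac", "WIN", "Mobile"}
ALLOWED_FORMATS = {
    "Action-Operation-Sensitive",
    "Status-Plan-Action-Operation",
    "Status-Action-Operation-Sensitive",
    "Status-Action-Operation",
    "Action-Operation",
}


def build_query(
    task: str,
    platform: str,
    answer_format: str = "Action-Operation-Sensitive",
    history_grounded_op_funcs: Sequence[str] = (),
    history_actions: Sequence[str] = (),
) -> str:
    """Hypothetical helper that mirrors the concatenation shown above."""
    if platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    if answer_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {answer_format!r}")
    if len(history_grounded_op_funcs) != len(history_actions):
        raise ValueError("history lists must have the same length")

    # As in the snippet above, the header is emitted even with no history.
    history_str = "\nHistory steps: "
    for index, (op, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
        # Steps are numbered from 0; a tab separates operation and action.
        history_str += f"\n{index}. {op}\t{action}"

    platform_str = f"(Platform: {platform})\n"
    format_str = f"(Answer in {answer_format} format.)\n"
    return f"Task: {task}{history_str}\n{platform_str}{format_str}"


query = build_query(
    task="Mark all emails as read",
    platform="Mac",
    history_grounded_op_funcs=["CLICK(box=[[352,102,786,139]], element_info='Search')"],
    history_actions=["Left click on the search box."],
)
print(repr(query))
```

Validating the platform and format up front is a design choice of this sketch; it surfaces casing mistakes such as `"MAC"` or `"win"` before the prompt reaches the model.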
64
 
65
+ A minimal example of a concatenated user input is as follows:
66
  ```
67
+ "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
 
 
68
  ```
69
+ The concatenated Python string will look like:
 
 
 
 
70
 
71
  ```
72
+ "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
 
73
  ```
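Each history step in the string above pairs a grounded operation with its action description, separated by a tab. The sketch below shows one way to take such a step apart again. It assumes that operation strings are always valid Python call syntax with keyword arguments only, which holds for the examples in this README but is not stated by the project; `parse_grounded_operation` and `parse_history_step` are hypothetical names.

```python
import ast
from typing import Any, Dict, Tuple


def parse_grounded_operation(text: str) -> Dict[str, Any]:
    """Parse e.g. "CLICK(box=[[1,2,3,4]], element_info='Search')"."""
    # The operation strings happen to be valid Python call syntax, so the
    # standard-library parser can split them without evaluating anything.
    call = ast.parse(text.strip(), mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not an operation call: {text!r}")
    if call.args or any(kw.arg is None for kw in call.keywords):
        raise ValueError(f"only keyword arguments are expected: {text!r}")
    # literal_eval accepts AST nodes and only allows plain literals.
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"name": call.func.id, "args": args}


def parse_history_step(line: str) -> Tuple[int, Dict[str, Any], str]:
    """Split "<index>. <operation>\\t<action>" into its three parts."""
    index, rest = line.split(". ", 1)
    operation, action = rest.split("\t", 1)
    return int(index), parse_grounded_operation(operation), action


step = (
    "3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')"
    "\tScroll down the page to see the available doors."
)
index, operation, action = parse_history_step(step)
print(index, operation["name"], operation["args"]["box"], action)
```

Using `ast` instead of a hand-written regular expression keeps quoted values such as `element_info='Doors on Sale'` intact even when they contain commas or parentheses.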
74
 
75
+ Because the full specification is long, refer to the [GitHub](https://github.com/THUDM/CogAgent) repository for the meaning and representation of each field.
76
 
77
  ## Previous Work
78
 
README_zh.md CHANGED
@@ -14,137 +14,42 @@
14
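  bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant gains in GUI
15
  perception, reasoning and prediction accuracy, action-space completeness, and task generality and generalization, and it accepts screenshots and language interaction in both Chinese and English.
16
  This version of the CogAgent model is already used in Zhipu AI's [GLM-PC product](https://cogagent.aminer.cn/home).
17
- We hope this release helps academic researchers and developers advance the research and application of GUI agents based on vision-language models.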
18
 
19
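  ## Run the Model
20
 
21
- Please visit our [GitHub](https://github.com/THUDM/CogAgent) repository for specific running examples.
22
 
23
- ## Input and Output
24
 
25
- `cogagent-9b-20241220` is an agent-type execution model rather than a conversational model. It does not support continuous conversations, but it does support a continuous execution history.
26
- This section shows how users should format their input for the model and interpret the formatted output.
27
 
28
- ### User Input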
 
 
29
 
30
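- 1. `task` field
31
-
32
- A task description provided by the user, similar to a textual prompt. It guides the CogAgent1.5 model in completing the user's task. Keep it concise and clear.
33
-
34
- 2. `platform` field
35
-
36
- CogAgent1.5 can act as an agent on several platforms. Three operating systems with graphical interfaces are supported:
37
- - Windows 10 and 11: use the `WIN` field.
38
- - Mac 14 and 15: use the `MAC` field.
39
- - Android 13, 14, 15, and other Android-based UI distributions whose GUI and interaction model are nearly identical: use the `Mobile` field.
40
-
41
- If you use another system, results may be poor, but you can try the `Mobile` field for mobile devices, the `WIN` field for Windows devices, and the `MAC`
42
- field for Mac devices.
43
-
44
- 3. `format` field
45
-
46
- The format in which the user wants CogAgent1.5 to return data. The options are:
47
- - `Answer in Action-Operation-Sensitive format.`: The default in this repository's demo. Returns the model's action, the corresponding operation, and the sensitivity level.
48
- - `Answer in Status-Plan-Action-Operation format.`: Returns the model's status, plan, action, and the corresponding operation.
49
- - `Answer in Status-Action-Operation-Sensitive format.`: Returns the model's status, action, the corresponding operation, and the sensitivity level.
50
- - `Answer in Status-Action-Operation format.`: Returns the model's status, action, and the corresponding operation.
51
- - `Answer in Action-Operation format.`: Returns the model's action and the corresponding operation.
52
-
53
- 4. `history` field
54
-
55
- The fields should be concatenated in the following order: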
56
- ```
57
- query = f'{task}{history}{platform}{format}'
58
- ```
59
-
60
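- ### Model Output
61
-
62
- 1. Sensitive operations: one of `<<敏感操作>>` (sensitive operation) or `<<一般操作>>` (general operation), returned only when `Sensitive` is requested.
63
- 2. `Plan`, `Agent`, `Status`, `Action` fields: describe the model's behavior and operations. Each is returned only when the requested format includes it; for example, a format containing `Action` returns
64
- the content of the `Action` field.
65
- 3. General response: appears before the formatted answer and serves as a summary.
66
- 4. `Grounded Operation` field:
67
- Describes the model's specific operation, including its position, type, and content. `box` gives the coordinates of the target region, `element_type` gives the type of the target element, and
68
- `element_info` gives a description of the target element. These values are wrapped in an operation instruction. The instructions include:
69
- - `CLICK`: a click, simulating a mouse click or a finger touch.
70
- - `LONGPRESS`: a long press, supported only in `Mobile` mode.
71
-
72
- ### Example
73
-
74
- Suppose the user wants all emails marked as read, is using a Mac system, and requests the Action-Operation-Sensitive format.
75
- The correctly concatenated prompt should be: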
76
 
77
  ```
78
- Task: 帮我将所有的邮件标注为已读
79
- (Platform: Mac)
80
- (Answer in Action-Operation-Sensitive format.)
81
- ```
82
-
83
- Below are examples of the results returned for the different requested formats:
84
 
85
-
86
- <details>
87
- <summary>Answer in Action-Operation-Sensitive format</summary>
88
 
89
  ```
90
- Action: 点击页面顶部工具栏中的“全部标为已读”按钮,将所有邮件标记为已读。
91
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
92
- <<一般操作>>
93
  ```
94
 
95
- </details>
96
-
97
- <details>
98
- <summary>Answer in Status-Plan-Action-Operation format</summary>
99
 
 
 
100
  ```
101
- Status: None
102
- Plan: None.
103
- Action: 点击收件箱页面顶部中间的“全部标记为已读”按钮,将所有邮件标记为已读。
104
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
105
- ```
106
-
107
- </details>
108
-
109
- <details>
110
- <summary>Answer in Status-Action-Operation-Sensitive format</summary>
111
-
112
- ```
113
- Status: 当前处于邮箱界面[[0, 2, 998, 905]],左侧是邮箱分类[[1, 216, 144, 570]],中间是收件箱[[144, 216, 998, 903]],已经点击“全部标为已读”按钮[[223, 178, 311, 210]]。
114
- Action: 点击页面顶部工具栏中的“全部标为已读”按钮,将所有邮件标记为已读。
115
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
116
- <<一般操作>>
117
- ```
118
-
119
- </details>
120
-
121
- <details>
122
- <summary>Answer in Status-Action-Operation format</summary>
123
-
124
- ```
125
- Status: None
126
- Action: 在收件箱页面顶部,点击“全部标记为已读”按钮,将所有邮件标记为已读。
127
- Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='可点击文本', element_info='全部标为已读')
128
- ```
129
-
130
- </details>
131
-
132
- <details>
133
- <summary>Answer in Action-Operation format</summary>
134
-
135
- ```
136
- Action: 在左侧邮件列表中,右键单击第一封邮件,以打开操作菜单。
137
- Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
138
- ```
139
-
140
- </details>
141
-
142
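- ### Notes
143
-
144
- 1. This model is not a conversational model and does not support continuous dialogue. Send specific instructions and concatenate the history as described above.
145
- 2. An image must be provided as input; text-only conversation cannot accomplish GUI agent tasks.
146
- 3. The model's output follows a strict format, so parse it exactly as specified. The output is a plain string (STR); JSON output is not supported.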
147
 
 
148
 
149
  ## Previous Work
150
 
 
14
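  bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant gains in GUI
15
  perception, reasoning and prediction accuracy, action-space completeness, and task generality and generalization, and it accepts screenshots and language interaction in both Chinese and English.
16
  This version of the CogAgent model is already used in Zhipu AI's [GLM-PC product](https://cogagent.aminer.cn/home).
17
+ We hope this release helps academic researchers and developers advance the research and application of GUI agents based on vision-language models.
18
 
19
  ## Run the Model
20
 
21
+ <p>Please visit our <a href="https://github.com/THUDM/CogAgent">GitHub</a> repository for runnable examples and for the section on prompt concatenation <strong style="color: red;">(this directly affects whether the model runs correctly)</strong>.</p>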
22
 
23
+ In particular, pay attention to the prompt concatenation process.
24
+ You can refer to [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115)
25
+ for how the user input prompt is concatenated.
26
+ ``` python
27
 
28
+ current_platform = identify_os()  # "Mac", "WIN", or "Mobile" (case-sensitive)
29
+ platform_str = f"(Platform: {current_platform})\n"
30
+ format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # Any other supported format can replace "Action-Operation-Sensitive"
31
 
32
+ history_str = "\nHistory steps: "
33
+ for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
34
+     history_str += f"\n{index}. {grounded_op_func}\t{action}"  # Step numbering starts at 0
35
 
36
+ query = f"Task: {task}{history_str}\n{platform_str}{format_str}" # Be careful about the \n
37
 
38
  ```
39
 
40
+ A minimal example of a concatenated user input is as follows:
 
 
41
 
42
  ```
43
+ "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
 
 
44
  ```
45
 
46
+ The concatenated Python string looks like:
 
 
 
47
 
48
+ ``` python
49
+ "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"
50
  ```
51
 
52
+ Because the full specification is long, refer to the [GitHub](https://github.com/THUDM/CogAgent) repository for the meaning and representation of each field.
53
 
54
  ## Previous Work
55