--- license: other language: - zh - en base_model: - THUDM/glm-4v-9b pipeline_tag: image-text-to-text library_name: transformers --- # CogAgent

🌐 Github | πŸ€— Huggingface Space | πŸ“„ Technical Report | πŸ“œ arxiv paper

[δΈ­ζ–‡ι˜…θ―»](README_zh.md) ## About the Model The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both screenshots and language input. This version of the CogAgent model has already been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers in advancing the research and applications of GUI agents based on vision-language models. ## Running the Model Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model. ## Input and Output `cogagent-9b-20241220` is an agent execution model rather than a conversational model. It does not support continuous conversations but does support continuous execution history. Below are guidelines on how users should format their input for the model and interpret the formatted output. ## Run the Model

Please visit our github for specific running examples, as well as the part for prompt concatenation (this directly affects whether the model runs correctly).

In particular, pay attention to the prompt concatenation process. You can refer to [app/client.py#L115](https://github.com/THUDM/CogAgent/blob/e3ca6f4dc94118d3dfb749f195cbb800ee4543ce/app/client.py#L115) for concatenating user input prompts. ``` python current_platform = identify_os() # "Mac" or "WIN" or "Mobile"οΌŒζ³¨ζ„ε€§ε°ε†™ platform_str = f"(Platform: {current_platform})\n" format_str = "(Answer in Action-Operation-Sensitive format.)\n" # You can use other format to replace "Action-Operation-Sensitive" history_str = "\nHistory steps: " for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)): history_str += f"\n{index}. {grounded_op_func}\t{action}" # start from 0. query = f"Task: {task}{history_str}\n{platform_str}{format_str}" # Be careful about the \n ``` A minimal user input concatenation code is as follows: ``` "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n" ``` The concatenated Python string will look like: ``` "Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n" ``` Due to the length, if you would like to understand the meaning and representation of each field in detail, please refer to the [GitHub](https://github.com/THUDM/CogAgent). ## Previous Work In November 2023, we released the first generation of CogAgent. You can find related code and weights in the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).

CogVLM

πŸ“– Paper: CogVLM: Visual Expert for Pretrained Language Models

CogVLM is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.

CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC benchmarks.

CogAgent

πŸ“– Paper: CogAgent: A Visual Language Model for GUI Agents

CogAgent is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.

CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQ, TextVQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE benchmarks. It significantly outperformed existing models on GUI operation datasets like AITW and Mind2Web.

## License Please follow the [Model License](LICENSE) for using the model weights.