
UGround (The Initial LLaVA-based Version)

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details.

Models

Release Plan

  • Model Weights
    • Initial V1 (the one used in the paper)
    • Qwen2-VL-based V1
      • 2B
      • 7B
      • 72B
    • V1.1
  • Code
    • Inference Code of UGround
    • Offline Experiments
      • Screenspot (along with referring expressions generated by GPT-4/4o)
      • Multimodal-Mind2Web
      • OmniAct
      • Android Control
    • Online Experiments
      • Mind2Web-Live-SeeAct-V
      • AndroidWorld-SeeAct-V
  • Data-V1
    • Data Examples
    • Data Construction Scripts
    • Guidance of Open-source Data
  • Data-V1.1
  • Online Demo (HF Spaces)

Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding | Model Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
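
The Avg column is not defined on this card. For the rows checked below, it is consistent with an unweighted mean of the six split scores, rounded to one decimal. The following is a small sketch of that check, not an official evaluation script; the helper name `macro_average` is made up for illustration, and the numbers are copied from the Standard Setting table.

```python
from statistics import mean

# Split order: Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
# Each entry maps a model name to (split scores, Avg as reported in the table).
rows = {
    "GPT-4": ([22.6, 24.5, 20.2, 11.8, 9.2, 8.8], 16.2),
    "SeeClick": ([78.0, 52.0, 72.2, 30.0, 55.7, 32.5], 53.4),
    "UGround-V1": ([82.8, 60.3, 82.5, 63.6, 80.4, 70.4], 73.3),
    "UGround-V1-7B (Qwen2-VL)": ([93.0, 79.9, 93.8, 76.4, 90.9, 84.0], 86.3),
}


def macro_average(scores):
    """Unweighted mean over the splits (an assumption about how Avg is derived)."""
    return round(mean(scores), 1)


for name, (scores, reported) in rows.items():
    # Print the recomputed mean next to the reported Avg for comparison.
    print(f"{name}: recomputed={macro_average(scores)} reported={reported}")
```

Because the mean is unweighted, each split counts equally regardless of how many examples it contains.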

GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding | Model Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
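
The scores in both ScreenSpot tables are percentages. This card does not spell out the scoring rule, so the sketch below assumes the click-accuracy convention commonly used for ScreenSpot: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The names `is_hit` and `click_accuracy` and the sample data are hypothetical and are not taken from the UGround code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) in pixels
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def is_hit(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


# Toy data, invented for illustration only.
predicted = [(50.0, 20.0), (300.0, 410.0), (10.0, 10.0)]
targets = [(40, 10, 120, 30), (280, 400, 360, 430), (100, 100, 200, 150)]
print(round(click_accuracy(predicted, targets), 1))
```

Under this assumption a prediction is scored by point containment only, so any click inside the element counts equally, whether near its center or its edge.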


Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }
Model size: 7.06B params (Safetensors). Tensor type: FP16.