Results 291 to 300 of about 769,446
Some of the following articles may not be open access.

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

International Conference on Learning Representations
Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms.
Boyu Gou   +7 more
semanticscholar   +1 more source

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

arXiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance ...
Zhiyong Wu   +10 more
semanticscholar   +1 more source

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

arXiv.org
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results.
Yuhang Liu   +7 more
semanticscholar   +1 more source

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

arXiv.org
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through ...
Haoming Wang   +105 more
semanticscholar   +1 more source

Mobile-Agent-v3: Fundamental Agents for GUI Automation

arXiv.org
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning ...
Jiabo Ye   +14 more
semanticscholar   +1 more source

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

arXiv.org, 2023
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions.
An Yan   +11 more
semanticscholar   +1 more source

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

arXiv.org
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones.
Yuhang Liu   +9 more
semanticscholar   +1 more source

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

International Conference on Machine Learning
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities.
Yiheng Xu   +8 more
semanticscholar   +1 more source

GTA1: GUI Test-time Scaling Agent

arXiv.org
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment.
Yan Yang   +13 more
semanticscholar   +1 more source

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

arXiv.org
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans.
Qianhui Wu   +17 more
semanticscholar   +1 more source
