Results 291 to 300 of about 769,446
Some of the following articles may not be open access.
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
International Conference on Learning Representations
Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms.
Boyu Gou +7 more
semanticscholar +1 more source
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
arXiv.org
Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance ...
Zhiyong Wu +10 more
semanticscholar +1 more source
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
arXiv.org
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results.
Yuhang Liu +7 more
semanticscholar +1 more source
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
arXiv.org
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through ...
Haoming Wang +105 more
semanticscholar +1 more source
Mobile-Agent-v3: Fundamental Agents for GUI Automation
arXiv.org
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning ...
Jiabo Ye +14 more
semanticscholar +1 more source
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
arXiv.org, 2023
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions.
An Yan +11 more
semanticscholar +1 more source
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
arXiv.org
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones.
Yuhang Liu +9 more
semanticscholar +1 more source
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
International Conference on Machine Learning
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities.
Yiheng Xu +8 more
semanticscholar +1 more source
GTA1: GUI Test-time Scaling Agent
arXiv.org
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment.
Yan Yang +13 more
semanticscholar +1 more source
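The GTA1 entry above describes the perception-action loop common to these agents: observe the screen, propose the next action from the instruction and the current state, execute it, and repeat as the environment evolves. Below is a minimal, generic sketch of that loop in Python; every name in it (Action, env.observe, env.execute, policy.propose, max_steps) is a hypothetical placeholder for illustration, not GTA1's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                                 # e.g. "click", "type", or "done"
    target: tuple[int, int] | None = None     # screen coordinates, if any
    text: str | None = None                   # text payload for "type" actions

def run_agent(instruction: str, env, policy, max_steps: int = 30) -> bool:
    """Decompose an instruction into actions, one observe-propose-execute step at a time."""
    for _ in range(max_steps):
        screenshot = env.observe()                        # perceive the current screen
        action = policy.propose(instruction, screenshot)  # propose the next action
        if action.kind == "done":                         # policy signals task completion
            return True
        env.execute(action)                               # act; the environment evolves
    return False                                          # step budget exhausted
```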
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
arXiv.org
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans.
Qianhui Wu +17 more
semanticscholar +1 more source
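Several entries above (UGround, GUI-Actor) center on visual grounding: mapping a textual plan step to the screen region it refers to. As a rough intuition only, grounding can be framed as scoring candidate regions against the text and picking the best match; the sketch below does exactly that with assumed embedding functions (embed_text, embed_region). This is a generic region-ranking stand-in, not GUI-Actor's actual mechanism, which the paper frames as coordinate-free.

```python
import numpy as np

def ground(plan_step: str, regions: list[np.ndarray],
           embed_text, embed_region) -> int:
    """Return the index of the candidate region that best matches the plan step."""
    t = embed_text(plan_step)                  # (d,) text embedding of the plan step
    t = t / np.linalg.norm(t)                  # normalize for cosine similarity
    scores = []
    for crop in regions:
        v = embed_region(crop)                 # (d,) visual embedding of the crop
        scores.append(float(v @ t) / float(np.linalg.norm(v)))  # cosine score
    return int(np.argmax(scores))              # highest-similarity region wins
```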

