Flash News

Microsoft Open-Sources Phi-Ground Model Family, Targeting AI's 'Where to Click' Problem

Microsoft has open-sourced the Phi-Ground model series, built for AI agents that operate computers. Given a screenshot and an instruction, the model directly outputs the coordinates to click.

The open-source version has 4 billion parameters. Paired with a large model for instruction planning, it achieved higher click accuracy than OpenAI Operator and Claude Computer Use on the Showdown benchmark, and it ranked first among models under 10 billion parameters on all five benchmarks evaluated, including ScreenSpot-Pro.
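The article does not say how click accuracy is scored. A common convention in GUI grounding evaluation is to count a click as correct when it lands inside the target element's bounding box; the minimal sketch below assumes that convention. The names click_hits and click_accuracy are hypothetical and are not taken from any benchmark's code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def click_hits(point: Point, box: Box) -> bool:
    """Return True if the predicted click lies inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted clicks that land inside their target boxes."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(click_hits(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)
```

Under this assumed rule, a small target box leaves little room for error, which is one reason small elements on high-resolution screens are hard.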

Key training findings: coordinates are treated as ordinary numeric output, the text instruction is placed before the image in the input, and DPO (Direct Preference Optimization) reinforcement learning is used to improve accuracy on visual tasks. To handle small elements on 4K high-resolution screens, training simulated them by shrinking screenshots and placing them on a white canvas, which produced significant gains in complex software such as Photoshop.
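The white-canvas idea can be illustrated with simple coordinate arithmetic. This is a sketch under stated assumptions, not Microsoft's training code: it assumes the shrunken screenshot is centered on the canvas, and the names Placement, place_on_canvas, to_canvas, and to_screen are hypothetical. The point is that every target label must be moved by the same scale and offset as the pixels.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Placement:
    scale: float   # shrink factor applied to the screenshot, 0 < scale <= 1
    offset_x: int  # left edge of the pasted screenshot on the canvas
    offset_y: int  # top edge of the pasted screenshot on the canvas


def place_on_canvas(shot_size: Tuple[int, int],
                    canvas_size: Tuple[int, int],
                    scale: float) -> Placement:
    """Center a shrunken screenshot on a canvas (centering is an assumption)."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    scaled_w = round(shot_size[0] * scale)
    scaled_h = round(shot_size[1] * scale)
    if scaled_w > canvas_size[0] or scaled_h > canvas_size[1]:
        raise ValueError("scaled screenshot does not fit on the canvas")
    return Placement(scale,
                     (canvas_size[0] - scaled_w) // 2,
                     (canvas_size[1] - scaled_h) // 2)


def to_canvas(point: Tuple[float, float], p: Placement) -> Tuple[float, float]:
    """Map a click target from screenshot pixels to canvas pixels."""
    return (p.offset_x + point[0] * p.scale, p.offset_y + point[1] * p.scale)


def to_screen(point: Tuple[float, float], p: Placement) -> Tuple[float, float]:
    """Map a predicted canvas click back to original screenshot pixels."""
    return ((point[0] - p.offset_x) / p.scale, (point[1] - p.offset_y) / p.scale)
```

For example, a 1920x1080 screenshot shrunk by 0.5 and centered on a 3840x2160 canvas occupies 960x540 pixels, so its buttons appear as small as they would on a 4K display.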

Source: Public Information

ABAB AI Insight

Microsoft's Phi team previously released the Phi-3 and Phi-4 series. Phi-Ground focuses on GUI grounding and was validated on more than 40 million data samples. It breaks with the previously common academic practice of encoding positions as special tokens and instead outputs coordinates directly as numbers, which significantly improves stability.
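One practical consequence of plain numeric output is that ordinary text tools can read it, with no special token vocabulary to decode. The sketch below is illustrative only: the article does not specify Phi-Ground's output format, so the "(x, y)" style strings and the function name parse_click are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches integers or decimals, e.g. "512" or "383.5".
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_click(text: str) -> Optional[Tuple[float, float]]:
    """Read the first two numbers in the model's text output as (x, y).

    Returns None when fewer than two numbers are present.
    """
    numbers = _NUMBER.findall(text)
    if len(numbers) < 2:
        return None
    return float(numbers[0]), float(numbers[1])
```

A caller would still need to know whether the numbers are absolute pixels or normalized values, which this sketch leaves open.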

Strategically, Microsoft is concentrating resources on unified vision-language-action modeling. Open-sourcing the 4B version lowers the barrier to entry for developers, while pairing it with Microsoft's own large-model planning capabilities closes the loop on an end-to-end computer-control agent, accelerating the shift from chat AI to agents that operate computers autonomously.

Like Anthropic's Claude Computer Use and OpenAI's Operator, Phi-Ground belongs to the early expansion stage in which GUI agents move from closed demonstrations to high-performance open-source implementations. Training that accounts for the target hardware (especially high-resolution screen simulation) is becoming a core differentiator.

Structural judgment: this is essentially a technological substitution. With 'text before image + plain numeric coordinates + DPO visual enhancement', Phi-Ground lets AI take over human mouse operations directly and addresses the long-standing bottleneck of visual localization. This pushes capital and developers to shift from general-purpose large models toward specialized GUI agent toolchains, accelerating AI's replacement of desktop productivity work.

ABAB News · Cognitive Law

Coordinates are not a special language, just ordinary numbers; the simpler, the more stable.
An AI that reads the instruction before looking at the image is the one that can truly 'find things'.
Whoever trains the AI mouse to be more accurate than humans will master the pricing power of the next generation of human-computer interaction.

Source: ABAB News