Aakash Gupta: AI is Shifting from Chat Boxes to Screen Perception Interfaces
Aakash Gupta stated that Farza has developed an AI that can see the user's screen directly next to the cursor and point to answers, while Google launched a similar feature at the same time, both betting that chat boxes are just a transitional interface.
Using tools like Cowork, users only need to screenshot a competitor's pricing page and ask questions, and Claude can directly read the layout, identify value proposition hierarchies, upgrade trigger points, and target enterprise customers, with the entire process taking only 45 seconds without the need for text descriptions.
Developers and product teams are accelerating the integration of screen context AI functionalities driven by events, benefiting platforms like Farza, Google, and Cowork from improved interaction efficiency, while AI tools relying on traditional prompt input face pressure to be replaced, with funding shifting towards native AI interfaces with visual perception.
Source: Public Information
ABAB AI Insight
Aakash Gupta has been tracking the evolution of AI product interactions for a long time and has previously analyzed the transition from prompt engineering to multimodal agents. This viewpoint continues his advocacy for the idea that "AI should directly see the user's environment," aligning closely with the simultaneous launch of screen perception tools by Farza and Google.
In terms of capital flow, AI companies are integrating screen sharing and visual understanding capabilities, mobilizing computing resources towards real-time contextual routing, motivated by the aim to eliminate the inefficient step of "users translating screens into text." Strategically, this upgrades AI from a conversational assistant to a parallel work partner, significantly enhancing productivity and expanding enterprise subscription penetration.
Similar to the leap from text chat to GPT-4V image understanding, the current AI interface is in an expansion phase transitioning from chat box dominance to cursor-layer/screen-native interaction. Completing competitive analysis in 45 seconds has become a new benchmark.
Essentially, this is a restructuring of the industry chain driven by technological substitution. The removal of translation steps in screen context changes the pricing power structure of human-computer interaction, with the mechanism being that AI directly "seeing" the user interface greatly reduces cognitive friction, prompting capital to concentrate from pure language models to multimodal visual agents, achieving a structural upgrade of AI from "assisting descriptions" to "direct action" workflows.
ABAB News · Cognitive Law
The more powerful the chat box, the higher the user's translation cost, which will ultimately be replaced by visual alternatives. AI truly begins to work, rather than just converse, when it sees what you see. Every interface that requires description is a hidden native AI opportunity.