Logan Kilpatrick: AI Application Companies Should Build Their Own Benchmarks
Logan Kilpatrick, who leads product for Google AI Studio and the Gemini API and previously headed developer relations at OpenAI, stated that every company built on AI should establish its own proprietary benchmarks.
He believes this is an effective way to ensure that gains from model advancements accrue disproportionately to the company itself.
On the market side, AI application companies are pouring resources into building their own evaluation systems, with funding and talent shifting toward the optimization of internal benchmarks. Companies that build differentiated moats stand to benefit, while general-purpose model suppliers that rely on public benchmarks may face short-term pressure.
Source: Public Information
ABAB AI Insight
Logan Kilpatrick previously led developer products and the API ecosystem at OpenAI, driving the rollout of ChatGPT plugins and custom GPTs, and he has repeatedly emphasized that enterprise clients need to differentiate through private data and proprietary evaluations. In 2023-2024 he helped several Fortune 500 companies build internal GPT benchmarks.
His playbook at OpenAI steered resources toward "enterprise customization," encouraging clients to invest engineering effort in building benchmarks on top of the API and fine-tuning interfaces. The motivation was to bind large customers tightly to the OpenAI platform rather than let them switch easily, much as ChatGPT Enterprise quickly accumulated high-paying users.
Similar cases include early SaaS companies such as Notion and Airtable, which built internal productivity benchmarks to drive iteration, and Stripe, which outperformed generic solutions through its own payment benchmarks. In Kilpatrick's view, the AI application layer is now at a critical transition from reliance on general models to vertical control.
Essentially, this represents a transfer of pricing power: public benchmarks (such as MMLU and HumanEval) push all companies into homogeneous competition. Once an enterprise builds domain-specific benchmarks, it can optimize prompts, fine-tuning, and RAG precisely for its own business pain points, so that more of the gains from model advances stay within its private closed loop. The mechanism is that data and evaluation barriers reprice the marginal returns of general AI, concentrating them in companies that establish proprietary flywheels.
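The internal-benchmark mechanism described above can be sketched as a small evaluation harness: a set of business-specific prompts, each paired with a programmatic check, scored as a pass rate. This is an illustrative sketch only; the case names, the invoicing domain, and the stub model are assumptions, not anything Kilpatrick or OpenAI has published.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One domain-specific test: a prompt plus a pass/fail check on the output."""
    prompt: str
    check: Callable[[str], bool]


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Hypothetical internal benchmark for an invoicing assistant.
cases = [
    EvalCase(
        prompt="Extract the total from: 'Invoice #123, total due: $450.00'",
        check=lambda out: "450.00" in out,
    ),
    EvalCase(
        prompt="What currency is used in: 'Betrag: 99,95 EUR'?",
        check=lambda out: "EUR" in out.upper(),
    ),
]


def stub_model(prompt: str) -> str:
    # Stand-in for a real API call (e.g., a fine-tuned or RAG-augmented model).
    return "450.00 USD" if "Invoice" in prompt else "The currency is EUR."


print(run_eval(stub_model, cases))  # → 1.0
```

Rerunning the same harness against each new model release is the "private closed loop": improvements are measured against the company's own pain points, not a public leaderboard.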