In-Depth

The Transformer Revolution: How Eight Inventors Rewrote AI Architecture and Power


Scope of Invention
The first thing to clarify is who “invented the Transformer.” By the strictest public-document standard, the Transformer was not the solo invention of one person. It was a collective invention by the eight coauthors of the 2017 paper Attention Is All You Need: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. The paper’s footnote explicitly says “Equal contribution. Listing order is random,” and then explains each person’s role in detail. That footnote matters a great deal, because it directly rules out the popular simplification that only the first author should count as the “real” inventor.

The invention did not emerge from nowhere. It appeared at a moment when sequence transduction was hitting real bottlenecks. The dominant approaches at the time were RNNs, LSTMs, GRUs, and encoder-decoder systems augmented with attention, but those systems were either hard to parallelize, inefficient over long dependencies, or limited by long computational paths. What made the Transformer radical was not that it “discovered attention” for the first time; it was that it pushed the design to its logical extreme by removing recurrence and convolution entirely and letting self-attention become the central computational primitive. That made the model far better suited to massively parallel hardware and reshaped how later large models could be trained and scaled.
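To make that concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the primitive the paper builds everything around. The function name, toy dimensions, and random projection matrices are illustrative assumptions, not the paper’s reference code; the point is simply that every position attends to every other position in one matrix operation, with no step-by-step recurrence to serialize the computation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token representations for one sequence.
    # Wq, Wk, Wv: projections to queries, keys, and values.
    # There is no loop over positions: every token attends to every other
    # token via one matrix product, which is what makes the computation
    # so friendly to massively parallel hardware.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                                        # each output is a weighted mix of values

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                    # (5, 8)
```

Because the whole score matrix is computed at once, sequence length no longer dictates the number of sequential steps, which is exactly the property that made the design suit parallel hardware.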

The original paper’s results were not just interesting; they were decisive. It reported 28.4 BLEU on WMT 2014 English-to-German and 41.0 BLEU on English-to-French; the big model trained in 3.5 days on eight P100 GPUs, and the base model in about 12 hours. So the Transformer was not only more accurate; it was also cheaper to train and easier to scale. The later large-model boom was built first on trainability and systems efficiency, and only then on the visible product layer.

Even its naming and framing carried a “generality ambition” from the beginning. Based on the contemporaneous Google blog post and later media reconstructions, the model was never treated as merely a translation trick. It was framed almost immediately as a general architecture that could transfer across tasks and modalities. In the August 2017 official blog post, the team already highlighted parsing and projected future use in images and video. In other words, the Transformer was born not as a narrow translation model, but as a scalable computational framework for learning.

By 2026, the paper’s citation counts vary by database, but every count indicates extraordinary influence. Google Research and the NeurIPS listing show more than 240,000 citations, while Semantic Scholar reports 172,905. The discrepancy reflects database and indexing differences, not disagreement about significance. By any serious measure, it is one of the defining AI papers of the century.

Portraits of the Eight Co-Inventors
Vaswani’s trajectory looks like a classic path from engineer to foundational researcher to platform entrepreneur. Public interviews show that he is the son of an architect and a doctor, that he grew up in Oman and later moved to Nagpur at age 15, and that he was influenced both by Indian scientists and by the Microsoft founding story. After studying computer science at BIT Mesra, he worked in IT, then left industry for graduate study at USC, where he completed a master’s and then a PhD in 2014 on statistical machine translation. The decisive intellectual shift in his story was not allegiance to one guru, but his recognition that deep learning was where the next real breakthroughs would happen.

Professionally, Vaswani made his key leap at Google Brain, where he moved from statistical machine translation and NLP into much more general architectural design. The paper’s footnote states that after Jakob proposed the self-attention-over-RNN direction, Ashish and Illia designed and implemented the first Transformer models, and it emphasizes that Ashish was involved in nearly every aspect of the work. After the paper, he co-founded Adept in 2021 and then Essential AI in 2023, where he became CEO. Adept focused on models that take software actions, while Essential emphasizes enterprise AI systems and open-science frontier models.

Shazeer’s public image is that of an unusually strong systems researcher with strong product instincts. Public information about his family background is limited, but his career path is clear: he graduated from Duke, joined Google in 2000, improved spelling correction for search, and later worked on core ad systems. By the time of the Transformer paper, he was already one of the most senior contributors in the group. His own website explicitly credits him with multi-head attention, the residual architecture, and the first superior implementation; Google Research now lists him as Gemini co-tech-lead. He co-founded Character.AI in 2021, then returned through the Google-Character licensing-and-rehire arrangement in 2024, and in 2026 he was elected to the U.S. National Academy of Engineering.

Parmar’s story is almost the opposite of the standard elite academic pathway. She grew up in a lower-middle-class family in Pune; her mother had once wanted to become an architect but could not pursue that path, and that unrealized ambition drove her to back her daughter’s ambitions. Parmar did not get into IIT, taught herself AI instead, and when she first arrived in the United States for graduate study, her father and uncle had to borrow money to keep her afloat. Public reports differ on the exact name of her undergraduate institution: NDTV renders it as Pune Institute of Technology, while Forbes India says Pune Institute of Computer Technology. What is clear is that she completed a master’s in computer science at USC from 2013 to 2015 and then joined Google.

Parmar’s role in the invention was far more substantial than the common “third author” shorthand implies. The paper’s footnote says she designed, implemented, tuned, and evaluated “countless model variants” in both the original codebase and Tensor2Tensor. That means she was not merely packaging results or helping with paper writing; she was central to turning an unstable invention into a scalable research program. She joined Google at age 24 as one of the youngest members of the team and one of the few contributors without a PhD, later co-founded Adept, served as its CTO, co-founded Essential, and by 2025 had moved into a technical role at Anthropic. Her long-term importance lies not only in symbolism, but in extending the Transformer into vision, audio, and 3D settings.

Uszkoreit is the person who looks most like the group’s high-level architectural designer. Unlike many AI founders, he came from a household that was already deeply computational and linguistic: in his a16z interview, he says his father was a computer scientist and computational linguist and that dinner-table discussions included Turing machines and finite automata. What matters most is that Google Translate convinced him that machine learning could be both scientifically difficult and immediately product-relevant; that realization pulled him decisively back into Google. Publicly available material is much clearer on his career than on the full details of his degrees, but that career is unmistakable: Google Translate, Google Assistant semantic parsing, Google Brain Berlin, and then Inceptive.

In the original invention, Uszkoreit’s most important role was directional. The paper explicitly says he proposed replacing RNNs with self-attention and started the effort to evaluate the idea. Later, he was also the author of the official Google blog post that introduced the model publicly. Afterward, he carried the same worldview into biology by founding Inceptive, which applies deep learning and experimentation to RNA and what he calls “biological software.” That continuity reveals his structural role: he is not merely an algorithm tinkerer, but someone who repeatedly searches for new domains where the “sequence-representation-generation” logic can dominate.

Public information on Jones’s private background is relatively sparse, but his educational and career path is clear. He comes from a Welsh/U.K. background, completed a BSc in AI and Computer Science and an MSc in Advanced Computer Science at the University of Birmingham, and said in the university’s alumni material that the school’s reputation substantially helped him get into Google even without a referral. Professionally, he spent more than a decade at Google before co-founding Sakana AI with David Ha and Ren Ito and becoming its CTO.

Jones’s contribution to the Transformer was also very concrete. The paper says he handled the initial codebase, efficient inference, visualizations, and ongoing model-variant experimentation. He was the kind of person who helps turn an elegant paper idea into a real research system: something that can run, compare, ablate, and convince others. That same character is visible in Sakana’s later direction, which is less about building a mass-market chatbot and more about running a research-first lab with a distinct Tokyo and partially open-source identity.

Gomez was the youngest of the eight and one of the earliest to convert Transformer-era scientific influence into an enterprise platform. Public sources show that he was an undergraduate researcher at the University of Toronto, worked with Roger Grosse, and interned and did research at Google Brain, collaborating with both student peers and senior researchers. His personal website explicitly states that he was an undergraduate student of Roger Grosse, an intern of Łukasz Kaiser and Geoffrey Hinton, and later a doctoral student of Yarin Gal and Yee Whye Teh at Oxford. On the family side, a McKinsey profile says his parents deeply encouraged learning, and that his mother was British, studied dance, and became a librarian after moving to Canada. That combination of technical and humanistic input helps explain why his later company narrative consistently emphasizes the human side of AI.

Gomez made two especially consequential decisions. The first was entering Google Brain at the undergraduate stage and moving directly from student researcher to co-inventor. CNBC still frames him in retrospect as a Google Brain intern who helped coauthor the paper that conceptualized the Transformer. The second was leaving the academic or quasi-academic path to co-found Cohere and anchor himself in enterprise AI rather than consumer chatbot hype. As for whether his Oxford doctorate was formally completed, public materials are not perfectly consistent: Oxford’s research group page long described him as a doctoral student, while LinkedIn shows a 2018–2024 study interval.

Kaiser is the most clearly “theoretical computer scientist turned deep learning architect” among the eight. Public biographies say he was born in Wrocław, studied mathematics and computer science at the University of Wrocław, completed his PhD at RWTH Aachen, and then worked as a tenured researcher in Paris on logic and automata theory before moving into Google’s semantic parsing work and later Google Brain. Public information on his family background is limited, but his intellectual formation is very clear: he entered modern AI not from product engineering but from logic, formal methods, and automata theory.

In the Transformer project, Kaiser’s importance was infrastructural and organizational. The paper says he and Gomez spent “countless long days” building Tensor2Tensor, replacing the earlier codebase, improving results, and drastically accelerating research. Career-wise, he is distinctive because he did not quickly turn his fame into a startup brand. Instead, he remained in high-leverage institutional research. Public materials later place him at OpenAI, contributing to GPT-4 long-context work and appearing in 2025-era talks and papers as a co-creator of both the Transformer and TensorFlow-level infrastructure.

Polosukhin was the earliest among the eight to turn Transformer-era credibility into a decentralized AI infrastructure narrative. Public sources say he was born in Ukraine, studied applied mathematics and computer science at Kharkiv Polytechnic, moved to California after finishing his master’s, and then joined Google Research. Wired’s reconstruction is especially useful here: it describes him as working on direct-answer systems for Google Search, where the latency budget was brutally tight, which made efficiency and performance constraints central in his thinking.

The paper footnote says that Polosukhin, together with Vaswani, designed and implemented the first Transformer models. But an equally important turning point came before the paper’s global fame fully arrived: he left Google in early 2017 and later co-founded NEAR Protocol in 2018. Today his public identity is no longer limited to “Transformer coauthor”; it is increasingly tied to decentralized, user-owned, verifiable, privacy-preserving AI. By 2026, business reporting depicts him as actively advocating AI agent infrastructure that is auditable and not excessively dependent on any one company.

Collaboration Process and Turning Points
The actual division of labor inside the paper is almost the full explanation for why the invention succeeded. Uszkoreit provided the central direction of replacing RNNs with self-attention; Vaswani and Polosukhin built the first working models; Shazeer introduced scaled dot-product attention, multi-head attention, and the parameter-free position representation; Parmar and Jones expanded the search space through model variants, tuning, code improvements, visualization, and efficient inference; Kaiser and Gomez transformed the whole process through Tensor2Tensor. The Transformer, then, was not just “an idea.” It was the convergence of idea, implementation, systems engineering, tooling, tuning, and organizational coordination.
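As a rough illustration of what multi-head attention adds on top of the single attention operation sketched earlier, here is a schematic NumPy sketch. The shapes, the weight-matrix names, and the head-splitting details are assumptions for illustration, not the team’s original implementation; the core idea is to run several lower-dimensional attention “heads” in parallel, then concatenate and remix their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model) projections.
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def project_and_split(W):
        # Project, then slice the result into n_heads pieces of width d_head.
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = map(project_and_split, (Wq, Wk, Wv))         # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                            # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                     # mix the heads back together

# Toy usage with made-up sizes.
rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (6, 16)
```

Each head works in a smaller subspace, so the total cost stays comparable to single-head attention while the model can attend to different kinds of relationships at once.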

That is also why the Transformer looks more like an industrial-research victory than a lone-genius breakthrough. Uszkoreit later described the project in precisely those terms: not as one overwhelming spark, but as the integration of prior attention work, optimizers, modeling judgment, implementation advances, and hardware-aware scaling. That observation is crucial because it explains why the most successful follow-on work came not from superficial paper imitation, but from labs that also had compute, systems, and research infrastructure.

The publication timeline was also unusually compressed. The paper appeared on arXiv on June 12, 2017. Google’s official explanatory blog post followed on August 31, 2017. The paper was then published at NeurIPS 2017. So the interval between “working internal result” and “publicly defining a new era” was only a matter of months. The Transformer was not a slow-burn idea; it accelerated through paper release, tooling, follow-on experiments, and adoption almost immediately.

A compressed timeline looks roughly like this: 2017, the paper defines the architecture; from 2019 to 2021, the authors begin to split into differentiated organizational paths; in 2021 Adept is founded and Shazeer moves toward Character.AI; from 2019 through 2024 Cohere evolves from a high-profile research startup into an enterprise platform; in 2023 Essential and Sakana gain strong capital backing; and from 2023 to 2025 Inceptive, NEAR AI, Anthropic, OpenAI, and Gemini-related roles show how the original Transformer logic branched into biology, enterprise AI, open agent infrastructure, and frontier closed-model development.

Organizations, Capital, and Business Models
If you look only at the paper, these eight people are coauthors. If you extend the time horizon to 2026, they look more like an industrial network that radiated outward from Google Research and Google Brain into enterprise AI, consumer chat, frontier labs, bio-AI, Japan-based research labs, and decentralized AI infrastructure. Of the eight, Kaiser is the least startup-oriented in his public profile; the other seven all converted scientific prestige into some combination of companies, platforms, ecosystems, or investable organizational power.

Vaswani and Parmar followed a path from research architecture to agentic software and then to enterprise foundation stacks. Adept aimed to make models take actions inside software rather than merely generate text. Reuters reported in 2023 that the company raised a fresh $350 million, bringing total funding to roughly $415 million. After leaving Adept, they co-founded Essential, which announced a $56.5 million Series A in 2023 with investors including Google, NVIDIA, AMD, and Thrive Capital. Their real asset is not only equity; it is the market’s belief that they can continue defining the next software substrate.

Shazeer’s business path is closer to “research capability directly commercialized into conversational products and then partially reabsorbed by a tech giant.” Character.AI became one of the earliest major consumer products built around role-play and companion-style conversation at scale. Reuters reported that it had previously raised $193 million and reached a $1 billion valuation in 2023. The more consequential development was the Google licensing-and-rehire deal in 2024, which turned a single researcher-founder’s market value into something large enough to be discussed in multibillion-dollar strategic terms.

Gomez’s business model matured earlier than many peers into a classic enterprise software path. Cohere did not define itself as “another ChatGPT”; instead it leaned into compliance, private deployment, long-term contracts, and workflow integration for businesses. Reuters reported in 2025 that annualized revenue had reached $100 million, that about 85% of the company’s business came from private deployments, and that valuation in different 2025 reports ranged from around $5.5 billion to $6.8 billion depending on timing and round. Its real asset is not just model weights, but trusted deployment architecture, enterprise channels, and governance posture.

Uszkoreit’s Inceptive represents a different kind of commercial translation altogether: moving Transformer-era sequence intuition into RNA and therapeutic design. Public reporting says Inceptive first raised roughly $20 million in seed financing and then another $100 million in 2023 from backers including NVIDIA, Andreessen Horowitz, and Obvious Ventures. This is not an API business. It is a deep platform play built around experiments, biological sequence design, and generative modeling in life sciences.

Jones’s Sakana emphasizes a research-lab identity, a Tokyo base, and selective open release. The company announced a $30 million seed round in 2024, framed its mission around nature-inspired intelligence, and quickly released Japanese-language models, some of them open. Its assets therefore include equity and team quality, but also a very distinct brand position: not a Silicon Valley clone, but a Japan-origin research-first alternative AI narrative.

Polosukhin’s NEAR path is different again. NEAR Protocol is, on the surface, a blockchain network, but its current narrative clearly centers on NEAR AI, AI agents, privacy-preserving infrastructure, and user-owned AI. Its resource structure relies less on the classic VC-to-IPO path and more on protocol economics, ecosystem building, tokenized governance, and developer networks. For him, the real asset is not one product but an attempt to define a different ownership and trust model for the AI era.

Kaiser’s situation is the most unusual. He did not bind his public identity to an independent startup. Instead, he embedded his value in research infrastructure and frontier-model work inside major organizations: TensorFlow, Tensor2Tensor, the Transformer, GPT-4 long-context contributions, and later reasoning-related work. People like this do not necessarily own famous product brands, but they often hold disproportionate influence over the internal direction of model systems and research programs.

Achievements, Controversies, and Present Position
The most impressive thing these eight people achieved was not just publishing a massively cited paper. They changed AI’s default building block. Before 2017, recurrence still looked like the natural default in NLP; after 2017, self-attention progressively became the dominant scaffold. And the architecture did not stop at language. It expanded into vision, music, code, biology, agents, and multimodal systems. Google’s own blog already hinted at image and video directions in 2017, and the authors’ later careers effectively became a human timeline of those expansions.

The outside world remembers them today not because all eight names became universally famous, but because together they now occupy many of the key forks in modern AI: Shazeer on the Gemini and consumer-chat axis, Gomez on enterprise AI platforms, Vaswani and Parmar on agent automation and enterprise stacks, Uszkoreit on AI-biology, Jones on new research-lab models and Japanese AI work, Polosukhin on decentralized AI infrastructure, and Kaiser on frontier-model engineering. They are not merely historical figures; they are still actively shaping the field.

The most visible public controversies around this group are concentrated in their later commercialization paths, not in the 2017 paper itself. Mainstream coverage has not centered on serious academic misconduct allegations regarding the paper. The recurring debates are instead about open versus closed development, consumer products versus enterprise deployment, and whether large incumbents are reabsorbing talent through licensing and deal structures. Google’s Character-related arrangement was reported in the context of broader scrutiny around how big tech acquires AI talent; Cohere has openly favored enterprise deployments over mass-consumer novelty; Vaswani has publicly argued for open science; and Polosukhin increasingly argues for user-owned, privacy-first, verifiable AI. The deepest argument is no longer over authorship. It is over who will control power in the Transformer era.

Condensed to one sentence, the conclusion is this: the Transformer was not invented by one heroic genius, but by an eight-person team that simultaneously aligned theoretical judgment, implementation quality, tools, systems knowledge, organizational resources, and industrial ambition; and their later divergence now looks like a miniature map of the modern AI industry itself. If you want to understand today’s conversational AI, enterprise deployment, agent automation, RNA design, Japanese local models, decentralized AI, long-context systems, and reasoning models, many of those traces lead back to the same 2017 collaborative footnote.