Free Data, Trillion-Dollar Impact: How Common Crawl Feeds the Entire AI World
The first thing to clarify is who the "founders" actually are. By Common Crawl's own wording, the sole official founder is Gil Elbaz; the official history page and a 2011 retrospective show that Carl Malamud and Nova Spivack joined the board in 2008, while Ahad Rana was the early engineer who built the first crawler and processing pipeline. The strict formulation is therefore: Gil Elbaz is the founder; Nova Spivack and Carl Malamud are the earliest board-level institutional co-builders; and Ahad Rana is the key early technical builder.
Common Crawl is not a conventional “database company.” It is closer to a public infrastructure layer for the internet age. It is a 501(c)(3) nonprofit founded in 2007, has preserved open-web crawl data continuously since 2008, and—by its 2026 official description—now maintains more than 10 PiB of archives spanning more than 15 years and more than 300 billion pages, while continuing to publish new crawls monthly, typically with more than 2 billion pages per release. The data is hosted for free through Amazon Web Services’ Open Data Sponsorship Program.
What transformed Common Crawl from a technical public good into a globally important infrastructure was large-model training. Mozilla Foundation’s 2024 research on 47 public text-generation LLMs released from 2019 to 2023 found that at least 64% used filtered versions of Common Crawl, and the GPT-3 paper explicitly states that a majority of its training tokens came from filtered Common Crawl. In other words, Common Crawl moved from being an “open web archive” to being one of the foundational raw-material markets for generative AI.
Its early timeline can be compressed into four key jumps. The project was conceived in 2007 with the goal of making web-scale crawl data available to researchers who could not afford search-engine-scale infrastructure; it began collecting data in 2008 with a custom Hadoop-based crawler; it moved onto AWS in 2012 and also received donated metadata from blekko to improve coverage and spam filtering; then in 2013 it switched to CCBot based on Apache Nutch and moved from ARC to the more standard WARC format. This was the phase in which Common Crawl turned from an idealistic concept into a sustainable data factory.
The second major jump happened around the AI boom. In 2019 Google used a Common Crawl snapshot to build C4 for T5; in 2020 GPT-3 pulled Common Crawl into the spotlight; in 2023 Rich Skrenta became executive director; in 2024 the organization joined the End of Term Web Archive; and from late 2024 through 2026 it launched projects such as the Web Languages Project, the Opt-Out Registry, and CommonLID. At that point it was no longer just "crawling websites"; it was also building rules, quality signals, annotation systems, and research-community coordination.
From a 2026 vantage point, Common Crawl’s strategic value lies less in any single technical breakthrough than in continuity. It is not a one-off dataset; it is an openly reusable, steadily updated lower layer that downstream researchers and companies can repeatedly clean, filter, and reconstruct. Once that kind of infrastructure creates path dependence, it becomes very difficult to replace, because papers, derivative corpora, training pipelines, and policy debates start growing around it.
The public record of Gil Elbaz's family background is thin: exact birth details and his parents' occupations are not widely documented. What can be confirmed is that he grew up in Cincinnati and San Antonio, and that he was obsessed from childhood with almanacs, weather data, and numerical patterns; by his own telling, his parents "weren't mathematicians," yet he wanted them to create problem sets for him. That detail matters because it explains the through-line of his later career: semantic advertising, open data, and Common Crawl all stem from the same instinct to treat reality as something that can be structured, indexed, and computed.
Educationally, Gil graduated from the California Institute of Technology in 1991 with a double major in Engineering/Applied Science and Economics. That combination became the template for most of his career: hard technical systems building on one side, and a constant focus on scale, market structure, and information value on the other. Public talks and profiles also suggest that he never saw himself as a conventional PhD-track academic; he was much closer to a broad-spectrum technical entrepreneur aimed at large real-world systems.
Professionally and entrepreneurially, Gil worked after college in engineering and database-related roles at IBM, Sybase, and SGI; in 1998 he co-founded Applied Semantics; after Google acquired it in 2003, he became engineering director for Google Santa Monica; and in 2007 he founded Factual. Common Crawl emerged as the nonprofit branch of this broader entrepreneurial chain: Applied Semantics demonstrated the business value of semantic understanding in search and advertising; Factual demonstrated the business value of structured open data; and Common Crawl pushed the same worldview into public infrastructure.
That is why Common Crawl should not be read as a charitable side project of Gil Elbaz's. It is better understood as the institutional externalization of his career philosophy. The 2011 official retrospective explicitly says he believed that falling storage and bandwidth costs, together with lower barriers to big-data processing, made an open web-scale crawl repository both feasible and worth building. His broader network of nonprofit and civic commitments, including XPRIZE and family philanthropy, reinforced that "technology plus public infrastructure" path.
Nova Spivack's upbringing was very different from Gil's. Public sources indicate that he grew up in the Boston area in a hybrid family environment shaped by art and invention: his father was an inventor and artist and his mother a poet. He later recalled a household that did not revolve around television, but instead reserved multiple rooms for making things, painting, and inventing. The deepest influence on Nova was not just technology; it was the idea that technology, art, spirituality, and future-thinking could coexist in the same worldview.
Nova's education ran through Oberlin College, where he moved among computer science, art history, studio art, and philosophy of mind; he later spent a long period in a Tibetan Buddhist monastery in Nepal and has said that this experience gave him a real sense of direction and purpose for the first time. After returning to the United States, he worked at the intelligent-filtering company Individual Inc. and then co-founded EarthWeb, entering the worlds of online communities, developer media, and semantic-web thinking at a very early stage. Within Common Crawl, his contribution is best understood as vision, semantic-web perspective, and open-internet ideology rather than the hands-on execution of crawler engineering.
Carl Malamud's place in Common Crawl's history is closer to that of an institutional ally from the open-public-information movement than that of a technical founder. The official 2011 retrospective states that he joined the board in 2008. His more central long-term identity is as the founder of Public.Resource.Org and a major advocate for public access to government and legal information. Little about his family background, childhood, or formal education is publicly documented. From a career standpoint, though, bringing him onto the board added a strong layer of public-interest legitimacy, legal framing, and openness politics to Common Crawl.
In terms of early technical execution, the person who arguably deserves separate emphasis is Ahad Rana rather than any board director. The official 2011 retrospective explicitly credits him with building the early crawler and processing pipeline, which even at that early stage let the organization cover roughly 5 billion pages and expose analytically useful metadata such as PageRank and link graphs. In other words, Common Crawl was never just "scrape first, think later"; from the beginning it tried to be an analyzable, computable, reusable open-data infrastructure.
If Common Crawl is analyzed as an infrastructure asset, its most important real assets are not its logo or slogans, but four layers: the long-lived archive, a regularly published index, a crawler and priority system centered on CCBot, and the host/domain-level Web Graph. The official site explains that raw data is stored in WARC, with WAT (metadata) and WET (extracted plain text) derivatives plus CDXJ and columnar indexes, and that Harmonic Centrality and PageRank are used in crawl prioritization. Together, these form a web-scale data plant that very few institutions can build and maintain over long periods.
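To make the index and archive layers concrete, here is a minimal sketch (not an official client) of the usual access pattern: look up a URL in a monthly CDX index, then fetch only the matching WARC record by byte range. It assumes the publicly documented CDX Server API at index.commoncrawl.org and hosting at data.commoncrawl.org; the crawl label CC-MAIN-2026-12 is taken from this article, and any current crawl would work the same way.

```python
# Minimal sketch: query one URL in a Common Crawl CDX index, then fetch just
# the matching gzipped WARC record via an HTTP range request.
import gzip
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2026-12-index"  # crawl label from the article
DATA = "https://data.commoncrawl.org/"

def lookup(url: str) -> dict:
    """Return the first index hit for `url` as a dict (offset, length, filename, ...)."""
    query = urllib.parse.urlencode({"url": url, "output": "json", "limit": "1"})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        return json.loads(resp.read().splitlines()[0])

def fetch_warc_record(hit: dict) -> bytes:
    """Download only the byte range holding this record and decompress it."""
    offset, length = int(hit["offset"]), int(hit["length"])
    req = urllib.request.Request(
        DATA + hit["filename"],
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())  # one self-contained gzip member

if __name__ == "__main__":
    hit = lookup("commoncrawl.org")
    record = fetch_warc_record(hit)
    print(record[:400].decode("utf-8", errors="replace"))  # WARC + HTTP headers
```

The range-request pattern is the reason the columnar and CDXJ indexes matter: downstream users can pull single records out of petabytes of archive without downloading whole crawl segments.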
Its “brand capital” operates on three levels. First, it is free, open, and reproducible, which lowers research barriers. Second, it updates continuously, creating a de facto default raw-material market for open-web text and metadata. Third, it benefits from network effects: researchers, open-source communities, and model builders create derivative corpora, cleaning pipelines, and benchmarks on top of it. The official About page says it has been cited in more than 12,000 papers, while the organization’s own citation analysis shows growth from 30 Google Scholar citations in 2012 to 1,777 in 2023.
In capital and revenue terms, Common Crawl is nearly the inverse of a normal database startup. It has no standard equity-funding story, no SaaS subscription model, and no user-facing API monetization layer. It survives through donations, sponsorship, and infrastructure support. The official About page says its primary funding comes from the Elbaz Family Foundation; the organization's 2025 public response further says that for roughly fifteen years it was supported almost entirely by that foundation, with only relatively recent and smaller donations from AI companies. At the same time, AWS absorbs a key hosting burden through the Open Data Sponsorship Program, effectively shifting one of the infrastructure's heaviest cost centers onto a platform sponsor.
The financial data shows how it scaled from a relatively small public-interest engineering project into an AI-era infrastructure institution. ProPublica's IRS-based records show revenue of only $75,000 in 2020; $330,000 in 2021; $451,000 in 2022; then a jump to about $1.30 million in 2023 and $1.47 million in 2024, with 2024 expenses of about $1.33 million and net assets of about $1.47 million. It remains small, but it is no longer a side project running on volunteer energy; it now has an executive director, a CTO, research engineers, and legal capacity.
Governance has also gone through a generational transition. In the 2023 tax filing, Nova Spivack, Gil Elbaz, and Carl Malamud still appeared as unpaid directors and officers. But by the 2026 team page, the board listed Gil as chair, Eva Ho as a board member, and Michael Birnbach as treasurer, with Rich Skrenta running operations as executive director. That transition matters because it shows Common Crawl moving from a founder-driven idealistic project toward a more durable institution with managerial, research, legal, and partnership layers.
Its collaboration network also shows that it is no longer just a “raw data warehouse.” The collaborators listed on the official site include the Allen Institute for AI, Hugging Face, MLCommons, EleutherAI, Johns Hopkins University, and the International Internet Preservation Consortium, among others. In other words, the network it depends on is not a classical VC network but a hybrid of cloud platforms, open-source communities, research institutions, and policy forums.
Common Crawl’s greatest success is not that it built a famous end-user product. Its real achievement is that it turned “crawling the open web” from an ability reserved for large search companies into a reusable public resource for researchers, entrepreneurs, and open-source communities. Many later datasets, data-cleaning pipelines, and training corpora could start from an already crawled internet instead of paying the enormous cost of crawling it themselves. Mozilla’s assessment is apt: Common Crawl improved access to training data, increased competition, and provided some additional visibility and transparency.
In concrete AI outcomes, Common Crawl is not the direct trainer of models, but it sits upstream of almost everyone. Google built C4 from a Common Crawl snapshot to train T5; GPT-3 drew a majority of its training tokens from filtered Common Crawl; and the official site lists derived systems such as CCNet, OSCAR, Pile-CC, RefinedWeb, FineWeb, and CommonLID. At this stage, Common Crawl’s position looks much more like a crude-oil market than a finished consumer brand—the big money is made by refiners, but the whole system still depends on the pipeline feeding them raw input.
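To illustrate what "refining" means in practice, here is a toy sketch of the heuristic filtering step that derivatives like C4, RefinedWeb, and FineWeb apply to extracted page text before training. The thresholds below are illustrative assumptions, not values from any of those pipelines; real systems add deduplication, language identification, quality classifiers, and PII scrubbing on top.

```python
# Toy sketch of refiner-style quality filtering over extracted plain text.
# All thresholds are illustrative assumptions, not values from published pipelines.

def looks_like_training_text(doc: str) -> bool:
    """Cheap quality heuristics over one extracted plain-text document."""
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if len(lines) < 3:                       # too short to be useful prose
        return False
    words = doc.split()
    if len(words) < 50:                      # skip stub pages
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):       # likely gibberish or code dumps
        return False
    terminal = sum(ln.endswith((".", "!", "?", '"')) for ln in lines)
    if terminal / len(lines) < 0.3:          # mostly menus / boilerplate lists
        return False
    if "lorem ipsum" in doc.lower():         # placeholder text
        return False
    return True

def filter_corpus(docs):
    """Yield only the documents that pass the heuristics above."""
    for doc in docs:
        if looks_like_training_text(doc):
            yield doc
```

Even this crude version makes the "crude oil versus refinery" point: the archive supplies raw capture, while most of the value that reaches a model is created by layers of filtering like this.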
Another important turning point is that the organization has begun trying to repair its most criticized weaknesses. The official site acknowledges chronic overrepresentation of English, which led to the launch of the Web Languages Project in late 2024; then in 2026, together with MLCommons, EleutherAI, and Johns Hopkins, it released CommonLID, which covers 109 languages. This shows a shift in institutional identity: it is no longer only about scale; it is increasingly about quality, distribution, fairness, and annotation.
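As a rough illustration of what per-document language annotation involves, the sketch below uses the widely available fastText lid.176 model as a stand-in, since CommonLID's own interface is not described here; the confidence threshold is an illustrative assumption.

```python
# Sketch of per-document language annotation (stand-in for a CommonLID-style layer).
# Requires the third-party fasttext package and the public lid.176.bin model file.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

def annotate_language(doc: str, min_confidence: float = 0.65):
    """Return (iso_code, confidence) for one document, or (None, confidence) if unsure."""
    labels, probs = model.predict(doc.replace("\n", " "))  # predict works on one line
    lang = labels[0].replace("__label__", "")
    prob = float(probs[0])
    return (lang, prob) if prob >= min_confidence else (None, prob)

def language_histogram(docs):
    """Count confidently identified languages across a sample of documents."""
    counts = {}
    for doc in docs:
        lang, _ = annotate_language(doc)
        if lang:
            counts[lang] = counts.get(lang, 0) + 1
    return counts
```

Histograms like this are the raw material for the distribution and fairness questions the project is now taking on: you cannot repair language imbalance you have not measured.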
But its controversies are serious, and they are not peripheral. The first major category is copyright and deletion requests. In 2024 the Danish Rights Alliance demanded removal of Danish media content, and Wired reported that Common Crawl said it would comply. In 2025, The Atlantic published an investigation claiming that Common Crawl’s archives still retained large quantities of material from publishers including The New York Times and questioned both the reported removal progress and the accuracy of the site’s search tool. Common Crawl publicly rebutted the accusation, arguing that it had not lied to publishers and that removing historical archive material is technically difficult. The core dispute, then, is not merely whether robots.txt was respected; it is the gap between historical scraping, deletion promises, interface visibility, and downstream AI reuse.
The second controversy is positional. In its 2025 submission to the UK copyright-and-AI consultation, Common Crawl openly supported a clearer legal exception for text and data mining and argued that the "right to read" should encompass the "right to mine." That position means the organization can no longer plausibly present itself as a neutral warehouse; it has become an active policy actor. To advocates of open data, that is a necessary intervention. To copyright holders, it means Common Crawl has moved from infrastructure into rule-shaping.
The third controversy is data hygiene and security. In 2025, researchers scanning the December 2024 Common Crawl archive found 11,908 still-valid API keys and passwords. Strictly speaking, this does not mean Common Crawl intentionally leaked secrets, but it does show that "raw open-web capture plus large-scale reuse" can amplify ordinary public-page misconfigurations into systemic risk. For anyone using Common Crawl as training material, it means that post-processing, redaction of sensitive data, and rights governance cannot simply be outsourced to Common Crawl alone.
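For downstream users doing that post-processing themselves, here is a minimal sketch of the kind of secret scanning one might run over extracted page text before reuse; the regex patterns are illustrative assumptions, not the method used by the 2025 researchers.

```python
# Minimal sketch: flag and redact likely credentials in extracted page text.
# Patterns are illustrative, not exhaustive, and not the 2025 study's method.
import re

SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)\b(api[_-]?key|secret|password|token)\s*[:=]\s*\S{8,}"
    ),
}

def find_candidate_secrets(text: str):
    """Return (pattern_name, matched_text) pairs worth reviewing or redacting."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits

def redact(text: str) -> str:
    """Replace any candidate secret with a fixed placeholder before reuse."""
    for pattern in SECRET_PATTERNS.values():
        text = pattern.sub("[REDACTED]", text)
    return text
```

The point of the sketch is the division of labor the article describes: the archive faithfully captures what was publicly exposed, so scrubbing it is a responsibility that each downstream pipeline has to carry itself.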
The fourth line of criticism concerns representativeness and bias. Mozilla’s work and Stefan Baack’s FAccT paper both argue that, despite its huge scale, Common Crawl is not a neutral mirror of the entire internet; crawl prioritization, language coverage, filtering methods, and the unequal capacity of downstream players to clean the data all transmit bias into later models. Common Crawl’s own recent public discussion of these issues suggests that it also recognizes that the “open is enough” era is over.
As of April 2026, Common Crawl’s real-world position can be summarized in one sentence: it is not the most profitable company, and it is not the most visible AI product, but it is a crucial middle layer connecting the open web, academic research, open-source data curation, and large-model training. The latest official crawl is CC-MAIN-2026-12; its team page shows a full engineering, research, legal, and program-management structure; and it continues to publish Web Graphs, crawl statistics, CommonLID, and new examples/resources tooling. Its historical place is not primarily in a profit-and-loss statement, but in path dependence: once papers, cleaning pipelines, training corpora, benchmarks, and policy controversies all start organizing around it, it stops being merely an archive and becomes an institutional substrate. From here on, the key question will not simply be how much it crawls, but whether it governs credibly, whether its deletion mechanisms are trustworthy, whether language and copyright imbalances are being repaired seriously, and whether it behaves more like a public library or an AI raw-material supplier.