The New Oil: Data, AI Supremacy, and the Rise of the Algorithmic State
- Mac Bird
- Jun 30
- 14 min read

I. Introduction: Data is the New Oil
In the early 20th century, oil reshaped empires and determined the fate of nations. In the early 21st century, another extractive resource has begun to reorganize the global balance of power: data.
Not just any data—but the highly structured, high-context linguistic tokens that fuel the training of large language models (LLMs). These tokens are fast becoming the foundational substrate of modern intelligence infrastructure. They are the new oil: essential to military advantage, economic productivity, and cultural hegemony.
As artificial intelligence systems grow in scale and sophistication, they increasingly rely on immense reserves of text, speech, and code. This corpus—made up of books, emails, forum threads, legal documents, and conversation logs—is refined into tokens: discrete units of text used to train machine learning models. The larger and more nuanced the token supply, the more capable the resulting model.
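The refinement step is worth making concrete. The sketch below is a toy byte-pair-encoding-style trainer—a simplified illustration, not any lab's production tokenizer—that repeatedly merges the most frequent adjacent pair of symbols, the same basic procedure used to build vocabularies for GPT-style models:

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy byte-pair encoding: learn merge rules from a tiny corpus.

    Each word starts as a sequence of characters; every step fuses the
    most frequent adjacent symbol pair into a single new token.
    """
    words = [list(w) for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return merges, words

merges, tokenized = bpe_train("low low low lower lowest", 3)
print(merges)     # learned merge rules, most frequent first
print(tokenized)  # words re-expressed as learned tokens
```

Production tokenizers run this procedure at byte level over terabytes of text, learning tens of thousands of merges; the principle—compressing frequent patterns into reusable tokens—is the same.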
For a few years, the digital world seemed infinite. The internet was overflowing with human-generated data, from Reddit threads to Wikipedia pages, Substack essays, YouTube transcripts, and source code archives. Early models like GPT-2 and GPT-3 feasted on this content, achieving breakthroughs in natural language understanding. But as model size increased and token requirements grew exponentially, a problem emerged: we were consuming the internet faster than we were producing it.
OpenAI’s GPT-4 was reportedly trained on over a trillion tokens [1]. Gemini, Claude, and Meta’s LLaMA models are all competing for similar training volumes. Meanwhile, the global supply of high-quality, organic (i.e., human-created and context-rich) text data is rapidly shrinking. By some estimates, the stock of publicly available English-language internet text suitable for LLM training may be exhausted within a few years [2].
This scarcity has triggered a global scramble. Governments, corporations, and intelligence agencies are racing to secure what remains. Deals are being struck with publishers, social platforms, and data brokers. Legal battles over scraping and copyright are heating up. And behind the scenes, a new era of data extraction and privatization is unfolding—one that mirrors the early geopolitics of oil.
Just as petroleum-rich regions became strategic flashpoints in the 20th century, nations rich in language, culture, and text—Russia, India, Brazil, China—are now being reassessed as reservoirs of untapped token reserves. In this emerging landscape, the question is no longer who has the fastest chips or biggest models, but who controls the words, stories, and symbols they’re trained on.
This is the opening act of a new kind of resource war—one waged not over land or minerals, but over meaning. And like oil before it, the extraction and control of tokens will shape the architecture of global power for decades to come.
II. Data Scarcity and the Rise of the Token Lords
As the global supply of high-quality, publicly accessible text data approaches exhaustion, the balance of power is shifting. A new elite has emerged—those who control the remaining reservoirs of organic tokens. This elite includes major tech corporations, data-rich authoritarian states, and the intelligence-military apparatuses that increasingly partner with them.

Synthetic Data: The Mirage of Infinite Supply
Facing the reality that the web cannot supply infinite high-quality data, many firms have turned to synthetic data—text generated by AI itself and then used as training input for newer models. On the surface, this seems like a viable solution: scale the training corpus by letting the machine dream.
But synthetic data comes with a cost. A 2024 study in Nature showed that models trained on their own outputs suffer from what researchers call “model collapse”—a phenomenon where successive generations of AI grow less accurate, less diverse, and more brittle as they feed on echo chambers of prior predictions [1]. The findings demonstrated that after just a few iterations, the statistical diversity of model outputs degrades dramatically, eroding the very capacities AI was designed to enhance.
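The collapse dynamic is easy to reproduce in miniature. The sketch below is an illustrative toy, not the Nature study's methodology: it fits a Gaussian to data, samples a new "training set" from the fit while under-representing the tails, refits, and repeats—and the measured diversity shrinks generation after generation:

```python
import random
import statistics

random.seed(42)

def sample_truncated(mu, sigma, n, cut=2.0):
    """Sample from the fitted Gaussian, but drop the tails --
    mimicking a model that under-represents rare events
    in its own outputs."""
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= cut * sigma:
            out.append(x)
    return out

# Generation 0: "human" data with full diversity.
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

spreads = []
for gen in range(10):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    spreads.append(sigma)
    # Each successive generation trains only on the previous
    # generation's tail-clipped outputs.
    data = sample_truncated(mu, sigma, 1000)

print(f"generation 0 diversity (std dev): {spreads[0]:.2f}")
print(f"generation 9 diversity (std dev): {spreads[-1]:.2f}")
```

The tail-clipping stands in for a known property of generative models: rare events in the training distribution become rarer still in the outputs, so each self-trained generation sees a narrower slice of the original variation.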
Other research has found that carefully controlled mixtures of synthetic and human-generated data may mitigate these effects [2], but the consensus is clear: synthetic data is not a long-term replacement for human language, culture, and nuance.
The Token Arms Race
This scarcity has triggered a race. Tech companies are no longer simply building bigger models—they are scrambling to secure what training data remains.
Reddit signed a licensing deal with OpenAI reportedly worth around $60 million a year, granting access to its vast archive of user-generated discussions [3].
Stack Overflow and Wikipedia have both explored monetizing access or restricting usage by scrapers.
The New York Times has sued OpenAI and Microsoft for unauthorized use of its archives in GPT training [4].
Google, Amazon, and Meta have retreated further into proprietary ecosystems, harvesting data from Gmail and Android, from Alexa, and from Instagram and WhatsApp, respectively.
The open internet—the original commons—is being enclosed. In its place, we are seeing the construction of corporate data silos and national data vaults, each guarded by legal, technical, and geopolitical firewalls.
The Platform Advantage
Those who own the platforms have a distinct edge. Google does not need to scrape; it owns the search queries. Meta does not need permission; it owns the conversation. Amazon controls both the ecommerce logs and the smart speakers listening in living rooms. Microsoft’s control of GitHub gives it one of the largest code corpora in the world.
This creates a new form of vertical integration: platforms that generate the data, capture the data, refine it into tokens, and train their models on it. As smaller actors are priced out of access to quality data, a winner-takes-most dynamic solidifies.
In this environment, open-source models face an uphill battle—not because the algorithms are worse, but because the fuel supply has been monopolized.
Data as Property, Data as Power
The transformation of data into a tradeable, securitized, and restricted resource mirrors the transformation of oil into geopolitical leverage. And just like oil, data is now a site of conflict:
Workers sue for rights to the data generated by their labor.
Governments pass localization laws to prevent data from leaving national borders.
Corporations launch legal and lobbying campaigns to prevent regulation of scraping and training.
The token economy is consolidating around those who already have massive reserves—and those willing to mine new ones from the social, linguistic, and behavioral records of billions of people, often without informed consent.
III. The Token Cold War: Geopolitics of Data Access
As access to quality training data becomes more valuable than ever, the geopolitical dimensions of the AI race are becoming stark. Control over tokens—especially culturally rich, underutilized ones—has begun to take on the strategic significance that control over oil once held in global affairs.

China: The Sealed Cognitive Superpower
China’s government has walled off its internet ecosystem from the rest of the world for over a decade. Behind this firewall lies an immense and culturally coherent corpus: chat logs from WeChat, posts on Weibo, ecommerce data from Alibaba, livestream transcripts from Douyin, and state-produced media. This data is deeply annotated, highly structured, and largely inaccessible to Western companies or researchers.
Estimates place the size of China’s unshared linguistic corpus in the hundreds of trillions of tokens. The value of this data is amplified by its cohesion—it reflects not just a language, but a worldview. China’s LLMs, such as Baidu’s Ernie and Alibaba’s Tongyi, are trained on a foundation that is both vast and shielded from external influence. This creates a closed-loop AI ecosystem that reinforces national narratives, cultural logic, and state-defined priorities.
While Western models struggle with bias, hallucination, and cultural gaps when deployed in non-Western contexts, Chinese models benefit from a massive reservoir of aligned, policy-consistent data. This provides the Chinese state with a significant soft power advantage: the ability to create AI systems that resonate with local populations across the Global South while reinforcing Chinese perspectives on global issues.
Russia: Eurasia’s Token Reservoir
Russia occupies a unique position in the token race. It holds vast reserves of high-quality, under-scraped content—ranging from academic publications and military manuals to news media, technical forums, and centuries of literary tradition.
Unlike China, Russia’s data is not tightly sealed, but it is fragmented, multilingual (Russian, Tatar, Chechen, etc.), and distributed across platforms that are poorly integrated into Western AI pipelines. As of 2025, neither the U.S. nor China has comprehensive access to these reserves.
This makes Russia a contested frontier. Whoever gains access to Russian tokens can dramatically expand their AI model’s geopolitical, cultural, and linguistic reach across Eastern Europe, Central Asia, and the Middle East. In a world increasingly shaped by LLMs, the side that wins Russia’s data wins an epistemic advantage across multiple civilizational fault lines.
This may explain the flurry of quiet partnerships being forged between Russian digital infrastructure providers and foreign firms—not unlike the oil extraction agreements that once defined Eurasian geopolitics.
The Global South: The Next Battleground
Latin America, Africa, Southeast Asia, and South Asia collectively represent the largest pool of untapped linguistic and cultural data on Earth. As internet penetration expands, these regions are generating trillions of tokens in new languages, dialects, and hybrid cultural codes every year.
Yet much of this data falls into one of three categories:
Scraped without consent by Western AI firms,
Locked behind national firewalls (e.g., India’s proposed data localization policies), or
Poorly annotated and excluded from training sets due to a lack of infrastructure.
India alone may possess 200 trillion tokens across 20+ languages—tokens that remain largely unintegrated into major Western LLMs. African content, especially in French, Arabic, Swahili, Yoruba, and Amharic, remains underrepresented in nearly every major model.
The geopolitical implications are clear: whoever builds trust, infrastructure, and equitable data partnerships across the Global South will control the next frontier of AI.
And unlike the oil age, when extraction occurred without regard for local empowerment, the AI age presents an opportunity to build co-governed systems. If seized, that opportunity could redefine what global digital justice looks like.
IV. From Surveillance Capitalism to Data Mercantilism
The AI data race isn’t occurring in a vacuum. It builds on two decades of surveillance capitalism—an era in which tech companies freely harvested user behavior, preferences, movements, and communications. But the emerging phase is different: the extraction of meaning itself. Not just what people do, but what they believe, say, imagine, and write.
This signals a transition to something more extreme: data mercantilism. A world in which language, culture, and thought are harvested as national assets, commodified, and traded in opaque markets or hoarded behind firewalls.
From Clicks to Cognition
In the early 2000s, digital platforms monetized attention. Google tracked clicks. Facebook cataloged likes and connections. Amazon optimized purchases. That era was marked by behavioral profiling—tracking what users did and targeting them with ads. But LLMs demand more. They require structured, semantically rich language—paragraphs, conversations, arguments, dreams.
Thus begins the mining of cognition. User-generated content becomes grist for the algorithmic mill. Entire libraries of human thought—forums, books, chat transcripts, legal opinions, therapy sessions, and voice assistant commands—are now being scraped and tokenized.
And not just for commercial purposes. Intelligence agencies have begun to integrate LLMs into signals intelligence workflows. Governments are building classified language models trained on foreign press, intercepted communications, and internal datasets. The dream of perfect surveillance now includes predictive narrative analysis: what groups might think, feel, or believe tomorrow.
This melding of commercial and state-level token collection has birthed a hybrid apparatus: the cognitive-industrial complex. As with the military-industrial complex before it, it blurs the line between public mission and private profit, between strategic asset and exploitable commodity.
The Rise of Token Sovereignty
Nation-states have begun asserting control over domestic data in ways that echo historical trade restrictions on grain, oil, or steel. India’s data localization rules, China’s Cybersecurity Law, Russia’s digital sovereignty agenda—these are not mere privacy policies. They are strategic efforts to build domestic stockpiles of linguistic capital.
In parallel, we are seeing calls for token sovereignty: the idea that communities, not corporations, should control how their cultural data is used. This has taken the form of lawsuits (e.g., The New York Times v. OpenAI), unionization efforts by creative workers, and data trusts proposed by Indigenous and linguistic minorities.
We are entering an era in which words, phrases, idioms, and syntax—the very structure of cultural expression—are becoming contested property. Who owns Yoruba syntax? Who profits from Ukrainian folklore? Who controls the digital afterlife of human language?
Those who control the pipelines of cultural data will wield disproportionate influence—not just over technology, but over history, education, law, and belief itself.
V. The Weaponization of Meaning: Case Studies in AI-Driven Control
While token extraction is often discussed in abstract or commercial terms, its real-world consequences are already being felt in the exercise of state power. Nowhere is this more evident than in the weaponization of AI models trained on privatized or state-controlled data. These models are increasingly deployed to shape, surveil, and suppress populations.

Case Study: AI-Assisted Facial Recognition Against Journalists
In 2023, investigative journalists in Serbia reported being tracked using facial recognition systems tied to Chinese-made surveillance cameras, trained on local facial data collected without consent [1]. The journalists had been covering corruption and government collusion with criminal networks. Public-facing CCTV systems equipped with facial recognition flagged their movements across public transportation hubs, protest sites, and cafes.
This is not an isolated case. In the United States, AI-powered surveillance technologies—built in part on data extracted from social media and public event footage—have been used to monitor Black Lives Matter protestors, environmental activists, and immigration advocates [2].
Once AI systems are trained on culturally and politically salient tokens—visual or linguistic—they can be tasked with flagging, profiling, and preemptively identifying individuals likely to dissent.
Case Study: Algorithmic Deportation and Predictive Risk Models
In 2024, reports from immigrant rights groups in California revealed that ICE had begun testing LLM-powered triage tools trained on vast arrays of historical deportation records, social media activity, and court transcripts [3]. These models scored undocumented individuals by risk of “noncompliance,” potential criminal activity, and even ideological alignment based on scraped writings or texts. Those flagged received expedited deportation orders, sometimes without due process.
Palantir, a key defense contractor, reportedly integrated its predictive policing toolset with OpenAI-derived language classifiers to produce "deportation dossiers"—narrative summaries of individuals generated from fragmented digital trails.
While ICE has denied using generative AI in final decision-making, internal memos and contractor documents reviewed by whistleblowers show otherwise. These systems increasingly function as automated justifiers, cloaking racial profiling and political targeting beneath layers of mathematical abstraction.
The Deep Logic of Control
In both cases, what’s being deployed is not AI in the abstract—but meaning made computational. The ability to interpret someone’s language, facial expression, or digital trail as evidence of risk or deviance is only possible because massive LLMs have been trained on tokenized representations of humanity.
This raises a fundamental question: Who gets to decide what behavior means? In a world where language models are arbiters of credibility, threat, or legality, the training data becomes destiny.
Every data decision—what is scraped, what is included, what is excluded, what is weighted—becomes a decision about who counts, who belongs, and who is suspect.
And unlike traditional police or legal tools, these systems operate beneath the surface, obscured by proprietary code, trade secrecy, and claims of AI neutrality. They are trained on us, but not for us.
We are watching the emergence of AI as a semi-invisible structure of meaning enforcement. Not merely surveillance, but ontological governance: determining what is true, what is possible, what is dangerous.
And the implications extend far beyond deportation or protest suppression. From banking to healthcare, education to employment, models trained on privatized token archives will increasingly shape the terrain of opportunity—and exclusion.
Those with access to the full corpus of meaning will govern not just what machines can say, but what they can see, think, and justify. In that sense, the token war is also a war over reality itself.
VI. Toward a Post-Token Future: Resistance, Regulation, and Renewal
If tokens are the new oil, then the token war demands not just extraction but resistance. The rise of a data-based global order has not gone uncontested. Across the world, artists, technologists, policymakers, and grassroots movements are mounting challenges to the monopolization and militarization of language.
Resistance from the Margins
From the favelas of Brazil to the digital collectives of the Philippines, activists are building parallel datasets—corpora grounded in indigenous languages, LGBTQ+ experiences, refugee narratives, and unrecognized dialects. These are not merely symbolic projects; they are acts of epistemic defiance. By asserting the right to shape and train AI on their own terms, these groups resist being flattened into mere statistical residue.
Groups like the Masakhane project (for African NLP) and the Indigenous AI Network are producing open-source, culturally aware data that challenges the dominance of Anglo-centric, corporately harvested token regimes.
Legislative and Legal Pushback
Courts and legislatures are beginning to confront the implications of AI trained on scraped data. The European Union’s AI Act and Digital Services Act introduce requirements for transparency, data provenance, and the right to opt out of model training. In the United States, copyright lawsuits (e.g., The New York Times v. OpenAI) may redefine the boundaries of fair use in the age of AI.
Several nations are proposing "model labeling" laws, requiring AI outputs to disclose whether and how the underlying model was trained on sensitive or private content. Others are exploring licensing regimes—where cultural or linguistic communities receive royalties for training data use.
While the effectiveness of such measures remains to be seen, their very emergence signals a recognition: tokens are not neutral. They are political, economic, and ontological constructs—ones that must be governed with care.
Envisioning AI for Collective Liberation
The deepest critique of the token economy is not that it is unjustly distributed—but that it is unimaginative. What if AI were not built to serve advertising and enforcement, but healing and emancipation?
Imagine:
AI models trained to revitalize endangered languages.
Generative systems that collaborate with communities to encode oral histories.
Models that center restorative justice, not predictive punishment.
To build such systems, we must rethink the token pipeline. We must ask:
Who curates the corpus?
Who decides what stories matter?
Who benefits from the model’s predictions?
The future of AI will not be decided by algorithms alone. It will be shaped by the collective struggles over who controls the words, meanings, and memories that train them.
And in that struggle, there is hope. Because unlike oil, tokens can be rewritten. New stories can be told. New patterns can be learned.
We are still early in the age of language machines. The pipelines are being laid. The parameters are still being tuned. The future is still, in a profound sense, up for training.
References
- https://www.businessinsider.com/reddit-openai-data-license-deal-2024-5
- https://www.nytimes.com/2023/12/27/technology/new-york-times-openai-lawsuit.html
- https://carnegieendowment.org/2023/10/12/china-s-ai-nationalism-and-firewalled-training-data
- https://www.brookings.edu/articles/russia-data-sovereignty-and-ai-geopolitics/
- https://www.eff.org/deeplinks/2023/06/scraping-and-data-ownership-who-controls-public-web
- https://balkaninsight.com/2023/08/30/facial-recognition-serbia-journalist-surveillance/
- https://www.justiceinitiative.org/voices/the-algorithms-behind-deportation
- https://restofworld.org/2024/indigenous-ai-open-source-local-language/
