Frontier Data and Physical AI: The New Gold Rush of AI


In the past decade, artificial intelligence has grown primarily by feeding on a single resource: public web data. Texts, images, documents, forums, news, blogs, repositories… an enormous mass of material that models have absorbed to build their linguistic and cognitive abilities. But this phase is about to end.

According to projections cited by Messari, the total amount of public text available for model training—approximately 300 trillion tokens—could be completely exhausted between 2026 and 2032. This means that large models have “eaten the internet,” and now they need something else. The next frontier for AI will no longer be the web: it will be the real world.

And this is where the concept of frontier data comes into play, the resource that will define the competitiveness of future models: video, audio, sensory, motor, and robotic data; action data; data generated through interaction with the physical world or with complex digital interfaces. This data cannot simply be downloaded: it must be collected, coordinated, verified, and, above all, incentivized.

For this reason, the blockchain is not a detail or a marginal addition: it is the infrastructure that enables the orchestration of this new data economy.


The End of “Web Scraping” and the Beginning of High-Value Data

The most advanced models of 2025—not only linguistic but also multimodal, agentic, and reasoning-oriented—no longer improve with the mere addition of generic textual datasets. They require something much more specific and much more expensive to collect: data that reflects actions, intentions, movement, interaction, manipulation, context.

This is the case, for example, with computer-use agents: AI systems capable of interacting directly with a computer as a human would. To train these systems, textual descriptions are not enough: “trajectories” are needed, actual recordings of people performing tasks on screen.

A protocol like Chakra, mentioned in the report, has developed an extension that allows users to record their screen while performing everyday tasks: navigating management software, preparing an Excel spreadsheet, editing images, using professional tools. These recordings become invaluable training material for models like GLADOS-1, the first computer-use model built almost entirely on crowdsourced data.
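To picture what such a trajectory contains, the sketch below models it as a timestamped sequence of screen captures paired with input events. It is a minimal, hypothetical schema: the `ScreenEvent` and `Trajectory` names and fields are illustrative assumptions, not Chakra's actual format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a computer-use "trajectory": a timestamped
# sequence of screen states paired with the input events that produced
# them. Illustrative only; not Chakra's actual data format.

@dataclass
class ScreenEvent:
    timestamp_ms: int       # milliseconds since recording start
    screenshot_ref: str     # pointer to the captured frame (file name or hash)
    event_type: str         # "click", "keypress", "scroll", ...
    payload: dict           # event details, e.g. {"x": 412, "y": 88} for a click

@dataclass
class Trajectory:
    task_description: str   # what the user was asked to do
    app_context: str        # e.g. "spreadsheet", "image editor"
    events: list[ScreenEvent] = field(default_factory=list)

# A two-step fragment: click a cell, then type a formula.
traj = Trajectory(
    task_description="Sum a column in a spreadsheet",
    app_context="spreadsheet",
    events=[
        ScreenEvent(0, "frame_0001.png", "click", {"x": 412, "y": 88}),
        ScreenEvent(850, "frame_0002.png", "keypress", {"text": "=SUM(A1:A20)"}),
    ],
)
print(f"{len(traj.events)} events recorded for task: {traj.task_description}")
```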

And this is precisely the point: this data does not exist until someone produces it. And it must be paid for, just as energy or inference is paid for.


The Increasing Value of Gameplay-Action Pairs

Another striking example comes from the gaming world. A platform like Shaga, born as a decentralized cloud gaming network, produces an extremely valuable byproduct: so-called Gameplay-Action Pairs (GAPs), synchronized pairs of what happens on screen and the inputs the player issues.

This is data that cannot be retrieved simply by watching videos on YouTube: it has to be captured at the source, on the player’s device. And this type of dataset, according to estimates reported by Messari, can be worth $50–$100 per hour of gameplay.

To put that into context: Shaga has already accumulated over 259,000 hours of gameplay, with an estimated value of more than 26 million dollars. And it is no coincidence that OpenAI, a year earlier, offered half a billion dollars to acquire Medal, a similar platform specializing in gameplay recording.
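The valuation arithmetic is straightforward. Here is a minimal sketch assuming the $50–$100/hour range reported by Messari; the `GameplayActionPair` structure is an illustrative assumption about what a single pair might contain, not Shaga's actual format.

```python
# Back-of-the-envelope valuation of a GAP dataset, using the $50-$100/hour
# range reported by Messari. The pair structure is an illustrative
# assumption: one rendered frame synchronized with the player's inputs.

from dataclasses import dataclass

@dataclass
class GameplayActionPair:
    frame_ref: str      # pointer to the rendered frame
    inputs: list[str]   # controller/keyboard state at that frame, e.g. ["W", "MOUSE1"]
    timestamp_ms: int

HOURS_COLLECTED = 259_000       # hours Shaga has reportedly accumulated
VALUE_PER_HOUR = (50, 100)      # USD per hour, per Messari's estimate

low = HOURS_COLLECTED * VALUE_PER_HOUR[0]
high = HOURS_COLLECTED * VALUE_PER_HOUR[1]
# The top of the range roughly matches the ~$26M figure cited above.
print(f"Estimated dataset value: ${low:,.0f} to ${high:,.0f}")
# -> Estimated dataset value: $12,950,000 to $25,900,000
```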

This data is used to train world models: models that do not merely interpret language but simulate physics, causality, and agent-environment interaction. These are the models that will enable smarter robots, autonomous agents, advanced forecasting systems, and AI capable of “moving” through complex environments.


Physical AI: Intelligence Entering the Physical World

And this is precisely where we arrive at the second major wave of frontier data: robotic data.

The AI of the future will not only reside in data centers. It will live in robots, drones, autonomous cars, distributed sensors, and smart home devices. Each robot will need data to learn how to move, identify objects, make decisions, and manipulate its environment. And collecting this data is incredibly costly: it requires physical hardware, human operators for teleoperation, continuous maintenance, and coordination.

Projects like PrismaX, BitRobot, GEODNET, and NATIX are beginning to use the incentive mechanisms typical of Web3 to distribute this cost across a global network of contributors. Instead of a single company collecting robotic data, thousands of users can do so in a coordinated way, receiving direct compensation.

It’s the same logic as mining, but instead of computational power, the contribution here is real-world data.


Machine-to-Machine Coordination: When AI Acts in the Real World

If robots and AI agents truly begin to interact with the physical world, a completely new level of coordination is required. Robots will need to:

  • identify each other,
  • make payments,
  • purchase services,
  • consume data,
  • execute tasks in a verifiable manner,
  • prove they have performed an action,
  • rely on shared ledgers of identity and reputation.

This is where initiatives like OpenMind and Peaq come in, attempting to build onchain infrastructure dedicated to machine communication and identity. An equivalent of DNS, but for machines: a system where drones, autonomous cars, robotic arms, or industrial systems can signal their presence, certify their actions, pay other systems, and exchange services.
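What a “DNS for machines” might store per device is easy to sketch: an identity plus signed attestations of performed actions. The record below is a hypothetical illustration of the pattern, not OpenMind's or Peaq's actual schema; the hashing is a stand-in for real signatures or proofs.

```python
import hashlib
import time
from dataclasses import dataclass, field

# Hypothetical onchain-style identity record for a machine.
# Illustrative only; not the actual OpenMind or Peaq schema.

@dataclass
class ActionAttestation:
    action: str      # e.g. "delivered_package"
    timestamp: int   # Unix time
    proof: str       # digest standing in for a signature or ZK proof

@dataclass
class MachineIdentity:
    machine_id: str          # stable identifier, e.g. a public-key hash
    machine_type: str        # "drone", "robotic_arm", ...
    reputation: float = 0.0
    attestations: list[ActionAttestation] = field(default_factory=list)

    def attest(self, action: str) -> None:
        # In a real system this would be a signature verified onchain;
        # here a hash of (machine, action, time) is used as a stand-in.
        ts = int(time.time())
        digest = hashlib.sha256(f"{self.machine_id}:{action}:{ts}".encode()).hexdigest()
        self.attestations.append(ActionAttestation(action, ts, digest))
        self.reputation += 1.0   # naive reputation: one point per attested action

drone = MachineIdentity(machine_id="pk_hash_a1b2c3", machine_type="drone")
drone.attest("delivered_package")
print(drone.reputation, drone.attestations[0].proof[:16])
```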

It is the beginning of the machine economy, an economy populated by non-human entities that interact autonomously on decentralized networks.


Certified Real Data: The Role of IoTeX and DePIN Networks

The report also places significant focus on IoTeX, a protocol that in recent years has transformed its infrastructure into a comprehensive platform for the collection, certification, and orchestration of real-world data.

IoTeX enables the connection of sensors, IoT devices, home systems, and industrial equipment, providing:

  • a verified onchain identity for each device,
  • a data aggregation system,
  • a cryptographic attestation layer based on zero-knowledge (ZK) proofs,
  • APIs that allow AI agents to use that data in real time.

Today, IoTeX coordinates over 16,000 devices and dozens of vertical projects, providing AI agents with the ability to access verified data from the real world. A significant difference compared to simple scraping.
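As a rough illustration of what verified real-world data looks like to a consuming agent, the sketch below pairs a sensor reading with its attestation metadata. The fields and the acceptance check are assumptions for illustration, not IoTeX's actual API.

```python
from dataclasses import dataclass

# Hypothetical shape of an attested sensor reading as an AI agent might
# consume it. The fields are illustrative, not IoTeX's actual API.

@dataclass
class AttestedReading:
    device_id: str     # onchain identity of the sensor
    metric: str        # e.g. "air_temperature_c"
    value: float
    timestamp: int     # Unix time
    attestation: str   # e.g. a ZK-proof reference certifying device and integrity

def accept(reading: AttestedReading, trusted_devices: set[str]) -> bool:
    # A consuming agent's minimal policy: only use readings from devices
    # with a verified onchain identity and a non-empty attestation.
    return reading.device_id in trusted_devices and bool(reading.attestation)

reading = AttestedReading("iotex_dev_0042", "air_temperature_c", 21.4,
                          1_735_000_000, "zkproof:abc123")
print(accept(reading, trusted_devices={"iotex_dev_0042"}))  # True
```

This is the difference from scraping in miniature: the agent is not trusting a web page, it is checking who produced the data and whether its integrity was attested.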


The Endpoint: Data as a Financial Asset

According to Messari, the trajectory is clear: data is becoming a financial asset in every respect. Just as one can invest today in compute, GPUs, and colocation, in the future it will be possible to invest in “data streams”: purchase usage rights, back the networks that collect frontier data, and receive economic returns in exchange.

It’s an almost inevitable evolution: if data becomes scarce, valuable, and difficult to produce, it will then have a market, a price, demand, and supply.

Blockchain, once again, is the ideal layer to:

  • coordinate this economy,
  • verify its integrity,
  • trace data provenance,
  • distribute compensation (a minimal sketch follows this list),
  • protect users,
  • support global scalability.
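To make the compensation point concrete, here is a minimal sketch of how a network might split revenue from a data stream pro-rata among contributors; the function and the numbers are illustrative, not any specific protocol's logic.

```python
# Minimal pro-rata payout: revenue from a data stream split according to
# each contributor's verified share of the data. Illustrative only.

def distribute_revenue(revenue_usd: float,
                       contributions: dict[str, float]) -> dict[str, float]:
    total = sum(contributions.values())
    return {who: revenue_usd * amount / total
            for who, amount in contributions.items()}

# 1,000 USD of stream revenue split across three contributors by hours supplied.
payouts = distribute_revenue(1_000.0, {"alice": 120.0, "bob": 60.0, "carol": 20.0})
print(payouts)  # {'alice': 600.0, 'bob': 300.0, 'carol': 100.0}
```

Onchain, the same logic becomes programmable and auditable: contributions are verified at the source, and payouts follow automatically from the recorded shares.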

Conclusion

AI will not advance through ever-larger models, but through richer data, sourced from the real world and collected via global networks of contributors. It is the greatest gold rush of the next decade: not the rush for chips, but the rush for data.

Web3 protocols are not a mere detail: they are the natural platform for collecting, verifying, distributing, and compensating those who provide this data. If the web was the raw material of the first AI wave, the real world will be the raw material of the second.

And this time, for the first time, the collection will not be controlled by a few giants, but by the networks.

Open, incentivized, decentralized networks: the new infrastructure of frontier data.


