We are running out of data
What happens when AI models have digested everything
Artificial intelligence systems are powerful pattern recognition machines. They learn by absorbing large volumes of text, images, code, and other data, then spotting the statistical regularities that link one fragment to the next. The more data they see, the more subtle the patterns they can capture.
In 1950, Claude Shannon built a mechanical mouse called Theseus that learned its way through a maze. Each time it hit a wall, a relay circuit flipped. Over time, the mouse “remembered” which paths were blocked and could find the route to the goal. The entire learning process relied on just 40 binary data points.
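The mechanism can be sketched in a few lines of code. This is an illustrative toy, not Shannon's actual relay design: the maze and coordinates are invented, and the "memory" is simply a set of moves that turned out to be blocked, each one a single bit of experience.

```python
# Toy sketch of Theseus-style learning (hypothetical maze, not the
# real relay circuit): memory is a set of blocked moves, and each
# collision with a wall records one binary data point.

WALLS = {((0, 0), (0, 1)), ((1, 0), (1, 1))}  # hypothetical maze walls

def try_move(pos, nxt, memory):
    """Attempt a move; on hitting a wall, flip a 'relay' (record the bit)."""
    if (pos, nxt) in WALLS or (nxt, pos) in WALLS:
        memory.add((pos, nxt))  # remember: this move is blocked
        return pos              # bounce back
    return nxt

memory = set()
pos = (0, 0)
# First run: the mouse blunders into walls and records them.
for nxt in [(0, 1), (1, 0), (1, 1)]:
    pos = try_move(pos, nxt, memory)

print(sorted(memory))  # the learned bits: which moves are blocked
```

On a second run, the mouse can consult `memory` before moving and skip any move it has already learned is blocked, which is all the "route-finding" amounts to.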
AI models today are trained on trillions of data points and use deep neural networks with billions of parameters to represent complex patterns and relationships.
The underlying logic is the same as Theseus: experience accumulates as data, and behaviour improves as the system captures more structure. What has changed is scale.
If you plot the amount of training data used by notable AI systems over time on a logarithmic scale, the trend is striking: the training data behind leading models has roughly doubled every nine to ten months. That shift in scale helps explain the step change people feel when they use new AI tools.
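A back-of-the-envelope check makes the pace concrete. The ten-month doubling period is the figure above; the rest is arithmetic:

```python
# Growth implied by a 10-month doubling time.
# (The doubling period is the article's figure; the rest is arithmetic.)

DOUBLING_MONTHS = 10

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after `months` of constant doubling."""
    return 2 ** (months / doubling_months)

print(growth_factor(12))   # ~2.3x per year
print(growth_factor(60))   # 64x over five years
```

In other words, at this pace a training corpus grows by more than an order of magnitude roughly every three and a half years, which is why a fixed stock of human-written text becomes a binding constraint so quickly.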
When a model has seen that much data, it has been exposed to almost everything that shows up in everyday conversation, technical writing, corporate communications, and online debate.
There is, however, a hard constraint in the background.
Humanity only produces so much high-quality text, code, and labelled data each year. Books, academic papers, news articles, legal documents, public code repositories, and long-form commentary are finite resources. As models grow, the demand for this kind of material grows with them.
So, what happens when models have been trained on most of the data available?
At that stage, simply adding more data is no longer the path to better models.
One response is synthetic data: AI systems can generate huge quantities of artificial text and use it for further training. That is already happening in narrow domains. But if models train too heavily on their own outputs, the errors and biases in those outputs loop back into the training data and compound over time, a failure mode researchers have dubbed "model collapse".
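This feedback risk can be caricatured with a deterministic toy model. The numbers here are invented for illustration, not measurements: assume each round of training on the previous model's outputs slightly sharpens the distribution (a stand-in for mode-seeking bias) and slightly shifts its mean (a stand-in for systematic error).

```python
# Toy caricature of the synthetic-data feedback loop (not a real
# training run). Assumed, hypothetical parameters: each generation
# shifts the mean by `bias` and shrinks the variance by `sharpen`.
# Small per-step errors compound into large drift over generations.

def next_generation(mean, var, bias=0.02, sharpen=0.95):
    """One round of training on the previous model's own outputs."""
    return mean + bias, var * sharpen

mean, var = 0.0, 1.0  # the "true" data distribution
for generation in range(50):
    mean, var = next_generation(mean, var)

print(round(mean, 3))  # 1.0   -> the mean has drifted far from 0
print(round(var, 3))   # 0.077 -> the variance has collapsed
```

The point is not the specific numbers but the shape of the dynamic: a 2% error per generation is barely visible in any single step, yet fifty generations later the model describes a distribution that no longer resembles the original data.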
The last fifteen years of AI progress have been powered by three reinforcing trends: more compute, better model architectures, and ever larger datasets.
The data part of that story now has a ceiling.
When you see AI as a pattern learner whose power is tied to its data, you also see the emerging constraints. And constrained systems are where strategy starts.
—————
I write about Thematic Strategy, a method I developed during my PhD that helps firms leverage drivers of technological and social change to achieve market dominance.
> Subscribe to my newsletter for my latest research.
> Follow me for more like this.


