The “Stochastic Parrot” observation has always intrigued me. What I find most interesting is that the conversation so often focuses on the content being produced rather than the mechanism. The idea of parroting suggests regurgitation, but as we know a model’s parameters take up far less space than its training data, so it cannot be a simple look-up.
This size reduction suggests compression, which raises the question of whether it is lossy or lossless. At the token/word level it is clearly lossy, but that leaves open whether the loss extends to the semantic or higher levels. Looking up the linguistics stack (in this case for written language), the question becomes whether semantic concepts are preserved even while the exact syntax is lost.
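To make the size argument concrete, here is a back-of-envelope sketch in Python. The parameter count, bytes per parameter, and corpus size are purely illustrative assumptions rather than figures for any particular model, but under any plausible choice the weights come out orders of magnitude smaller than the raw training text, so verbatim storage is ruled out.

```python
# Back-of-envelope: could an LLM be a verbatim lookup table?
# All figures below are illustrative assumptions, not measurements.

PARAMS = 70e9            # assumed parameter count (a ~70B model)
BYTES_PER_PARAM = 2      # assumed 16-bit storage per weight
TRAIN_TOKENS = 10e12     # assumed ~10 trillion training tokens
BYTES_PER_TOKEN = 4      # rough average bytes of raw text per token

model_bytes = PARAMS * BYTES_PER_PARAM
data_bytes = TRAIN_TOKENS * BYTES_PER_TOKEN

print(f"Model weights : ~{model_bytes / 1e12:.1f} TB")          # ~0.1 TB
print(f"Training text : ~{data_bytes / 1e12:.1f} TB")           # ~40 TB
print(f"Compression   : ~{data_bytes / model_bytes:.0f}x")      # ~286x
```

Whatever numbers you pick, the ratio lands in the hundreds, which is why the discussion has to move from regurgitation to compression.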

On reflection, this is actually the power of LLMs: the pressure of training to achieve lossy compression forces abstractions. It is these abstractions, at deeper and deeper layers, that provide the shocking capabilities of LLMs (well, they were shocking a couple of years ago).
So are these abstractions something that gets parroted? Perhaps, to some extent. And if so, it’s the sheer depth (height?) of abstraction that gives such a convincing display. It feels like more than tropes being recycled, because some of the tropes are well beyond our ability to recognise. Generalisations that aren’t recognised as such tend to get admired as savvy insights.
But if it’s pattern matching all the way down, that points to two options:
- We add layers, building more abstractions, and end up at the single abstraction to rule all other abstractions. A unified Abstraction of Everything?
- We broaden to include more and more detail, to make the model less lossy, and in doing so incrementally lose the point of a model.
Perhaps this is the challenge of LLMs: we either end up with an unintelligible truth, or our map ceases to be a model. Perhaps the lesson is that we must choose: comprehensible models or complete maps, but never both. And maybe we must accept that humans can only grapple with a narrow band of reality between the two.
And with that in mind, I present Demis and Sam Revisited (with apologies to Lewis Carroll):
“That’s another thing we’ve learned from your Nation,” said Mein Herr, “transformer models. But we’ve carried it much further than you. What do you consider the largest model that would be really useful?”
“About 100 Billion parameters.”
“Only 100 Billion!” exclaimed Mein Herr. “Why, that would only hold the books. We very soon got to 1 trillion by feeding it every witticism on Reddit. Then we tried 10 trillion by giving it every conversation, every sigh, every idle fancy that ever wandered through a human head. And then came the grandest idea of all: we should provide it all the data in existence! We actually made a model of the universe, on the scale of the universe!”
“Have you used it much?” I enquired.
“It has never been booted, yet,” said Mein Herr: “We proposed, for convenience, to replace the universe with it. But there were objections from the farmers, who insisted the real one was still in use. So we now use the universe itself as its own model, and I assure you it does nearly as well.”
