

Knowledge and Intelligence ARE Mutually Exclusive.
Hey everyone 👋,
I have been building toward this post for a long time, longer than any of the others, and I want to start by being honest about why it took so long. The claim I am about to make is uncomfortable. It goes against something that feels intuitive to most people, including to me for most of my intellectual life. But I have reached a point where I cannot keep writing around it, cannot keep softening it with caveats and qualifications, because the evidence has accumulated past the point where softening it is still honest. The claim is this: knowledge and intelligence are mutually exclusive in a very specific and very important sense. Not completely, not forever, not in every possible world. But in the way that matters most for building genuinely intelligent machines, the way that matters for understanding what happens when a language model fails, the way that matters for explaining why scaling alone will never produce real understanding, knowledge and intelligence point in opposite directions, and you cannot have both at the same time without knowing exactly which one you have and where each one ends.
I have circled this idea in almost every post I have written. In "Language is Limited. ASI is Impossible.", I argued that the medium of text is not the medium of reality, and that a system living inside text is forever separated from the world by the compression loss of language. In "LLMs are Useful. LMMs will Break Reality", I argued that equations are more powerful than sentences and that simulation is the real intelligence. In "Mathematical Equations are Multimodal by default", I argued that mathematical structure encodes mechanism in ways that language never can, and that the path to genuine understanding runs through structure rather than through text. In "Training Is an Evil Concept. LMMs Eliminates it Altogether.", I argued that extracting value from human creative output to build statistical models is both morally corrupt and intellectually bankrupt. And in "Genuine Intelligence will never in trillion years emerge from neural networks.", I made the most direct version of the argument yet, showing that the architectural gaps in neural networks are not fixable bugs but defining properties of what those systems are. All of those posts were preparation for this one. This one is the piece that ties them together, the piece that names the relationship between the two things that every AI conversation confuses, and explains why confusing them is not just an intellectual mistake but a practical catastrophe that is unfolding right now, in the systems being deployed in medicine, law, education, and engineering, in the decisions being made about what to trust and what to build.
I know that claim might sound excessive. I know that a lot of what I write sounds excessive to people who are more comfortable with the moderate position, the "both sides" view that says language models are useful tools with limits and we should just be careful and everything will work out fine. But I have been watching this space long enough to know that moderation is often just another word for not looking carefully enough. And when I look carefully at the distinction between knowledge and intelligence, what I see is not a minor technical nuance. I see the single most important conceptual error in the entire field of artificial intelligence, and I am going to spend this whole post trying to make you see it too.
What We Mean When We Say Knowledge
Before I can make the argument, I need to spend some time on the definitions, because I have found that most disagreements about AI come down to people using words differently, and if we do not start from a shared understanding of what knowledge actually is, the rest of the argument will slide past each other without ever connecting. So let me be careful here, more careful than I usually am, because the distinction is subtle enough to lose if I rush it.
Knowledge, in the sense I am using it here, is stored pattern. It is the accumulated residue of past encounters with reality, encoded in some medium that allows it to be retrieved and applied later. A textbook contains knowledge. A database contains knowledge. A trained neural network contains knowledge, in the form of weights that have been adjusted to fit patterns in training data. Knowledge is retrospective. It points backward in time, toward the patterns that already existed in the world and have been captured in some representational form. Knowledge is also inherently limited by the observations that created it. You can only store patterns that your past experience has exposed you to, which means knowledge is always bounded by history, always a function of what has happened, never of what could happen for the first time. Knowledge is the library. It is enormous, it is valuable, it is the source of almost everything that educated humans can do. But it is not intelligence. It is the raw material that intelligence operates on, and treating them as the same thing is the mistake that has corrupted the entire AI conversation.
When a language model is trained, what happens is that the system ingests a very large corpus of text and adjusts its parameters to minimize prediction error across that corpus. The result is a system that has absorbed an extraordinary amount of statistical structure from the text. It has learned that certain words follow certain other words in certain contexts, that certain topics are discussed in certain ways, that certain questions tend to receive certain kinds of answers. All of that statistical structure is knowledge, in the sense I am using the word. It is pattern stored in weights. It is enormously useful. It is what allows the model to produce fluent, contextually appropriate responses across an enormous range of topics. But it is all retrospective. Every pattern the model has learned is a pattern that existed in the training data. Every response the model produces is a recombination and interpolation of patterns from its past. The model has no mechanism for generating understanding that is not already present, at least implicitly, in the distribution of its training data. That is not a temporary limitation that will be overcome by larger models or better data. It is the definition of what knowledge-based systems do, and it applies to every system that learns by fitting patterns to historical data, regardless of how large or sophisticated the system becomes.
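To make "pattern stored in weights" concrete, here is a deliberately tiny sketch: a bigram counter standing in for a language model. It is a toy under loud assumptions. Real models learn continuous parameters by gradient descent rather than keeping a count table, and the names `train_bigram` and `predict_next` are hypothetical, chosen only for this illustration. What it shows is that everything the predictor can say is a function of what it has already seen.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which; the counts are the stored 'knowledge'."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if `prev` was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Illustrative three-sentence corpus (an assumption for the demo).
corpus = ["the apple falls down", "the planet orbits the sun", "the apple falls fast"]
model = train_bigram(corpus)
print(predict_next(model, "apple"))   # falls
print(predict_next(model, "comet"))   # None
```

The count table at least returns nothing for a context it has never seen. A neural model smooths over unseen contexts and always returns some distribution over next tokens, so it has no equivalent of that `None`.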
Now consider what happens when you ask a genuinely knowledgeable person a question in a domain they have studied deeply. They draw on their stored patterns, yes, but they also do something more. They reason. They connect the patterns in new ways. They notice when a question is unlike anything they have seen before and they flag that distinctiveness. They identify the tensions and contradictions within their own knowledge and use those tensions as signals that something needs to be worked out more carefully. They know the limits of what they know. That last capability, knowing the limits of your own knowledge, is one of the most important things a mind can do, and it is something I have written about before, most directly in the post about intelligence never emerging from neural networks. It is something that knowledge-based systems do very poorly, because a system that operates by pattern completion has no principled basis for knowing when it is operating outside the domain of its training data. It just produces the most likely completion and presents it with the same confidence it would have if the question were well inside its training distribution. That systematic overconfidence is not a bug in the training procedure. It is the direct consequence of treating knowledge as intelligence, of assuming that having patterns is the same as understanding when and how to apply them.
Let me also be specific about the different kinds of knowledge, because not all knowledge is equally problematic when it is mistaken for intelligence. Declarative knowledge is knowing that something is true. Procedural knowledge is knowing how to do something. Causal knowledge is knowing why something is true, what mechanism produces it, and what would happen if conditions changed. These three kinds of knowledge are related but distinct, and the distinction matters enormously for AI. Language models are extraordinarily good at declarative knowledge, at knowing that. They are decent at procedural knowledge in domains where the procedures are well-represented in text. They are very poor at causal knowledge, at knowing why, because causal knowledge requires a model of the mechanism that generates the pattern, not just a representation of the pattern itself. And causal knowledge is the foundation of genuine intelligence, because intelligence is fundamentally about figuring out what to do in situations that are new, and figuring out what to do in new situations requires understanding why things happen, not just knowing what usually happens. Research by Judea Pearl has made this distinction precise in the language of causal inference, showing that observational data on its own cannot, in general, determine the effects of interventions: observation teaches associations, while predicting what an intervention will do requires causal structure that has to come from somewhere other than the data (1). That result is not just a theorem in statistics. It is the formal backbone of the claim that knowledge is not intelligence.
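The gap between seeing and doing can be worked through by hand on a three-variable example. The sketch below assumes a hidden common cause Z that drives both X and Y, with X having no causal effect on Y at all. Every probability in it is an illustrative assumption, not data from any study. It computes the association an observer would measure, P(Y=1 | X=x), and the effect an intervener would measure, P(Y=1 | do(X=x)), exactly.

```python
# Structural model (all numbers are illustrative assumptions):
#   Z -> X and Z -> Y; X has NO causal effect on Y.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}   # P(X=1 | Z=z)
P_Y1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(Y=1 | Z=z); independent of X by construction

def p_x_given_z(x, z):
    return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]

def observational(x):
    """P(Y=1 | X=x): conditioning on X also shifts our belief about Z."""
    joint = {z: P_Z[z] * p_x_given_z(x, z) for z in P_Z}
    norm = sum(joint.values())
    return sum(P_Y1_GIVEN_Z[z] * joint[z] / norm for z in P_Z)

def interventional(x):
    """P(Y=1 | do(X=x)): X is set by fiat, so Z keeps its prior distribution.

    The argument x is unused because, in this model, Y does not depend on X.
    """
    return sum(P_Y1_GIVEN_Z[z] * P_Z[z] for z in P_Z)

obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print(f"observational gap: {obs_gap:.2f}")    # 0.48
print(f"interventional gap: {do_gap:.2f}")    # 0.00
```

A system that only ever sees (X, Y) pairs would report a strong relationship, while forcing X changes nothing. Note that `interventional` can only be written down because the causal graph is assumed to be known, which is exactly the extra structure the paragraph above says the data cannot supply.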
Let me also say something about the role of memory in all of this, because memory and knowledge are often conflated, and the conflation makes things worse. Memory is the capacity to store and retrieve specific past experiences. Knowledge is the distillation of patterns across many experiences. Neither is the same as intelligence. A person with perfect memory who can recall every detail of every conversation they have ever had is not thereby more intelligent than a person with average memory, because intelligence is not about storage capacity. It is about the ability to extract structure, to form abstractions, to reason about things that have not been encountered before. The AI community's fixation on scaling training data is implicitly based on a theory that more memory equals more intelligence, that if you store enough patterns from the past, intelligence emerges. But intelligence is not a quantitative property of memory. It is a qualitative property of the process that operates on memory. You could give a language model perfect recall of every text ever written and it would still not be able to discover a new mathematical theorem, because discovering a new mathematical theorem requires generating structure that does not yet exist in the training data, and generation of genuinely new structure is intelligence, not knowledge retrieval.
The specific way that I want to frame this distinction going forward is through the concept of compression. Knowledge is compression of past observations. Intelligence is the capacity to discover new compressions that predict future observations. A language model compresses past text into weights and uses those weights to interpolate over the pattern space it has seen. An intelligent system discovers new structure that was not in any of the observations it has made, structure that, once found, allows it to predict things it has never been trained on. That distinction maps directly onto the distinction between interpolation and extrapolation, and it is the reason that language models work so well inside their training distribution and fail so dramatically outside it. Inside the training distribution, they are interpolating, and interpolation is something that sophisticated pattern matching does very well. Outside the training distribution, they need to extrapolate, and extrapolation requires genuine structural understanding, real intelligence, and that is the thing that knowledge-based systems cannot provide. This is not my invention. It is the consistent finding of decades of research on generalization in machine learning, and it is the key to understanding why scaling does not solve the problem.
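The interpolation and extrapolation contrast can be sketched in a few lines. The example assumes a hidden law y = x², observed only at the integers 0 through 10. The "knowledge" predictor stores those points and interpolates between them, and the "compression" predictor is the law itself. Clamping at the edge of the stored range is just one assumed behavior for a lookup outside its data. Real models fail in other ways, but the dependence on the sampled range is the same.

```python
import bisect

# Observations of a hidden law y = x**2, sampled only on 0..10 (an assumption for the demo).
xs = list(range(11))
ys = [x * x for x in xs]

def lookup_predict(x):
    """'Knowledge': piecewise-linear interpolation over stored points, clamped at the edges."""
    if x <= xs[0]:
        return float(ys[0])
    if x >= xs[-1]:
        return float(ys[-1])
    i = bisect.bisect_right(xs, x)              # xs[i-1] <= x < xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])

def law_predict(x):
    """'Compression': the discovered structure, which needs no stored points at all."""
    return float(x * x)

for x in (2.5, 20.0):
    print(x, lookup_predict(x), law_predict(x))
# 2.5 6.5 6.25
# 20.0 100.0 400.0
```

Inside the sampled range the lookup is off by 0.25. At x = 20 it is off by 300, while the eleven stored numbers have been replaced by a one-line rule that is right everywhere.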
Why Knowledge Masquerades as Intelligence
The most dangerous aspect of the knowledge-intelligence confusion is that it is not an obvious error. Knowledge looks like intelligence from the outside. A system that has absorbed enormous amounts of pattern from the world and can retrieve and recombine those patterns fluently will produce outputs that are often indistinguishable from the outputs of a genuinely intelligent system, especially when the questions you are asking are within the distribution of its training data. This is not a minor caveat. It is the fundamental reason why the AI industry has been able to deceive the public for years about what its systems actually do. When you ask a language model something that millions of people have asked before, in roughly similar form, and the model answers correctly with appropriate confidence, there is no way for you to know from that single interaction whether the answer came from genuine understanding or from pattern retrieval. And in the vast majority of everyday use cases, pattern retrieval is good enough, which means the deception works most of the time, which means most people never encounter the failure mode that would reveal what they are actually dealing with.
I have been thinking about this illusion for a long time, and the best analogy I can find for it is the experienced test-taker who has studied every past exam paper but understands nothing of the underlying subject. If you test this person on questions that are similar in form to the past papers, they will perform excellently. They will look highly competent. Their answers will sound confident and accurate. But if you ask them a question that is structurally identical to a past question but phrased in an unfamiliar way, or that requires combining concepts that were never combined in the study materials, their performance will collapse. What looked like understanding was retrieval. What looked like intelligence was knowledge. And the collapse at the edge of the training distribution is the tell that reveals the difference. The same collapse happens with language models. Ask a well-trained model something within its distribution and it will amaze you. Ask it something genuinely novel that requires reasoning from first principles, and you will get either a hallucination or a non-answer, depending on how the model is configured to handle uncertainty. Researchers have documented this collapse repeatedly across domains from mathematics to logic to commonsense reasoning, and the result is always the same: models that look brilliant on distribution fail dramatically off it (2). That pattern is not a coincidence. It is the predicted behavior of knowledge-based systems, and it is the empirical signature of the knowledge-intelligence distinction in action.
There is also a psychological mechanism by which knowledge masquerades as intelligence, and it is worth naming because it operates on the humans who interact with these systems, not just on the systems themselves. When we receive fluent, confident, contextually appropriate speech from an entity, we are wired to infer that the entity understands what it is saying. This is a deeply reasonable heuristic in a world populated exclusively by humans, because in that world, fluent confident speech is indeed correlated with understanding. But the heuristic breaks down when the entity producing the speech is a system that generates fluency without understanding, and the breakdown is dangerous because we rarely notice it happening. The language model sounds exactly like a person who understands, which means our evolved social cognition tells us to trust it, to defer to it, to treat its outputs as reliable. This is not stupidity on the part of the humans. It is a mismatch between a social cognition that evolved for a world of humans and a technological object that can mimic the surface of human speech without any of its substance. Researchers studying interaction with conversational AI have consistently found that people attribute more understanding to these systems than the systems warrant, and that this attribution causes people to reduce the critical scrutiny they would apply to any human expert (3). That reduction in scrutiny is exactly what makes the knowledge-intelligence confusion dangerous in practice.
I also want to address the specific case of chain-of-thought prompting, because it is often cited as evidence that language models can actually reason, and I think the evidence has been badly misread. Chain-of-thought prompting is a technique where the model is asked to produce its reasoning step by step before giving a final answer. This technique improves performance on many multi-step problems, which was taken by many people as evidence that the model is reasoning. But what the technique most likely does is decompose the task into a sequence of simpler prediction tasks, each of which is closer to the model's training distribution, which means the model can handle each step through pattern retrieval rather than genuine reasoning. The overall improvement comes from making the knowledge-retrieval problem easier at each step, not from enabling genuine reasoning. You can verify this by taking the same problems and rephrasing them in ways that preserve the logical structure but disrupt the statistical regularities that support each step, and performance collapses at exactly the points where the statistical regularities are disrupted. A system that genuinely reasons would continue to perform at the novel reasoning task. A system that is pattern-matching over a decomposed problem shows the dependence on statistical regularities as soon as those regularities are absent. Research has confirmed this distinction (4), and the confirmation matters enormously because it means that the apparently most compelling evidence for reasoning in language models is actually evidence of sophisticated knowledge retrieval, which is what I have been arguing. The masquerade runs deep, and it fools even the researchers who study these systems professionally.
There is yet another way in which knowledge masquerades as intelligence, and it is the one I find most philosophically interesting, which is that knowledge can mimic the form of intelligence without its substance. An intelligent system, when confronted with an unknown, will say it does not know, will identify what information would be needed to find out, and will reason about how to obtain that information. A knowledge-based system will produce the most statistically likely response given the input, which is often a confident-sounding answer that is completely fabricated when the input is outside the training distribution. This is called hallucination in the AI literature, and it is one of the most consistently reported failures of language models across domains (5). Hallucination is not a bug in an otherwise correct system. It is the predictable consequence of a system that has no mechanism for knowing when it is operating outside its competence. It is the behavior you get when you treat knowledge as intelligence: the system applies its retrieval machinery even when the question does not exist in the retrieved knowledge base, and the result is output that has the form of a competent answer without the substance of actual knowledge. The form comes from the training distribution. The content is generated from thin air. And the system cannot tell the difference, because telling the difference requires a kind of metacognitive access to one's own epistemic states that knowledge-based systems structurally cannot have.
Let me say this plainly, because I think it is the most important sentence in this section. Hallucination is not a defect in language models. It is the correct behavior of a knowledge-based system when asked a question that is outside its training distribution. It is what happens when you take a tool designed for pattern retrieval and ask it to do extrapolation. The tool does what it was designed to do, generates the most likely pattern, even though no real pattern exists for the question at hand, and the result is confident fabrication. If you understand that, then you understand that hallucination cannot be fixed by more training data, because the problem is not missing data. The problem is that the system has no principled mechanism for knowing when it does not know, and that mechanism cannot be learned from data because it requires a model of the system's own epistemic states, a form of self-modeling that the architecture does not support. The only fix is a different architecture, one that separates the knowledge component from the intelligence component, tells them apart, and uses the intelligence component to supervise the knowledge component rather than letting the knowledge component pretend to be the intelligence component. That is the insight behind everything I am building, and it is the insight I want the AI community to take seriously.
Intelligence Is Not Knowledge Retrieved
Having established what knowledge is and why it masquerades as intelligence, I now want to spend real time on intelligence itself, because I think most people in the AI conversation do not actually have a clear model of what intelligence is. They have an intuitive sense of it, which is that it involves being smart, producing good answers, passing hard tests, doing impressive things. But that intuitive sense is not specific enough to guide the design of intelligent systems, because any of those things can be faked through knowledge retrieval sophisticated enough to pass the test most of the time. We need a more precise concept of intelligence, one that is sharp enough to distinguish it from knowledge retrieval even when the two produce identical outputs in ordinary cases.
The most useful way I have found to think about intelligence is as the capacity to generate new compressions of reality. A compression in this sense is not a zip file or an encoding scheme. It is a more general concept: a representation that is more compact than the data it describes but that can reconstruct the data, predict new data outside the training set, and generalize to situations that were not present in the observations that created it. When Newton looked at falling apples and orbiting planets and deduced a single law that governed both, he generated a new compression of reality. The data was already there, it had been there for as long as apples fell and planets orbited, but the compression was not there until Newton found it. The compression is not stored in the world. It is generated by intelligence operating on observations of the world. That is the key distinction: knowledge is pattern retrieved from past observations, and intelligence is the capacity to generate new patterns that explain past observations and predict new ones. The two activities are related because you need knowledge to have the raw material for intelligence to operate on, but they are not the same activity, and the difference between them is the difference between retrieval and discovery.
Discovery requires something that retrieval does not: the ability to generate candidate representations that are not already present in the knowledge base, evaluate them against observations, and iteratively refine them until they fit. This is the structure of scientific reasoning, and it is fundamentally active and generative in a way that pattern retrieval is not. When a scientist discovers a new equation, they are not retrieving it from memory. They are generating a hypothesis, testing it, finding it wrong, generating a modified hypothesis, testing that, refining it further, and continuing until the hypothesis fits the data well enough to trust. Each step in that process requires genuine inference, the ability to produce outputs that are not direct functions of the inputs, and that inferential capacity is what I mean by intelligence. It cannot be learned from data the way a pattern can be learned, because learning a pattern is retrieval and the very thing it is trying to learn is generation. You cannot learn how to generate new things by learning from examples of things that have already been generated. The process that generates the examples is intelligence, and you cannot learn intelligence from its outputs any more than you can learn how to invent things by studying inventions. You can learn what things have been invented, which is knowledge, and you can use that knowledge as raw material, but the inventive capacity itself is something different and something deeper.
I want to connect this to the ongoing work on the lmm project that I have been building and describing across my posts, and specifically to the lmm-agent crate that sits at its core. The lmm-agent framework is built entirely around the knowledge-intelligence distinction I have been drawing. It is an equation-based, training-free autonomous agent: no LLM API key, no token quotas, no stochastic black boxes, no weights updated by gradient descent on anyone's text. Instead of pattern retrieval, it implements what I am calling intelligence primitives, five structural properties that replace statistical interpolation with auditable, causal, and motivated cognition. These primitives are: calibrated Bayesian uncertainty via Gaussian belief propagation, compositional axiomatic reasoning that produces auditable forward-chaining proofs, causal counterfactual attribution using Pearl do-calculus interventions, hypothesis formation that ranks candidate new causal edges by explanatory power, and internalized motivational drives including distinct signals for Curiosity, CoherenceSeeking, and ContradictionResolution. Each of these primitives maps directly onto the abstract properties I described as belonging to intelligence rather than knowledge. None of them can be trained in by fitting patterns to text, because each requires a structural mechanism to be present at the architectural level, and that is exactly why I built them that way rather than trying to elicit them from a language model through prompting.
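To give a feel for what "auditable forward-chaining proofs" means in general, here is a minimal sketch of forward chaining with a proof trace. It is not the lmm-agent API and makes no claim about how the crate implements this primitive. The function name `forward_chain`, the rule format, and the example facts are all hypothetical, and it is written in Python purely for brevity.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until nothing new can be derived.

    Returns the set of known facts and a trace recording, for every derived
    fact, which premises it was derived from.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace.append((tuple(premises), conclusion))
                changed = True
    return known, trace

# Hypothetical rules and facts, written as plain strings for the demo.
rules = [
    (("has_mass(apple)", "near(apple, earth)"), "attracted(apple, earth)"),
    (("attracted(apple, earth)", "unsupported(apple)"), "falls(apple)"),
]
facts = ["has_mass(apple)", "near(apple, earth)", "unsupported(apple)"]

known, trace = forward_chain(facts, rules)
for premises, conclusion in trace:
    print(" & ".join(premises), "=>", conclusion)
```

The point of the trace is that every conclusion comes with the exact premises that produced it, so a wrong answer can be traced back to a wrong rule or a wrong fact rather than to an opaque weight matrix.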
Research on systematic generalization has made the distinction between retrieval and intelligence precise in a testable way. Systematic generalization, as defined in the cognitive science and machine learning literature, is the ability to combine known concepts in new ways and derive correct predictions from those combinations without having been trained on them (6). This is the minimal form of intelligence that is clearly distinct from knowledge retrieval, because it requires generating correct outputs for combinations that are not in the training data. When you train a language model on a set of sentences and then test it on recombinations of the same words and concepts in novel structures, performance drops dramatically, even when the logical structure is identical to structures the model has seen and the concepts are all familiar (7). That drop is the empirical signature of the knowledge-retrieval ceiling. The model can retrieve patterns it has seen. It cannot generate the correct pattern for combinations it has not seen, because generation requires intelligence and the model only has knowledge. This failure has been replicated across dozens of architectures, training regimes, and task domains, and it consistently appears at the same boundary: inside the training distribution, performance is high. Outside it, performance collapses. Intelligence does not have that boundary. Knowledge does.
Let me say something that I think is important and uncomfortable. Every benchmark that the AI industry uses to measure progress is, at best, measuring the depth and breadth of knowledge retrieval, and at worst, measuring the degree to which the knowledge retrieval system has been fine-tuned to produce outputs that look like intelligence on that specific benchmark. I wrote about this in "Rethinking ARC-AGI", where I argued that even the benchmarks designed to test genuine fluid intelligence were compromised by the way models could be trained to pattern-match on the benchmark format itself. The problem is that any task that can be represented in text and evaluated by a judge can, in principle, be solved by a sufficiently powerful knowledge retrieval system, because the evaluation criteria are themselves encoded in the statistical structure of how humans write about evaluation. If you want to measure intelligence, you need to measure it on tasks that require generating structures that are not in any training data, and those tasks are very hard to construct and evaluate, which is why the benchmarking community avoids them. But avoiding a hard truth does not make it less true.
Intelligence also has a property that knowledge does not, which is the capacity for genuine epistemic humility. An intelligent system knows when it does not know. It has a model of its own epistemic states that allows it to distinguish between conclusions supported by reliable inference and conclusions that are uncertain or unsupported. This metacognitive capacity is not a personality trait or a design choice that can be trained into a system by fine-tuning. It is an architectural property that must be present at a structural level for the system to reliably identify the limits of its own competence. Research on calibration in large language models consistently finds that these systems are overconfident, presenting uncertain or false information with the same confidence as reliable information (8). That overconfidence is not a bug. It is the expected behavior of a system that maxes out its knowledge retrieval machinery even when the question is outside its training distribution, because the machinery has no switch labeled "outside my domain." The switch would require intelligence to implement. The machinery only has knowledge.
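Overconfidence of this kind is usually quantified with a calibration metric. A common one is expected calibration error: bucket predictions by stated confidence, then compare each bucket's average confidence with its actual accuracy. The sketch below is a minimal version with equal-width bins, and the confidences and outcomes fed to it are made-up illustrative values, not measurements from any model.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical data: a system that reports 0.9 confidence on everything,
# but is right on only 3 of 5 answers.
confidences = [0.9, 0.9, 0.9, 0.9, 0.9]
correct = [True, True, True, False, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))   # 0.3
```

A perfectly calibrated system scores 0. A system that says 0.9 every time, regardless of whether the question is inside or outside what it has seen, is penalized by exactly the gap between that constant and its true hit rate.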
The Separation That Actually Matters
I want to be concrete about what I mean by the mutually exclusive relationship between knowledge and intelligence, because I have been building up to this claim for several sections and I need to state it carefully to avoid misunderstanding. I am not claiming that knowledge and intelligence cannot coexist in a single system. Humans have both, obviously, and the interaction between them is what makes human cognition so powerful. What I am claiming is something more specific and more operationally important: that when you build a system by maximizing knowledge, meaning by training it to store and retrieve as many patterns from the past as possible, you are simultaneously minimizing the pressure on the system to develop genuine intelligence, and the more successfully you do the first thing, the more completely the second thing is displaced. This is the mutually exclusive relationship I have in mind, and it has a very practical consequence: the training paradigm that produces the most capable knowledge-retrieval systems is precisely the paradigm that is least likely to produce genuine intelligence, because success at retrieval removes the need for generation.
Think about what happens during training of a language model. The model receives a sequence of tokens and is asked to predict the next token. The loss function penalizes incorrect predictions and rewards correct ones. The most efficient way for the model to minimize loss is to store statistical structure from training sequences and retrieve it at prediction time. If the model develops genuine inference mechanisms that allow it to derive the correct next token from structural properties of the input, those mechanisms earn no extra credit in exactly the cases where knowledge retrieval already gives the right answer, because the loss function cannot distinguish between arriving at the right answer via retrieval and arriving via genuine inference. The loss function only sees the answer. This means that from the loss function's perspective, there is no advantage to genuine intelligence over knowledge retrieval when retrieval works, and retrieval works on everything in the training distribution. The selection pressure for genuine intelligence therefore applies only outside the training distribution, where retrieval fails, but the training procedure sees no examples from outside the training distribution by definition, which means there is effectively no selection pressure for genuine intelligence at all. The training procedure optimizes knowledge at the expense of intelligence, not because anyone designed it that way, but because knowledge and capability on training data are the same thing in a training regime, and intelligence only matters at the edges where the training data runs out.
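The claim that the loss function only sees the answer can be checked directly. The sketch below sets up two hypothetical predictors for a toy "what is a + b" next-token task: one looks the output distribution up in a table, the other computes it from the structure of the input. Both are illustrative stand-ins, not models of any real system, and the probabilities are assumptions.

```python
import math

def cross_entropy(prob_of_correct_token):
    """Next-token loss for one position: -log p(correct token)."""
    return -math.log(prob_of_correct_token)

# Two hypothetical predictors that reach the same output distribution by different routes.
def retrieval_predictor(context, table):
    return table[context]                      # looks the distribution up

def inference_predictor(context):
    a, b = context
    return {str(a + b): 0.99, "other": 0.01}   # computes the answer from structure

table = {(2, 3): {"5": 0.99, "other": 0.01}}   # the only context seen in "training"
context, target = (2, 3), "5"

loss_retrieval = cross_entropy(retrieval_predictor(context, table)[target])
loss_inference = cross_entropy(inference_predictor(context)[target])
print(loss_retrieval == loss_inference)   # True
```

On the training context the two losses are identical, so a gradient computed from this loss has nothing to prefer in the second predictor. The difference only shows up on a context such as (7, 8), where the table lookup raises a `KeyError` and the structural predictor still answers, and that context is by construction one the training loss never evaluates.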
This is why I said at the beginning that knowledge and intelligence are mutually exclusive in the sense that matters most for building intelligent systems. The training paradigm that produces the most knowledgeable systems is the same paradigm that most completely removes the adaptive pressure for developing genuine intelligence. And the result is exactly what we observe: systems that are extraordinarily capable within the distribution of their training data and shockingly incompetent outside it. That pattern is not a transitional phase on the way to genuine intelligence. It is the predicted endpoint of a training paradigm that optimizes knowledge retrieval, and it will remain the endpoint regardless of how much larger the models become or how much more data they train on. Scaling within a paradigm that optimizes the wrong thing will produce more of what the paradigm produces, not the thing the paradigm is incapable of producing.
I also want to say something about what the right design philosophy looks like, because I think every post I write should point toward something better rather than just eviscerating something that exists. A system designed to produce genuine intelligence rather than knowledge would need to reward generation of new compressions rather than retrieval of old ones. It would need to operate on tasks where knowledge retrieval is guaranteed to fail, where genuine inference from structural properties of the input is the only way to get the answer. It would need to evaluate itself on out-of-distribution generalization rather than on distribution-matching, which means the evaluation criteria would need to be structurally different from the operating procedure rather than identical to it. It would need explicit mechanisms for representing its own epistemic states, knowing when it knows and when it does not, rather than expecting that capability to emerge by tuning on human-labeled examples of confidence. And it would need to ground its processing in something other than text. The lmm-agent framework is my attempt to implement these principles. Its HELM engine, Hybrid Equation-based Lifelong Memory, is a lifelong learning system built entirely on CPU-resident hash maps and floating-point arithmetic: tabular Bellman Q-learning, prototype meta-adaptation via Jaccard similarity, knowledge distillation, self-federated Q-table aggregation without a central server, elastic memory guarding by activation-count pinning, and PMI co-occurrence mining from high-reward observations. No GPU. No neural networks. No external machine learning crates. If that sounds like an unusual set of constraints, it is. The constraints are the point. Each one is a deliberate refusal to fall back on the comfortable tools that produce knowledge rather than intelligence.
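For readers who have not seen tabular Q-learning written out, here is a minimal sketch of the Bellman update over a plain hash map. It is written in Python for brevity and is not the HELM code or its API: the class name `TabularQ`, the method names, and the default `alpha` and `gamma` values are all hypothetical.

```python
from collections import defaultdict

class TabularQ:
    """Hypothetical tabular Q-learner backed by a hash map (illustration only)."""

    def __init__(self, actions, alpha=0.5, gamma=0.9):
        self.actions = list(actions)
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.q = defaultdict(float)   # (state, action) -> estimated value

    def best_value(self, state):
        # Highest estimated value among the actions available in this state.
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, reward, next_state, done=False):
        # Bellman target: immediate reward plus discounted best future value.
        future = 0.0 if done else self.gamma * self.best_value(next_state)
        td_error = reward + future - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
        return td_error

    def greedy_action(self, state):
        # Action with the highest current estimate (ties go to the first listed).
        return max(self.actions, key=lambda a: self.q[(state, a)])
```

Everything here is floating-point arithmetic on a dictionary, which is the property the constraints above are meant to preserve: no GPU and no gradient descent are involved in the update.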
Research on causal inference provides the strongest theoretical foundation I know for the separation I am describing. Pearl's causal hierarchy distinguishes between seeing, doing, and imagining, corresponding roughly to knowledge retrieval, intervention, and counterfactual reasoning (1). A system at the first level can only process observations and extract statistical associations. A system at the second level can predict the effects of actions, meaning it can reason about what will happen if you do something, not just what has happened when things correlated. A system at the third level can reason about counterfactuals, about what would have happened if conditions had been different. Language models are firmly at level one by any honest assessment. They have access to observations in text form and they learn statistical associations from them. They cannot reliably predict the effects of interventions, and they cannot reliably reason about counterfactuals, because both of those capabilities require a causal model, a structural representation of the mechanisms that generate the observations. Building a causal model requires intelligence in the sense I have been using, generating a new compression of reality that captures mechanism rather than just association, and that is exactly the capability that knowledge-retrieval training does not develop. The hierarchy is not an opinion. It is a mathematical framework derived from the formal theory of probability and intervention, and it is the most precise statement I know of why knowledge and intelligence are different things.
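The gap between the first two levels can be shown with exact arithmetic on a three-variable model. The sketch below assumes a made-up structural model in which a hidden factor Z influences both X and Y, and X has a small direct effect on Y. All of the probabilities are invented for illustration. `seeing` computes the observational quantity P(Y=1 | X=x), and `doing` computes the interventional quantity P(Y=1 | do(X=x)) by back-door adjustment over Z.

```python
from fractions import Fraction as F

# Assumed toy model: Z -> X, Z -> Y, X -> Y. All numbers are illustrative.
P_Z = {1: F(1, 2), 0: F(1, 2)}               # P(Z=z)
P_X1_GIVEN_Z = {1: F(9, 10), 0: F(1, 10)}    # P(X=1 | Z=z)

def p_y1(x, z):
    """P(Y=1 | X=x, Z=z): small effect of X, large effect of Z."""
    return F(2, 10) + F(1, 10) * x + F(6, 10) * z

def p_x(x, z):
    """P(X=x | Z=z)."""
    return P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]

def seeing(x):
    """Level one: P(Y=1 | X=x), conditioning on an observed X."""
    joint = sum(p_y1(x, z) * p_x(x, z) * P_Z[z] for z in (0, 1))
    evidence = sum(p_x(x, z) * P_Z[z] for z in (0, 1))
    return joint / evidence

def doing(x):
    """Level two: P(Y=1 | do(X=x)), setting X and averaging over Z."""
    return sum(p_y1(x, z) * P_Z[z] for z in (0, 1))

if __name__ == "__main__":
    print("observed difference:", seeing(1) - seeing(0))
    print("effect of intervening:", doing(1) - doing(0))
```

In this model the observed difference in Y between X=1 and X=0 is 29/50, while the effect of actually setting X is only 1/10. A system that only has the associations cannot recover the second number without the causal structure.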
When Knowledge Fails and Intelligence Has to Step In
Let me bring this into the real world, because abstract arguments without concrete examples are not as honest as arguments that put themselves on the line. I want to talk about specific domains where the knowledge-intelligence distinction is not a theoretical nicety but a practical matter of life and death, because that is where the cost of the confusion is most visible and most urgent.
In medicine, the distinction shows up most clearly in the diagnosis of rare diseases and unusual presentations of common diseases. A physician who has seen many patients develops knowledge, statistical patterns about how diseases present that allow them to make rapid, accurate diagnoses in typical cases. But the genuinely diagnostic work, the work that saves lives when the presentation is atypical, requires intelligence: the ability to reason from mechanism, to understand why the symptoms the patient is showing are inconsistent with the most likely diagnosis, to generate alternative hypotheses and test them against the clinical picture, to recognize the atypical feature that is the tell for a rare condition. This is causal reasoning, not pattern retrieval, and it is the difference between correctly diagnosing a typical case of pneumonia and recognizing the unusual presentation that is actually a rare autoimmune condition masquerading as pneumonia. AI systems for medical diagnosis that learn from training data of typical presentations will inherit the same blind spot as knowledge itself: they will perform well on typical cases and fail precisely at the unusual cases where correct diagnosis matters most. Research has confirmed this pattern empirically across multiple medical AI systems (9), showing that performance on out-of-distribution cases is substantially worse than on in-distribution cases, exactly as the knowledge-intelligence distinction predicts.
In engineering, the distinction shows up when a system encounters a failure mode that was not anticipated in the design specification. An engineer who knows the design in detail knows what the system is supposed to do. But when the system does something unexpected, the engineer needs intelligence to understand why, to trace the causal chain from the failure symptom back to the root cause, to recognize which design assumption was violated and how. This is diagnosis by causal inference, not by pattern retrieval, and it is the reason why good engineers are valuable beyond the knowledge they carry. A knowledge-based AI system can tell you what has failed in similar systems in the past. It cannot tell you why this system is failing in this specific novel way, because the novel failure mode is by definition outside the database of past failures that constitutes its knowledge. And aircraft, nuclear plants, bridges, and medical devices fail in novel ways with consequences that cannot be hedged by pointing to the excellent performance on the training distribution.
In scientific research, the distinction is the entire point. Science is the enterprise of generating new compressions of reality, of discovering equations and models and theories that explain observations and predict new ones. Knowledge in science is the accumulated set of established theories and experimental results. Intelligence in science is the capacity to generate new theories that explain what existing theories cannot, to design experiments that can distinguish between competing hypotheses, to see the pattern in anomalous data that points toward a new understanding. The history of science is the history of intelligence operating on knowledge, producing new compressions that subsume the old ones and extend the reach of human understanding. Language models trained on scientific text have absorbed an enormous amount of scientific knowledge. They can summarize papers, explain established theories, and generate text that sounds like scientific reasoning. But they cannot do science, because doing science requires generating new compressions, discovering structure that was not in any training data, and that requires genuine intelligence. The lmm-agent architecture is my ongoing attempt to implement this capability in code: its HypothesisGenerator takes the residual unexplained variance in a causal graph and ranks candidate new causal edges by explanatory power, proposing hypotheses rather than retrieving patterns. Its CausalAttributor uses Pearl's do-calculus to perform counterfactual interventions on the causal graph, attributing outcomes to root causes rather than surface correlations. Together, these two components enable the agent to ask the question that knowledge-based systems cannot ask: not what has happened before in similar situations, but why is this particular thing happening now, and what would be different if I changed this specific factor.
In education, the distinction surfaces most clearly in the difference between a student who has memorized the course material and a student who has understood it. The memorizing student can answer questions that appear on the exam, which is knowledge retrieval. The understanding student can answer novel questions that combine and extend the concepts from the course in ways that were not explicitly covered, which is intelligence. Every good teacher knows the difference between these two students, even if they cannot always articulate exactly what the difference is. The understanding student has generated compressions from the course material that generalize beyond it. The memorizing student has stored the surface patterns of the course material and can only retrieve them when the question looks like the patterns. AI tutoring systems that evaluate student understanding by testing on questions similar to the training material will consistently overestimate the understanding of memorizing students and, more dangerously, will fail to identify the students who have genuinely mastered the concepts. The same failure mode will appear in any AI system for education that is built on knowledge retrieval rather than on genuine assessment of generalization and transfer.
Let me also connect this to the economic and social critique I have made in previous posts, because the knowledge-intelligence confusion is not just an intellectual error. It is an error with economic consequences that fall disproportionately on people who are already vulnerable. I described in Technology Has Destroyed My Livelihood how the promise of equal opportunity in the tech industry was a lie, and how the extraction of value from engineers and creators funded systems that automated away the jobs of the people who built them. The same pattern is playing out with AI systems that are presented as intelligent but are actually knowledge-based: the people in high-stakes domains, the patients, the engineering clients, the students, the citizens subject to algorithmic decisions, are told that the system is intelligent and therefore trustworthy, and they cannot access the technical argument for why it is not, and so they trust, and when the system fails them in exactly the way that the knowledge-intelligence distinction predicts, they have no recourse and no one takes responsibility. The intellectual confusion is the enabling condition for the economic exploitation, and correcting the confusion is not just an academic project. It is a precondition for holding anyone accountable.
The Role of Equations and Simulation in Real Intelligence
Having separated knowledge from intelligence and shown why the current training paradigm optimizes the wrong thing, I want to spend time on what genuine intelligence looks like in practice, because I have been making mostly negative arguments and I owe the reader a positive vision. The positive vision connects directly to what I argued in Mathematical Equations are Multimodal by default and LLMs are Useful. LMMs will Break Reality, and it runs through the same core idea: the only way to build a system that is genuinely intelligent rather than merely knowledgeable is to build a system that can discover and simulate mathematical structure, not retrieve linguistic patterns.
The reason equations are the right foundation is that equations are not knowledge. They are not stored patterns from past observations. An equation is a generative structure, a compact representation that can produce an infinite range of outputs from a finite specification. Take Maxwell's equations, the four equations that describe all of classical electromagnetism. Those four equations were not retrieved from past data. They were discovered by James Clerk Maxwell through a combination of mathematical insight and physical reasoning that went beyond everything that had been observed before, generating predictions about electromagnetic waves that were confirmed only after his death. The equations are not a summary of past observations. They are a compression of the mechanism that generates those observations, and that compression can generate outputs in domains that were not observed at the time of discovery. That is intelligence: the generation of a new compression that predicts new reality. And it is the polar opposite of knowledge retrieval, which takes past observations and retrieves the most likely interpolation between them.
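For reference, here is how the wave prediction falls out of the equations. This is the modern vector form in SI units, not Maxwell's original notation, and the derivation is the standard textbook one for a region with no charges or currents.

```latex
% Maxwell's equations (differential form, SI units)
\begin{align}
\nabla \cdot \mathbf{E} &= \frac{\rho}{\varepsilon_0} \\
\nabla \cdot \mathbf{B} &= 0 \\
\nabla \times \mathbf{E} &= -\frac{\partial \mathbf{B}}{\partial t} \\
\nabla \times \mathbf{B} &= \mu_0 \mathbf{J} + \mu_0 \varepsilon_0 \frac{\partial \mathbf{E}}{\partial t}
\end{align}

% In vacuum (\rho = 0, \mathbf{J} = 0), take the curl of the third equation
% and substitute the fourth:
\begin{align}
\nabla \times (\nabla \times \mathbf{E})
  &= -\frac{\partial}{\partial t} (\nabla \times \mathbf{B})
   = -\mu_0 \varepsilon_0 \frac{\partial^2 \mathbf{E}}{\partial t^2}
\end{align}

% Using \nabla \times (\nabla \times \mathbf{E})
%   = \nabla(\nabla \cdot \mathbf{E}) - \nabla^2 \mathbf{E}
% with \nabla \cdot \mathbf{E} = 0:
\begin{align}
\nabla^2 \mathbf{E} &= \mu_0 \varepsilon_0 \frac{\partial^2 \mathbf{E}}{\partial t^2},
\qquad c = \frac{1}{\sqrt{\mu_0 \varepsilon_0}}
\end{align}
```

The last line is a wave equation with propagation speed c. Nothing in it is a stored observation. It is a consequence of the structure, which is what I mean by a compression that generates outputs beyond what was observed.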
When I built the lmm-agent framework, the central design question was how to implement the discovery process in code at the level of an autonomous agent operating in an environment. The answer I arrived at was the ThinkLoop: a closed-loop proportional-integral controller that drives iterative reasoning toward a goal by computing Jaccard-error feedback at each cycle. The controller is not searching a space of equations the way a human scientist does, but it is doing something structurally analogous: it runs a loop, measures how far its current state is from its goal, and adjusts its behavior based on the error signal, converging on a solution by iterative refinement rather than by lookup. It is intelligence in the control-theoretic sense, a feedback loop that generates new behavior rather than retrieving stored behavior, and that is the distinction I have been drawing throughout this post. The lmm-agent project is documented at crates.io and at the lmm repository, and I want to be honest that it is still early work, imperfect and full of limitations. But the architecture is grounded in the right principles, which matters more than any current benchmark score.
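To show the shape of such a loop, here is a minimal sketch of a proportional-integral controller driven by a Jaccard error. It is written in Python for brevity and is not the ThinkLoop implementation: `think_loop`, `PIController`, the `propose` callback, the gains `kp` and `ki`, and the idea of using the control output as the number of candidates to request per cycle are all assumptions I am making for illustration.

```python
import math

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class PIController:
    """Textbook proportional-integral controller (hypothetical gains)."""

    def __init__(self, kp, ki):
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def step(self, error):
        # Accumulate the error, then combine proportional and integral terms.
        self.integral += error
        return self.kp * error + self.ki * self.integral

def think_loop(goal, propose, max_cycles=20, tolerance=0.0, kp=2.0, ki=0.5):
    """Iterate until the Jaccard error between state and goal is within tolerance.

    `propose(state, effort)` is a hypothetical stand-in for one reasoning step:
    it returns up to `effort` new items to add to the current state.
    """
    state = set()
    controller = PIController(kp, ki)
    history = []
    for _ in range(max_cycles):
        error = 1.0 - jaccard(state, goal)   # 0.0 means state matches goal
        history.append(error)
        if error <= tolerance:
            break
        # The control output decides how much work to request this cycle.
        effort = max(1, math.ceil(controller.step(error)))
        state |= set(propose(state, effort))
    return state, history
```

A usage example would pass a `propose` function that draws unseen items from a candidate pool. The part that matters is the structure: measure the error, feed it through the controller, act, and measure again, with no table of stored answers anywhere in the loop.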
Simulation is the other half of genuine intelligence, and it is the half that closes the loop between discovery and verification. Once you have an equation, you can simulate it forward in time and compare the predictions to new observations. This comparison is what distinguishes a good compression from a bad one, and it is what makes the process self-correcting in the same way that the scientific method is self-correcting. A knowledge-based system cannot do this loop, because it has no equation to simulate, only patterns to retrieve. It cannot generate novel predictions from a compact structure and test them against new reality. It can only match new inputs to past patterns and interpolate. The simulation loop is the difference between science and commentary, between prediction and description, between genuine intelligence and very sophisticated knowledge retrieval. Research on physics-informed machine learning has shown that incorporating simulation into the learning process produces models that are dramatically more reliable and generalizable than pure data-driven models (10), which is exactly the prediction you would make from the knowledge-intelligence distinction: grounding the learning process in structural simulation rather than pattern retrieval produces systems that generalize better, because simulation is closer to intelligence than retrieval is.
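Here is a minimal sketch of that discover-then-verify loop on synthetic data. Two candidate equations agree on the initial slope, and each is simulated forward with Euler steps and scored against later observations. The candidates, the step size, and the observations (generated from exp(-0.5 t)) are all assumptions chosen for illustration.

```python
import math

def simulate(derivative, x0, t_end, dt=0.001):
    """Forward-Euler integration of dx/dt = derivative(x); returns x(t_end)."""
    steps = round(t_end / dt)
    x = x0
    for _ in range(steps):
        x += dt * derivative(x)
    return x

def prediction_error(derivative, observations, x0=1.0):
    """Mean squared error between simulated and observed values."""
    return sum((simulate(derivative, x0, t) - x_obs) ** 2
               for t, x_obs in observations) / len(observations)

# Two candidate compressions with the same slope at t = 0.
candidates = {
    "dx/dt = -0.5 * x": lambda x: -0.5 * x,   # decay proportional to the state
    "dx/dt = -0.5": lambda x: -0.5,           # constant-rate decline
}

# Synthetic "new observations" at later times, from the assumed true process.
new_observations = [(t, math.exp(-0.5 * t)) for t in (1.0, 2.0, 4.0)]

scores = {name: prediction_error(f, new_observations)
          for name, f in candidates.items()}
best = min(scores, key=scores.get)

if __name__ == "__main__":
    for name, score in scores.items():
        print(name, score)
    print("best candidate:", best)
```

The comparison against new observations is what separates the two candidates. Near t = 0 they are indistinguishable, and only simulating forward exposes the one that has the wrong mechanism.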
I also want to connect this to the multimodal argument I made in my post on equations. A system that has discovered an equation for a phenomenon has not just learned one thing. It has learned a structure that can be rendered in any modality: as a graph, as an animation, as a numerical prediction, as an audio signal, as a physical simulation. All of those outputs are generated by the same compact structure, which means the system has genuine multimodal understanding rather than learned associations between modalities. This is what genuine intelligence looks like in the multimodal domain: not a system that has been trained to align text representations with image representations through statistical association, but a system that has discovered the common mathematical structure that generates both. The LMM framework, as I described in LLMs are Useful. LMMs will Break Reality, is the name I use for the class of systems that operates at this level, learning from multiple modalities to discover the mathematical structure that underlies them all, and it is structurally different from a language model extended with image inputs, because the goal is discovery rather than alignment.
Let me be concrete about what this means for the practical capabilities that people care about. A system that genuinely understands a physical system through its equations can answer questions about that system that are not in any training data, because the equation generates answers, not retrieves them. A system that can simulate the dynamics of a physical process can predict what will happen in scenarios that were never observed, because the simulation runs from the equations, not from the training data. A system that discovers compact mathematical representations of complex phenomena has access to knowledge that is more powerful than any stored pattern, because the compact representation generalizes in ways that stored patterns cannot. These are not small incremental improvements over the current state of the art. They are qualitative differences in the kind of intelligence the system has, and they map directly onto the distinction between knowledge and intelligence that I have been drawing throughout this post.
The most concrete proof I have that the lmm-agent approach is on the right track is an experiment we ran recently. We built an agent called arc-lmm-agent to tackle the ARC-AGI-3 benchmark, which is widely considered one of the hardest tests of genuine fluid intelligence available. The specific environment was the ls20 game, a partially observable grid puzzle with fog-of-war, sequential configuration objectives, and strict step budgets. The agent had no prior knowledge of the game whatsoever. No training data from the game environment. No examples of successful trajectories. No human demonstrations. It entered the environment with nothing except its architectural capabilities: the InternalDrive system generating Curiosity and Incoherence signals in response to novel states, the KnowledgeIndex enabling cross-level strategy transfer by ingesting narrative descriptions of completed levels into a queryable IDF-weighted index, the HELM Q-learning engine shaping exploration toward historically rewarding directions, and a WorldMapGraph for building an internal model of the environment topology as it was discovered. Without any task-specific training, this agent achieved a score of 10.71% on ARC-AGI-3. That might not sound dramatic, but I want you to think carefully about what that number represents. It is a score produced by a system that had never seen the game before, had no gradient-descent-trained weights, and was operating entirely on mathematical structure, causal feedback, and internally generated curiosity signals. It is doing something that is structurally closer to what I am calling intelligence than anything a language model does when it completes a standard benchmark. And we are continuing to improve it, refining the discovery algorithms, the routing policy, and the causal attribution mechanisms, with the goal of pushing toward 100%. 
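For readers wondering what an IDF-weighted index over level descriptions might look like, here is a minimal sketch. It is written in Python for brevity and is not the KnowledgeIndex code: the class name `IdfIndex`, the whitespace tokenizer, the smoothed IDF formula, and the example level descriptions in the usage note are all hypothetical.

```python
import math

class IdfIndex:
    """Hypothetical IDF-weighted keyword index (illustration only)."""

    def __init__(self):
        self.docs = []  # list of (doc_id, set of lowercase tokens)

    def add(self, doc_id, text):
        # Naive whitespace tokenisation is an assumption of this sketch.
        self.docs.append((doc_id, set(text.lower().split())))

    def idf(self, token):
        containing = sum(1 for _, tokens in self.docs if token in tokens)
        # Smoothed IDF: rarer tokens weigh more, and the weight stays positive.
        return math.log((1 + len(self.docs)) / (1 + containing)) + 1.0

    def query(self, text):
        """Return doc ids sharing a term with the query, best match first."""
        terms = set(text.lower().split())
        scored = []
        for doc_id, tokens in self.docs:
            score = sum(self.idf(t) for t in terms & tokens)
            scored.append((score, doc_id))
        # Highest score first; ties broken by doc id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for score, doc_id in scored if score > 0]
```

If three made-up level notes are added, say "rotate the key then reach the door", "collect fuel then reach the door", and "avoid the wall and reach the door", then a query for "rotate key door" ranks the first note highest, because "rotate" and "key" are rare across the notes while "door" appears in all of them.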
The research direction that I am most committed to is the combination of symbolic regression, physics-informed learning, and operator learning into a single system that can observe, discover, and simulate. Symbolic regression provides discovery (11). Physics-informed learning grounds it in physical constraints (12). Neural operators provide the simulation mechanism (13). That loop is what the lmm project is building toward.
Why This Matters for Everything Else
I want to zoom out in this section and talk about why the knowledge-intelligence distinction matters beyond the narrow technical arguments I have been making, because the implications run much wider than AI research. The confusion between knowledge and intelligence is not just a problem in how we build machines. It is a problem in how we think about human cognition, in how we design educational systems, in how we evaluate expertise, and in how we make decisions about whom to trust and what to defer to. Getting this distinction right is not just a prerequisite for building better AI. It is a prerequisite for thinking more clearly about what minds are and what they are for.
Consider how the confusion affects educational practice. Most educational systems are built around knowledge transmission: you learn facts, procedures, and standard problem-solving approaches, and you demonstrate retention of those things on tests. This is useful and necessary. But it systematically conflates knowledge with intelligence, which means that students who are excellent at knowledge retrieval are rewarded regardless of whether they have developed genuine intelligence, and students who have genuine intelligence but poor knowledge retention are penalized regardless of how well they could reason about novel problems. The result is a system that selects for knowledge over intelligence and then wonders why its graduates struggle with problems that do not look like the practice problems. Research in cognitive science has shown for decades that transfer of learning, meaning the ability to apply knowledge to novel domains, is hard to teach and rarely achieved through standard instructional approaches (14), which is exactly the prediction you would make if intelligence is the capacity to generate new compressions rather than a product of accumulated knowledge. If producing intelligence were just a matter of accumulating enough knowledge, transfer of learning would be easy. It is hard, because intelligence and knowledge are different things.
The knowledge-intelligence distinction also matters for how we think about expertise and authority. We tend to defer to experts on the assumption that they have not just knowledge but intelligence, that they can reason about novel cases rather than just retrieve from their experience. But expertise calibration research has consistently shown that experts often perform much worse in novel domains than their knowledge base would suggest, because the novel domain requires generating new compressions while the expert's advantage is in retrieval of trained patterns (15). This is not an argument against expertise. It is an argument for being precise about what kind of expertise is relevant to what kind of problem. A medical expert's knowledge is enormously valuable for typical cases. A medical expert's intelligence, their capacity for novel causal reasoning, is what you need for the unusual cases. These are different things, and conflating them leads to misplaced trust in exactly the cases where that trust is most dangerous.
The implications for AI deployment are the most urgent. When a healthcare system deploys an AI model for diagnosis, what it is deploying is a knowledge-retrieval system. The system may be extraordinarily accurate on cases that resemble its training data. It will be unreliable on cases that require reasoning from mechanism, because reasoning from mechanism is intelligence and the system only has knowledge. The people making deployment decisions often understand this at some level, but they are caught in the same confusion as everyone else: they see the performance on the training distribution and they interpret it as evidence of general capability, not recognizing that performance within the distribution is measuring knowledge while performance outside it would measure intelligence, and the system has never been tested outside the distribution at scale. The consequence is deployment of knowledge at life-and-death stakes under the assumption that it is intelligence, and the failures happen in exactly the cases where correct reasoning matters most, in the unusual presentations that do not match the training distribution. The same pattern applies to AI in law, in hiring, in credit evaluation, in criminal justice, and in every other domain where AI is currently being deployed at scale while being described as intelligent.
I want to connect this to the broader theme that has run through all my posts, which is the question of who bears the cost of technology's failures. In An Empty Life Filled With Constant Suffering, I wrote about how suffering becomes invisible when it is mine but very visible when it can be used to justify someone else's comfort. In Technology Has Destroyed My Livelihood, I argued that technology's benefits flow upward to the people who control the machines while its costs flow downward to the people who depend on them. The knowledge-intelligence confusion is part of this same pattern. The people who deploy AI systems under the description of intelligent bear no cost when those systems fail by being merely knowledgeable. The cost is borne by the patient who received the wrong diagnosis, the engineer whose design was signed off on the basis of AI assurance, the student whose understanding was assessed by a system that cannot tell understanding from memorization. The intellectual confusion enables the economic exploitation, and getting the distinction right is not just a matter of scientific honesty. It is a matter of justice.
Let me end this section with something that I want everyone who builds AI systems to hear. You are not building intelligence. You are building knowledge. That is genuinely valuable, and it would be dishonest for me to pretend otherwise. Knowledge-retrieval systems have improved productivity, enabled new products, and helped millions of people with tasks that would otherwise have been harder. But knowledge-retrieval systems should be deployed under terms that accurately describe what they are: systems that perform very well within their training distribution and cannot be trusted on novel cases that require genuine inference. The moment you describe them as intelligent, or allow them to be deployed in contexts that require intelligence, you are making a claim that is not supported by the evidence, and the cost of that unsupported claim will be paid by the people who trusted you.
Living with the Distinction, and Building Beyond It
I want to end this post the way I have ended the others, by being honest about where I am and what I am actually doing, rather than hiding behind abstractions. I am a person who has spent years building software, thinking about intelligence, watching the AI industry grow into something that I find both impressive and deeply dishonest, and writing on this blog as a way of staying sane through the cognitive dissonance of living at the intersection of a technology I find genuinely exciting and an industry I find genuinely corrupt. I am not a professor. I am not a famous researcher. I am someone who built things, got burned by the industry I built things for, and decided to keep thinking and keep writing because thinking and writing are the only things I trust anymore.
The distinction between knowledge and intelligence is not just a theoretical point for me. It is the organizing principle of every design decision in the lmm-agent framework. The five intelligence primitives exist because calibrated uncertainty, causal attribution, hypothesis formation, axiomatic reasoning, and internalized motivation are the structural properties that intelligence requires and knowledge retrieval does not provide, and if a system does not have them at an architectural level, no amount of training data will give them to it. The HELM engine exists because lifelong learning from experience is how real agents get better at navigating novel environments, and real lifelong learning looks like reinforcement over a Q-table shaped by reward, not like fine-tuning on labeled examples of the right answer. The ThinkLoop PI controller exists because convergence toward a goal through iterative error feedback is what it looks like to reason rather than to retrieve. The CausalAttributor with its do-calculus interventions exists because Pearl's hierarchy tells us that only causal models can correctly predict the effects of actions, and I want the agent to be at level two or three of that hierarchy, not level one. I chose to implement everything in Rust because precision matters, and I chose to avoid gradient descent on text corpora because that path produces knowledge and I am trying to build something else. The ARC-AGI-3 experiment with arc-lmm-agent was the first real test of whether these architectural choices actually produce the kind of out-of-distribution behavior that distinguishes intelligence from knowledge, and the answer was yes, imperfectly and partially, but yes and measurably, which is more than I can say for the systems that everyone else is scaling.
I also want to say something about what I have learned from the entire experience of building this project and writing these posts. The most important thing I have learned is that the people who are most confident about AI are usually the people who have thought about it least carefully, and the people who are most uncertain are usually the ones who have looked most closely. Genuine intelligence is extraordinarily hard to build, harder than making something that looks intelligent, which is already hard. The reason I keep writing about the knowledge-intelligence distinction is that I believe the field is stuck on making things that look intelligent, and until it is prepared to have an honest conversation about the difference between looking and being, it will keep producing systems that fail in predictable ways and then making excuses for why those failures do not matter. They do matter. The families of patients who died because a knowledge-based system failed outside its distribution know they matter. The engineers who trusted AI assurance on a design that failed know they matter. The students who were assessed by a system that could not tell understanding from memorization know they matter. Their experiences are not anecdotes. They are the empirical evidence for what happens when you confuse knowledge and intelligence at scale.
The research direction I am most committed to pursuing is the one that closes the loop between causal discovery and simulation, between building a model of why things happen and testing that model against new reality. The lmm-agent framework is my implementation of this loop as of today, and I am continuously improving it: adding new intelligence primitives, refining the HELM learning algorithms, improving the causal graph construction, and pushing the arc-lmm-agent toward higher performance on ARC-AGI-3 and similar environments. The current 10.71% score on ARC-AGI-3 is a start, not an endpoint, and I believe the architectural approach is the right one to eventually reach 100%, because the agent is improving through genuine structural learning rather than through memorization of past trajectories. It is not finished, and it is not competitive with the state of the art at knowledge tasks, because it is not trying to be. It is trying to do something different: operate effectively in genuinely novel environments using general architectural properties rather than task-specific training. Every point it gains on ARC-AGI-3 without any game-specific knowledge is a point that falsifies the claim that intelligence requires massive training data, and falsifying that claim is the research project I am most committed to.
Finally, I want to speak directly to the researchers who read this blog, because I know some of you are working on exactly the problems I have been describing: causal machine learning, symbolic regression, physics-informed neural networks, world modeling, and the other research directions that are trying to build genuine intelligence rather than sophisticated knowledge retrieval. I see your work. I read your papers. I know the funding pressure, the publication pressure, and the societal pressure to work on the things that look most impressive rather than the things that matter most. I know what it feels like to work on something that is harder, slower, and less immediately impressive than the thing everyone else is building, while watching that thing attract all the attention and all the money. I have been there without even the consolation of being in a well-funded research environment. And the only thing I can say is: keep going. The direction is right. The direction is always more important than the current state, and the current state of the research you are doing is far more promising than most people know.
Till next time 👋!
References
1. Pearl, J., Causality: Models, Reasoning, and Inference, Cambridge University Press, 2009
2. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O., Understanding Deep Learning Requires Rethinking Generalization, International Conference on Learning Representations (ICLR), 2017
3. Lee, J. D. & See, K. A., Trust in Automation: Designing for Appropriate Reliance, Human Factors, 2004
4. Valmeekam, K., Olmo, A., Sreedharan, S., & Kambhampati, S., PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change, NeurIPS 2022 Workshop, 2022
5. Ji, Z., Lee, N., Frieske, R., Yu, T., et al., Survey of Hallucination in Natural Language Generation, ACM Computing Surveys, 2022
6. Lake, B. M. & Baroni, M., Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks, International Conference on Machine Learning (ICML), 2018
7. Fodor, J. A. & Pylyshyn, Z. W., Connectionism and Cognitive Architecture: A Critical Analysis, Cognition, 1988
8. Kadavath, S., Conerly, T., Askell, A., et al., Language Models (Mostly) Know What They Know, arXiv:2207.05221, 2022
9. Zech, J. R. et al., Variable Generalization Performance of a Deep Learning Model to Detect Pneumonia in Chest Radiographs, PLOS Medicine, 2018
10. Raissi, M., Perdikaris, P., & Karniadakis, G. E., Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations, Journal of Computational Physics, 2019
11. Udrescu, S.-M. & Tegmark, M., AI Feynman: A Physics-Inspired Method for Symbolic Regression, Science Advances, 2020
12. Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., & Yang, L., Physics-Informed Machine Learning, Nature Reviews Physics, 2021
13. Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E., Learning Nonlinear Operators via DeepONet Based on the Universal Approximation Theorem of Operators, Nature Machine Intelligence, 2021
14. Barnett, S. M. & Ceci, S. J., When and Where Do We Apply What We Learn? A Taxonomy for Far Transfer, Psychological Bulletin, 2002
15. Kahneman, D., Thinking, Fast and Slow, Farrar, Straus and Giroux, 2011