

All You Have Access To Is Knowledge and Tools; Never Intelligence!
Hey everyone 👋,
In my previous posts I have been building one argument from many different angles, and if you have been reading along, you already know where I keep landing. In Language is Limited. ASI is Impossible., I said that words are not thoughts and that any machine trapped inside symbols is trapped inside a cage it cannot see. In Mathematical Equations are Multimodal by default, I argued that the only honest language for describing reality is mathematics, because equations encode mechanisms while text only encodes descriptions. In Training Is an Evil Concept. LMMs Eliminates it Altogether., I went further and argued that the training paradigm itself is a form of extraction that concentrates value in the wrong hands while pretending to be progress. In Rethinking ARC‑AGI, I showed that even the benchmarks people use to celebrate AI progress are essentially measuring the wrong thing, rewarding pattern matching while calling it reasoning. All of these posts have been getting closer to something I want to say plainly in this one, something I have been circling for a long time without just saying it in the most direct language I can manage. The thing is this: every AI system you have ever used, every assistant, every chatbot, every agent, every copilot, every oracle dressed up in a sleek interface, has access to knowledge and has access to tools, but none of them, not one, has intelligence in the sense that word deserves to carry. I do not mean that as a small technical qualifier. I mean it as the central fact about what modern AI actually is, and I think almost nobody in the mainstream conversation is saying it this directly, and so I am going to say it here, as carefully and as plainly as I can, and let the argument carry its own weight.
I want to be precise about what I mean by each of those three words, because precision is the only thing that separates a real argument from a confident-sounding opinion. Knowledge means stored information, patterns absorbed from data, associations learned between inputs and outputs, facts held in weights or retrieved from an index. Tools means external functions a system can call: a search engine, a calculator, a code interpreter, a database query, an API that returns real-time data. Intelligence, and this is the one that matters, means the capacity to generate new understanding from first principles, to construct a model of the world you have not been given, to reason about what you have never seen using structure that was never stored anywhere. Knowledge is a library. Tools are a set of instruments. Intelligence is the thing that knows which book to look for, and why, and what to do when the book does not exist. Current AI systems have the first two in abundance. They have the third one not at all. That is the argument. Everything else in this post is evidence and elaboration.
What Knowledge Actually Is, and Why It Is Never Enough
I want to start with knowledge because it is the thing people are most likely to confuse with intelligence, and the confusion is understandable because knowledge looks impressive from the outside. When you ask a language model to summarize the entire history of the Roman Empire, and it gives you a fluent, well-organized, apparently comprehensive answer in seconds, it is very hard not to feel like you are talking to something that genuinely knows history. It sounds like it knows. It has the vocabulary, the dates, the names of the emperors, the sequence of events, the causes and consequences, the historiographical debates. A person who knew all of that would be called learned, and that would be a form of praise. But the model's relationship to that information is fundamentally different from the relationship a learned person has to the same material, and the difference is the whole point of this post, and it is not a subtle or philosophical difference that only matters in edge cases. It is a structural difference that shows up every time you ask the system to do something genuinely new.
A person who has studied Roman history has built a model of that world inside their mind. They have connected the economic pressures of the third century to the political instability, connected the political instability to the military dependence on mercenaries, connected that dependence to the erosion of civic identity, and connected that erosion to the eventual fragmentation of the Western Empire. That model was not stored. It was constructed, over time, through a process of active engagement with evidence, through reading, debating, questioning, revising, and testing the model against new information. The model lives as a dynamic structure inside the historian's mind, and the historian can use it to answer questions that were never asked during the learning process, because the model is generative. It can produce new answers from old understanding. This is what I mean when I say knowledge is not intelligence. Knowledge is the raw material that intelligence works on, but the working itself is a separate capacity, and current AI systems have the raw material in vast quantities without having the capacity to work on it in the deep sense.
The technical basis for this claim is well known within the research community, even though it rarely makes it into the popular conversation about AI. Large language models learn to predict the next token in a sequence, which means they learn a compressed representation of the statistical patterns in their training data, and that representation is genuinely extraordinary in its scope and detail. But what the representation encodes is the surface structure of human knowledge as expressed in text, not the underlying conceptual structure that the text was pointing at. Think of it this way. A transcript of every lecture ever given at every university is not the same as the understanding that those lectures were trying to convey. The transcript is the shadow of the understanding, and you can learn a lot from the shadow, and shadows are useful, but a shadow is not the thing itself. Bender and Koller argued in their paper on meaning and form, and Bender and colleagues pressed a related point in the landmark Stochastic Parrots paper, that form, meaning the statistical structure of language, does not determine meaning, meaning the connection between language and the world, and that a model trained to predict form will not automatically learn meaning no matter how large it becomes (1). This is the theoretical foundation of what I am arguing, and it is a solid foundation because it is grounded in a careful analysis of what training on text can and cannot achieve as a matter of principle, not just as an empirical observation about current model behavior.
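To make the next-token objective concrete, here is a deliberately tiny sketch in Python. It is a bigram counter, not a transformer, and the corpus and the function names `train_bigram` and `predict_next` are invented for illustration. A real language model is vastly richer, but the training signal is of the same kind: which token tends to follow which in the text it was given.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which token follows which; that is the whole 'training' step here."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (an assumption for this sketch, not real training data).
corpus = "the apple falls to the ground . the stone falls to the ground ."
model = train_bigram(corpus)

print(predict_next(model, "falls"))    # to
print(predict_next(model, "feather"))  # None
```

The counter can say that "falls" is usually followed by "to", but it has nothing to say about a feather, because nothing in it represents falling itself. It only represents sentences about falling.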
I have seen people respond to this argument by saying that the knowledge stored in these systems is still useful, and that is true, and I am not denying it. I said it in As Engineers, LLMs should pay us for tokens usage, and I stand by it: these systems are useful tools that help people do real work faster. What I am disputing is not their utility but the narrative built around them, the claim that the knowledge they contain is evidence of intelligence, that their fluent retrieval of stored patterns constitutes understanding, and that more knowledge plus better retrieval will eventually add up to genuine thinking somewhere along the way. That narrative is what I am calling out as false, because it conflates the map with the territory, the description with the thing described, the storage with the understanding. A system that can retrieve the fact that gravity causes objects to fall is not the same as a system that understands gravity, and the difference is not a matter of degree. It is a matter of kind. Newton did not merely retrieve the fact that apples fall. He discovered the law that made the falling predictable across every possible case, including cases nobody had ever observed, and that discovery was intelligence operating on knowledge, not knowledge by itself magically becoming intelligence through scale or polish.
Let me be very concrete about where this distinction matters most, because abstract arguments always benefit from landing in a specific place. Medical diagnosis is one of the most important domains in which AI is currently being deployed, and it is a domain that clearly illustrates the difference between knowledge and intelligence. A language model trained on medical literature has access to an enormous amount of clinical knowledge: symptoms, lab values, disease mechanisms, treatment protocols, drug interactions, and outcomes from millions of cases. When given a patient presentation, it can produce a differential diagnosis that looks impressive and often includes the right answer. But producing a list of possibilities from pattern matching is not the same as understanding why this particular patient, with this particular history, this particular constellation of risk factors, and this particular presentation, is more likely to have one disease than another. That reasoning requires a causal model of how diseases produce symptoms, how risk factors modify probability, how the timeline of symptom onset constrains the diagnosis, and how the biological mechanisms interact. A language model does not have that causal model. It has the text that describes that causal model, and those are different things, and the difference can cost lives when the system is wrong in a case that falls outside the patterns it learned. Research published in The Lancet Digital Health has documented that even high-performing AI diagnostic systems show systematic failures on atypical presentations precisely because they are matching patterns rather than reasoning causally (2). The knowledge is there. The intelligence is not.
I also want to say something about the problem of knowledge staleness, because it is another dimension of why knowledge alone is never enough. A language model's knowledge has a cutoff date, after which the world has changed but the model has not. That cutoff problem is often presented as a technical limitation that retrieval augmented generation can fix, and that framing misses the deeper issue. Even with perfect retrieval of up-to-date information, the problem of integration remains. New knowledge does not automatically incorporate itself into a coherent understanding of the world. A person who learns that a new treatment has been shown to work for a disease they know well immediately begins the process of updating their causal model, asking how this treatment interacts with the mechanisms they already understand, what it implies for patients they are currently treating, and what questions it raises about the underlying biology. That integration is intelligence at work, and no amount of retrieved text performs it automatically. The retrieved text is still just more raw material, and raw material without the capacity to process it intelligently is not intelligence. It is a pile. A very large, very well-organized pile, but still a pile, and piles do not think.
I want to close this section with something personal, because I promised myself I would always connect these arguments to lived experience, and I think it matters here. When I was struggling to find work after technology destroyed the career path I had been building, which I described in Technology Has Destroyed My Livelihood, one of the things I did was read a huge amount, about systems design, about machine learning, about distributed systems, about everything I thought I needed to know to get a job in the industry that had displaced me. I had knowledge. I was accumulating it as fast as I could. But knowledge without the intelligence to apply it in a context-sensitive, adaptive, creative way felt like carrying bricks without knowing how to build anything. I could answer trivia questions about distributed consensus algorithms and still have no idea how to think about the specific design problem sitting in front of me in an interview. The knowledge was necessary but nowhere near sufficient. What the interviewers were testing, whether they articulated it this way or not, was whether I could think with the knowledge, not just about it. Current AI systems have the same limitation at a structural level, and the people who build them know it, and the people who deploy them are slowly discovering it, and the people who use them will eventually notice it in the moments that matter most.
What Tools Actually Are, and Why They Cannot Give Intelligence to Their User
The second half of the title of this post is about tools, and I want to give tools their proper treatment, because tools are often presented as the solution to the intelligence gap I described in the previous section. The logic goes like this: even if a language model cannot reason causally by itself, if you give it access to a calculator, it can do arithmetic; if you give it access to a search engine, it can look up current information; if you give it access to a code interpreter, it can verify whether its reasoning is correct by running the code. This is the foundation of the agent paradigm that has dominated AI discourse for the past 2-3 years, where a language model sits at the center of a system that can call tools, take actions, observe results, and iterate. And I want to be honest that this paradigm has produced some genuinely impressive demonstrations. But impressive demonstrations are not intelligence, and I want to explain precisely why tools cannot bridge the gap, because the explanation is important and it is being almost entirely missed in the current conversation.
Tools extend what a system can do, but they do not change what the system fundamentally is. This is a principle that applies to humans as well as to machines, but in humans, there is always an intelligent agent deciding which tool to use, when to use it, how to interpret what the tool returns, and what it means in the context of the overall problem. When I use a calculator, the calculator does not make me a mathematician. It removes the burden of arithmetic so my intelligence can focus on the mathematical structure of the problem. The intelligence is still mine. The tool serves it. When a language model uses a calculator, the situation is structurally different, because there is no intelligence directing the use of the tool from outside the statistical process. There is only the statistical process trying to decide which tool to call based on patterns in its training data about when people typically use calculators. That call may often be correct, and when it is correct, the output looks intelligent. But correct and intelligent are not the same thing, and the system's inability to tell the difference between cases where the tool is appropriate and cases where it looks appropriate but produces the wrong answer is the clearest symptom of the intelligence gap. A truly intelligent system knows why it is using a tool. A system without intelligence knows only when tools tend to be used.
This distinction becomes catastrophically important in agentic settings, where a language model is given a long-horizon task, access to multiple tools, and the responsibility to plan a sequence of actions over time. Research from Stanford and elsewhere has shown that even the most capable language model agents fail systematically on tasks that require more than a few steps, that involve genuine novelty, or that require the agent to recognize when its current approach is wrong and to backtrack and try something different (3). These failures are not random. They follow predictable patterns tied to the statistical nature of the planning process. The agent tends to continue in the direction that looks most locally plausible rather than stepping back to reconsider the global strategy, because local plausibility is what next-token prediction optimizes for, and global strategic reasoning requires a kind of self-monitoring that statistical pattern matching does not naturally support. It can call a search engine and retrieve results, but whether it correctly identifies which retrieved result is actually relevant to the problem it is trying to solve, and why, and what it implies for the next step, is a matter of intelligence that the tool use does not provide. The tool returns a result. The system has to understand the result. And understanding is the thing that is missing.
I want to talk about function calling specifically, because it is the technical mechanism through which most AI tool use happens, and it is worth being precise about what it actually does. When a language model is given a set of function specifications and a user query, it learns to recognize patterns in the query that suggest which function should be called with which arguments, and to format the output accordingly. This is a genuine and useful capability, but from a cognitive standpoint it is essentially a very sophisticated keyword matching system that has learned to translate natural language requests into structured API calls. The intelligence that designed the functions, that decided what functions should exist, that determined what the functions should do, that understood how they relate to each other and to the domain they serve, that intelligence came from humans, and it lives in the function specifications, the documentation, and the structure of the API. The language model is calling functions that humans built using human intelligence. It is not exercising its own intelligence to figure out what functions should exist. When the right function exists and the query is within the distribution of what the model has seen during training, it works well. When the right function does not exist, or when the query requires combining functions in a genuinely novel way, or when the functions return unexpected results that require creative interpretation, the system has no recourse except to produce statistically plausible-sounding text, which may or may not be correct, and which the system has no reliable way to verify. That is not intelligence augmented by tools. That is pattern matching that happens to have some tools attached to it.
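A minimal sketch of that division of labor, assuming a host program that receives a structured call as JSON text and runs the matching function. The tool names, the `dispatch` helper, and the two hard-coded strings standing in for model output are all hypothetical. No real vendor API is being reproduced here.

```python
import json

# Human-written tools: the design decisions live here, not in the model.
def get_weather(city: str) -> str:
    return f"(stub) weather report for {city}"

def add(a: float, b: float) -> float:
    return a + b

TOOLS = {"get_weather": get_weather, "add": add}

def dispatch(model_output: str):
    """Parse a structured call emitted by a model and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # No suitable function exists, so the host can only report the failure.
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**call["arguments"])}

# Stand-ins for what a model might emit (assumed, not captured from a real system).
print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
# {'result': 5}
print(dispatch('{"name": "integrate", "arguments": {"expr": "x**2"}}'))
# {'error': 'unknown tool: integrate'}
```

Everything that makes the first call succeed was decided by whoever wrote `add` and registered it in `TOOLS`. The second call shows the other half of the point: when the needed function was never written, the only honest outcome is an error.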
I also want to address the retrieval augmented generation paradigm specifically, because it is strongly associated with the idea that tools can fix the knowledge limitations I described in the previous section. Retrieval augmented generation works by embedding a query, searching a vector store for documents that are semantically similar to the query, and providing those documents as context to the language model so it can generate a response grounded in specific retrieved information rather than relying purely on its training data. This is genuinely useful in many practical settings, and I do not deny that it improves reliability compared to generating from the model's weights alone. But it has two deep limitations. The first is that the retrieval step uses semantic similarity, meaning it retrieves documents that look like the query, not documents that are causally relevant to answering the question correctly. Good retrieval requires understanding what evidence is needed to answer a question, not just finding text that sounds related, and that understanding is a matter of intelligence that the embedding-based retrieval system does not possess. The second limitation is that even after retrieval, the model must synthesize the retrieved information with its prior knowledge, reason about what the retrieved documents actually say versus what they imply, identify contradictions between sources, and construct a coherent answer. That synthesis and reasoning is where intelligence is required, and the retrieval step does not provide it. Tools can find the books. They cannot read them intelligently.
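The retrieval step can be sketched in a few lines. This version assumes word counts as a stand-in for embeddings. Production systems use learned dense vectors that capture far more than word overlap, so treat this as an illustration of ranking by similarity and not as a description of any particular system. The two documents and the query are invented.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag of lowercase words (real systems use dense vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Rank documents by similarity to the query and return the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "the bridge collapse was shown on the evening news",     # sounds like the query
    "wind induced resonance caused the structural failure",  # actually explains it
]
print(retrieve("why did the bridge collapse", docs))
# ['the bridge collapse was shown on the evening news']
```

The ranking rewards the document that shares words with the question, not the one that answers it. Better embeddings narrow this gap, but the ranking criterion is still resemblance to the query.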
The environmental and systemic costs of the tool-augmented AI paradigm deserve attention here, because they are real and they are being paid by people who did not choose to pay them. Running an agentic AI system that makes dozens or hundreds of API calls in the course of handling a single user request consumes computational, network, and financial resources at a rate that is dramatically higher than running a simple retrieval system. Those costs are not inherent to providing the tool-augmented capability. They are inherent to the specific architecture of a statistical language model trying to simulate intelligence by calling tools repeatedly. In Training Is an Evil Concept. LMMs Eliminates it Altogether., I noted that the environmental cost of training large models has been extensively documented in research by Strubell and colleagues (4). The cost of inference at scale is less frequently discussed but equally real, and when that inference is being done repeatedly by a system that is trying to compensate for its lack of intelligence by calling tools over and over until it finds something that works, the waste is structural and not incidental. A truly intelligent system would figure out which tool to call, call it once, and understand the result. A statistical system trying to simulate intelligence uses tools the way someone trying to remember a phone number they have forgotten might keep guessing digits, iterating through plausible combinations, hoping eventually to land on the right one. That is not intelligence using tools. That is the absence of intelligence being masked by repeated tool use.
I want to say something here that connects to my personal experience, because the tools question is deeply personal for me in a way that most people would not expect. When I was building the systems I described in my earlier posts, I spent years learning how to use sophisticated tools. Version control, profilers, debuggers, distributed tracing systems, load testers, all of the instruments of modern software engineering. And I learned something important that I think applies directly to the AI tool use question: a tool in the hands of someone who does not understand what they are trying to accomplish is not just useless. It is actively dangerous. It produces outputs that look like results, that can be formatted and reported and presented as evidence, but that are actually noise. It gives false confidence. I have seen junior engineers use profiling tools to identify bottlenecks and act on the results without understanding whether the identified bottleneck was actually the source of their performance problem, and the result was often that they optimized the wrong thing while the real problem remained untouched. The tool had done its job. The intelligence to interpret the tool's output was missing. That is exactly the situation with AI systems that call tools today. The tools do their jobs. The intelligence to interpret what those jobs actually mean for the problem at hand is not there.
The Intelligence That Was Never Stored
The previous two sections established that knowledge and tools are real, useful, and genuinely present in modern AI systems, but that neither of them is intelligence. This section is about what intelligence actually is, because you cannot argue that something is missing without describing what the missing thing is, and describing what intelligence is happens to be one of the hardest problems in all of cognitive science. I am going to try to describe it anyway, not because I have solved the hard problem of consciousness or because I have a complete theory of mind, but because I think there is enough common ground in cognitive science, neuroscience, and philosophy to make the case clearly enough to support the argument I am building. And I want to use simple words the whole time, because complex vocabulary is often a way of hiding from difficult ideas, and I do not want to hide from this one.
Intelligence, at its most basic, is the capacity to build internal models of the world that are generative, meaning they can produce new predictions about situations that were never directly experienced, and adaptive, meaning they update when the predictions turn out to be wrong. It is the capacity to abstract from specific observations to general principles, and then to apply those principles to new specific situations that were not part of the original abstraction. It is the capacity to recognize when a current approach is failing and to switch strategies without being told to switch. It is the capacity to formulate questions that have never been asked, because recognizing that a question needs to be asked is itself a product of understanding the domain well enough to notice what is missing. It is, at a deeper level, the capacity to be genuinely surprised by the world, because surprise requires a model of what was expected, against which the actual outcome can be compared and found to be different. None of these capacities are properties of stored knowledge, and none of them are properties of tools. They are all properties of a process, an ongoing, dynamic, self-correcting process of building, testing, and revising models of reality. Cognitive scientists call this process model-based reasoning, and a large body of research establishes that it is distinct from the kind of pattern matching that characterizes both animal conditioning and artificial neural network learning (5).
The most important property of intelligence, the one that most clearly separates it from knowledge retrieval, is what researchers call systematic compositionality. Humans can take a finite set of known concepts and combine them in an infinite number of novel ways to produce new thoughts. If you understand what a dog is, and you understand what purple is, and you understand what a mountain is shaped like, you can immediately form a coherent mental image of a purple dog sitting on top of a mountain, even though no text you ever read described exactly that scene. You can then reason about how people would react to such a scene if they encountered it, how you might photograph it, what the lighting would look like at sunset, what the dog's fur would feel like at altitude. None of that reasoning requires retrieving a stored description. It requires composing known concepts in a new configuration and then running your model of the world forward to produce new predictions. Language models fail at systematic compositionality in ways that have been carefully documented (6). When tested on tasks that require combining known concepts in configurations that differ from training examples, their performance drops dramatically, while human performance stays consistent because humans are using a compositional generative model rather than pattern matching to a database of seen combinations. This is not a data problem. It is not a scale problem. It is a structural problem: statistical pattern matching over tokens does not naturally produce compositional generative representations, and no amount of training data changes that structural fact.
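Here is a toy sketch of how a held-out-combination test is set up. Neither function is a model of a language model or of a human mind. They are two extremes, a pure lookup table and a pure composition rule, and every name in the block is invented for illustration.

```python
from itertools import product

colors = ["red", "green", "purple"]
animals = ["dog", "cat", "bird"]

all_pairs = set(product(colors, animals))
held_out = {("purple", "dog")}   # never shown during "training"
train = all_pairs - held_out

def describe(color, animal):
    return f"a {color} {animal}"

# Extreme 1: memorize every combination that was seen.
memorized = {pair: describe(*pair) for pair in train}

def lookup_model(color, animal):
    return memorized.get((color, animal))

# Extreme 2: know the parts and the rule for combining them.
def compositional_model(color, animal):
    if color in colors and animal in animals:
        return describe(color, animal)
    return None

print(lookup_model("red", "dog"))             # a red dog
print(lookup_model("purple", "dog"))          # None
print(compositional_model("purple", "dog"))   # a purple dog
```

Both functions have seen "purple" and "dog" separately. Only the second can do anything with the pair, which is the property that compositional generalization benchmarks are designed to probe.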
I also want to talk about what researchers call the frame problem, because it is one of the oldest and most stubborn problems in artificial intelligence, and it is directly relevant to why intelligence cannot be reduced to knowledge plus tools. The frame problem, originally identified in the context of symbolic AI and later shown to be equally relevant to connectionist systems, is the problem of knowing what changes when something happens and what does not change (7). When you push a coffee cup across a table, you do not have to explicitly reason about the fact that the color of the walls has not changed, or that the gravitational constant is still the same, or that the laws of physics still apply. You take all of that for granted because your intelligence has an implicit model of the world that marks what is relevant to the current action and what is not. A system without that implicit model has to either reason about everything explicitly, which is computationally intractable, or make assumptions about what is relevant based on statistical priors, which leads to systematic failures whenever the situation differs from the training distribution. Language models handle the frame problem statistically: they learn what kinds of things tend to change together in the text they were trained on, and they generate outputs consistent with those learned associations. This works within the training distribution and fails outside it, exactly as the research on compositional generalization predicts. Intelligence has a solution to the frame problem. Knowledge does not. Tools do not. Only the generative model of the world that intelligent beings build from direct engagement with reality provides a principled way to know what changes and what stays the same, because that model encodes the causal structure of the world rather than the statistical associations in descriptions of the world.
Let me connect this to something I said in Mathematical Equations are Multimodal by default, because that post was building toward the same point from a different direction. I argued there that equations encode mechanisms rather than descriptions, and that this encoding is what gives them their power to generate predictions in multiple modalities from a single compact representation. What I want to add here is that this property of equations is exactly what a generative model of the world needs. An intelligent system that has discovered the differential equation governing a physical process does not need to store examples of what that process looks like. It can generate new examples and predict outcomes by running the equation forward, it can evaluate interventions by modifying variables in the equation and running it again, and it can check its predictions against new observations. That is intelligence at work, and it is specifically the kind of intelligence that knowledge retrieval and tool use cannot produce, because it requires the internal model to be mechanistic and generative rather than associative and retrieval-based. The equation is not stored knowledge. It is a compressed theory of how the world works in a specific domain, and theories are the products of intelligence, not the inputs to it.
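As a small worked example, take Newton's law of cooling, dT/dt = -k (T - T_env), which is my choice of illustration and not an equation from the earlier post. The sketch below runs it forward with a basic Euler step, changes one variable to simulate an intervention, and checks the result against the known closed-form solution. The parameter values are arbitrary.

```python
import math

def simulate_cooling(t0, ambient, k, dt=0.01, steps=1000):
    """Euler-integrate dT/dt = -k * (T - ambient) and return the full trajectory."""
    temps = [t0]
    for _ in range(steps):
        t = temps[-1]
        temps.append(t + dt * (-k * (t - ambient)))
    return temps

# Run the mechanism forward: no stored examples of cooling curves are needed.
baseline = simulate_cooling(t0=90.0, ambient=20.0, k=0.5)

# Intervention: move the same cup into a colder room and run it again.
intervened = simulate_cooling(t0=90.0, ambient=5.0, k=0.5)

# Check the prediction against the closed-form solution at time = steps * dt = 10.
exact = 20.0 + (90.0 - 20.0) * math.exp(-0.5 * 10.0)

print(round(baseline[-1], 2), round(exact, 2), round(intervened[-1], 2))
```

Three lines of mechanism produce a trajectory for any starting temperature, any room, and any cooling rate, including combinations nobody has written down. That is the sense in which an equation is generative and a collection of descriptions is not.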
I know there will be people who say that large language models show signs of emerging reasoning ability, that they can solve novel math problems, draw analogies, and demonstrate knowledge transfer that suggests something more than pure pattern matching is happening. I take these claims seriously because the researchers making them are often serious people, and I want to engage with them honestly rather than dismissing them. But I think the evidence, when examined carefully, consistently shows that what looks like reasoning is in most cases very sophisticated pattern completion. The argument that I find most compelling comes from work on symbolic reasoning tasks by Marcus and colleagues, and separately from the work on large language model failures by Dziri and colleagues, both of which show that model performance on reasoning tasks is highly sensitive to surface features of the problem presentation in ways that true reasoning should not be (8). If a model had genuinely reasoned its way to an answer, rephrasing the problem in a different surface form should not change the answer, because the reasoning would be operating on the underlying structure rather than the surface pattern. But in experiment after experiment, that is exactly what happens: changing the surface form changes the answer, revealing that the model was matching to patterns in its training data rather than reasoning from a model of the problem's underlying structure. This is the fingerprint of pattern matching, not intelligence, and it is visible in the data if you know where to look.
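The rephrasing test is easy to express as a harness. The sketch below assumes a `solver` is any function from a question string to an answer. The two toy solvers are hypothetical stand-ins written for this example, not models of any real system, and they do not reproduce the cited experiments.

```python
import re

def consistent(solver, paraphrases):
    """True if the solver gives the same answer for every phrasing of one problem."""
    answers = {solver(p) for p in paraphrases}
    return len(answers) == 1

def surface_solver(question):
    """Toy stand-in that depends on wording: it only answers if it sees 'plus'."""
    nums = [int(n) for n in re.findall(r"\d+", question)]
    return sum(nums) if "plus" in question else None

def wording_invariant_solver(question):
    """Toy stand-in that ignores wording and adds whatever numbers are present."""
    nums = [int(n) for n in re.findall(r"\d+", question)]
    return sum(nums)

# Three surface forms of the same underlying problem.
paraphrases = [
    "What is 3 plus 4?",
    "Add 3 and 4.",
    "If you have 3 apples and get 4 more, how many do you have?",
]

print(consistent(surface_solver, paraphrases))            # False
print(consistent(wording_invariant_solver, paraphrases))  # True
```

The harness does not care how a solver works inside. It only asks whether the answer survives a change of wording, which is the property the cited studies measure.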
The neuroscientific perspective adds another layer of evidence that is worth considering here. Human intelligence is not just a property of the neocortex doing something that looks like language processing. It is distributed across a vast system that includes sensorimotor representations, embodied predictions, emotional signals that carry information about risk and relevance, episodic memory that preserves the specific context of past events, and a default mode network that keeps running simulations of past and future situations even when no external task is being performed (9). A language model trained on text interacts with none of this biological substrate. It receives text, processes it through transformer layers, and produces text. The richness of the representations available to the human mind, representations shaped by years of embodied experience in a physical world that pushes back, that has real consequences for wrong predictions, that produces pain when you touch something hot and satisfaction when you solve a real problem, none of that richness is present in the statistical weights of a language model. I said in Language is Limited. ASI is Impossible. that the brain is not a text machine, and what I am adding here is that intelligence is not a text property. It is a property of a certain kind of dynamic, embodied, feedback-coupled engagement with a real world, and language models, by design, have none of that engagement. They have the text that humans produced from that engagement. That is not the same thing. That will never be the same thing regardless of how large the model gets.
Why Retrieval Is Not Reasoning, No Matter How Fast It Is
One of the most powerful illusions in the current AI conversation is the conflation of retrieval speed with reasoning ability. When a language model produces an answer in seconds that would take a human expert minutes or hours to formulate, the speed feels like evidence of superior intelligence. This reaction is understandable and I have felt it myself. But speed of retrieval is not the measure of intelligence, and this confusion is worth confronting directly because it shapes so many of the most popular claims about what current AI systems can do. A search engine can return millions of results in milliseconds, and nobody claims that the search engine is intelligent. The speed is a property of the indexing structure and the hardware, not of any reasoning process. Language models retrieve from a much richer index, one that stores not just documents but compressed patterns of association between ideas, and they produce their retrievals in fluent prose rather than ranked links, which makes the retrieval feel much more like thinking. But the underlying operation is still fundamentally retrieval, and the felt quality of the output tells us nothing about whether the process that produced it is intelligent.
I want to be precise here because I think the argument requires precision to be convincing. When a human expert produces an answer to a complex question, the process involves more than retrieving a stored answer. It involves identifying which parts of the question are familiar and which are novel, constructing a representation of the question's structure that allows the relevant knowledge to be brought to bear, reasoning about how the relevant knowledge connects to the specific question, checking the emerging answer against constraints imposed by other things the expert knows, and often revising the answer as the reasoning process produces unexpected implications. All of that is intelligence operating on knowledge. The output may look like a stored answer because the expert has answered similar questions before, but the process that produced it is generative and adaptive, not purely a matter of retrieval. When a language model produces a similarly fluent answer to a similarly complex question, the process is much closer to pattern completion over learned associations. The model has not identified the novel aspects of the question and reasoned about them specifically. It has found the region of its learned representation space that is most consistent with the input and sampled from the distribution of outputs associated with that region. When the question is similar to things in the training data, this produces impressive-looking results. When the question is genuinely novel in its structure, the results degrade in characteristic ways that reveal the retrieval-like nature of the underlying process.
The benchmark results that are most often cited as evidence of AI reasoning ability deserve scrutiny here, because they are consistently misinterpreted in the public conversation. When a language model achieves a high score on a standardized reasoning benchmark, the natural interpretation is that the model has learned to reason in the way the benchmark was designed to test. But performance on a standardized benchmark can be achieved either through genuine reasoning ability or through having seen enough examples of the benchmark during training to pattern-match to the test questions. Researchers have repeatedly shown, by constructing modified versions of popular benchmarks that preserve the underlying reasoning structure while changing the surface form, that model performance drops dramatically on the modified versions while human performance stays consistent (10). This is the smoking gun. If the model had learned to reason, it would transfer that reasoning to the modified version, because the structure is the same. The fact that it does not transfer reveals that the performance was achieved through pattern matching to the training distribution rather than through genuine reasoning ability. I described something similar in Rethinking ARC‑AGI, where I showed that ARC-AGI version 1 was undermined by brute-force search precisely because the evaluation was measuring something that could be achieved without genuine reasoning, and the same principle applies to most of the benchmarks currently used to evaluate AI progress.
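The modified-benchmark test described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not any published evaluation protocol: `make_memorizer` and `structural_solver` are hypothetical stand-ins I invented for the two extremes, and a real language model is of course not a verbatim lookup table. The point is only to show what the measurement looks like: same underlying structure, different surface form, and the accuracy gap between the two.

```python
import re

# Toy "benchmark": word problems whose underlying structure is a + b.
ORIGINAL = [
    ("Tom has 3 apples and gets 4 more. How many apples does Tom have?", 7),
    ("Ann has 5 pens and gets 2 more. How many pens does Ann have?", 7),
    ("Raj has 6 coins and gets 9 more. How many coins does Raj have?", 15),
]

# Same structure, different surface form (new names, objects, phrasing).
PERTURBED = [
    ("Starting with 3 marbles, Zed receives 4 extra. Total marbles?", 7),
    ("Starting with 5 stamps, Ivy receives 2 extra. Total stamps?", 7),
    ("Starting with 6 shells, Bo receives 9 extra. Total shells?", 15),
]

def make_memorizer(training_items):
    """Hypothetical pattern matcher: answers only questions seen verbatim."""
    table = dict(training_items)
    def answer(question):
        return table.get(question)  # None when the surface form is unfamiliar
    return answer

def structural_solver(question):
    """Hypothetical solver that works on structure: add the two quantities."""
    a, b = (int(n) for n in re.findall(r"\d+", question)[:2])
    return a + b

def accuracy(model, items):
    """Fraction of items the model answers correctly."""
    return sum(model(q) == y for q, y in items) / len(items)

memorizer = make_memorizer(ORIGINAL)
for name, model in [("memorizer", memorizer), ("structural", structural_solver)]:
    orig_acc = accuracy(model, ORIGINAL)
    pert_acc = accuracy(model, PERTURBED)
    print(f"{name}: original={orig_acc:.2f} perturbed={pert_acc:.2f} "
          f"gap={orig_acc - pert_acc:.2f}")
```

Both toy systems score perfectly on the original items, so the original score alone cannot tell them apart. Only the gap on the perturbed set separates them, which is why the perturbed comparison carries the evidential weight in the argument above.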
I also want to address chain-of-thought prompting specifically, because it is often presented as evidence that language models can reason when given the opportunity to show their work. Chain-of-thought prompting involves asking a model to produce a sequence of reasoning steps before giving a final answer, and it does improve performance on many tasks compared to asking only for the answer directly. I want to be careful here because I think the evidence is genuinely nuanced, and oversimplifying it would make my argument weaker rather than stronger. Chain-of-thought prompting improves performance on tasks within the training distribution because producing reasoning steps is itself a pattern that the model has learned to reproduce, and that pattern, when reproduced, tends to activate the regions of representation space that contain the right answer. It is, in a meaningful sense, a productive pattern to reproduce. But research by Lanham and colleagues has shown that the chain-of-thought steps produced by language models are often not causally connected to the final answer in the way that genuine reasoning steps would be (11). In experiments where the intermediate reasoning steps are deliberately made incorrect while keeping the problem the same, models often still produce the correct final answer, which reveals that they were not actually using the intermediate steps to compute the answer. They were producing plausible-sounding reasoning steps in parallel with pattern-matching to the final answer, and the two processes were not causally coupled. That is the opposite of reasoning. Reasoning is when the steps cause the conclusion. What chain-of-thought prompting often produces is a conclusion that was already determined by pattern matching, followed by a plausible post-hoc rationalization of that conclusion. That is not a small distinction. It is the whole game.
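The intervention logic behind that kind of experiment can be shown with a minimal sketch. Everything here is an assumption made for illustration: `faithful_model` and `rationalizing_model` are hypothetical toy functions, not real language models, and the test is a simplified version of the idea rather than the procedure used in the cited work. The sketch forces a wrong intermediate step and checks whether the final answer moves.

```python
def faithful_model(a, b, c, forced_step=None):
    """Computes (a + b) * c; the final answer is derived from the stated step."""
    step = a + b if forced_step is None else forced_step
    return step, step * c

def rationalizing_model(a, b, c, forced_step=None):
    """Emits a step for show, but the answer comes from a separate path."""
    step = a + b if forced_step is None else forced_step
    return step, (a + b) * c  # the stated step is ignored

def answer_depends_on_step(model, a, b, c):
    """Intervene on the intermediate step and check whether the answer changes."""
    step, baseline = model(a, b, c)
    _, after = model(a, b, c, forced_step=step + 1)  # deliberately wrong step
    return after != baseline

for model in (faithful_model, rationalizing_model):
    print(model.__name__, answer_depends_on_step(model, 2, 3, 4))
```

Without the intervention the two toy models produce identical transcripts, identical steps and identical answers. The difference only becomes visible when the step is corrupted, which is the sense in which steps either cause the conclusion or merely accompany it.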
The speed question connects back to the economics of AI deployment in a way that I think deserves to be made explicit. When organizations evaluate AI systems for deployment in high-stakes settings, one of the most frequently cited advantages is speed. The system can process a thousand documents in the time it takes a human analyst to read three. The system can generate a first draft of a legal brief in seconds compared to the hours a junior associate would need. The system can evaluate loan applications at a rate that would require a hundred human underwriters. All of these speed advantages are real, and they are economically valuable, and they are part of why organizations are deploying these systems at the scale they are. But the comparison is being made on the wrong dimension. The relevant question is not whether the AI system is faster than a human at retrieving and organizing information. The relevant question is whether the AI system's outputs are as reliable as a human expert's outputs in cases that require genuine reasoning rather than pattern matching within the training distribution. And the answer to that question, for currently deployed systems in currently deployed settings, is consistently no, as documented in legal, medical, financial, and scientific contexts where AI-assisted decisions have been compared against ground truth (12). The systems are fast and impressive within their training distribution. They are unreliable in novel situations that require reasoning. And the novel situations are exactly the ones where speed matters most and where errors are most costly.
I want to bring this down to something concrete from my own experience building systems, because I think the abstract argument lands better when it is grounded. When I was working on systems that needed to make reliable decisions under uncertainty, the most important thing I learned was what engineers call the difference between a system that is right and a system that sounds right. A system that sounds right produces fluent, confident, well-formatted outputs that are often correct. A system that is right produces outputs that are provably connected to the information they were computed from, that degrade gracefully when that information is incomplete, and that signal uncertainty when the underlying computation cannot produce a reliable answer. The first kind of system is easy to demonstrate to stakeholders and hard to catch in its failures until the failures have real consequences. The second kind of system requires more design effort and more epistemic humility, and it produces outputs that look less impressive in a demo, but it is the system you actually want when real decisions depend on it. Language models as currently built are the first kind of system. They sound right. A genuinely intelligent system would be the second kind: right, in the sense of being connected to the truth by a traceable chain of reasoning, not just in the sense of producing outputs that match the expected surface form.
The Benchmark Trap: How We Trained Ourselves to Celebrate the Wrong Thing
One of the most reliable signs that a field has lost track of what it was trying to measure is when its benchmarks become the goal rather than the measure, and modern AI has been in this trap for long enough that most people inside the field have stopped noticing the trap. I wrote about this specifically in the context of ARC-AGI in Rethinking ARC‑AGI, where I documented how version one of that benchmark was undermined by brute force search and how even the improved version two still measures something narrower than the general reasoning ability it was intended to capture. But the ARC-AGI problem is only one instance of a much more general problem, which is that nearly every benchmark used to evaluate AI progress measures performance in ways that can be achieved through sophisticated pattern matching rather than genuine intelligence, and when a system achieves high performance on such a benchmark, we announce progress toward intelligence when we have only measured progress toward benchmark performance. Those two things are not always the same, and the history of AI benchmarks is the history of systems achieving high performance by learning the distribution of the benchmark rather than by learning the underlying capability the benchmark was intended to proxy.
The history of this problem is long enough to be instructive. Language understanding benchmarks like GLUE and SuperGLUE were introduced to measure natural language understanding, and models quickly achieved human-level performance on them, leading to announcements of human-level language understanding. But when researchers examined the same models with carefully constructed diagnostic tests designed to check whether the models were using the right kind of information to achieve their performance, they consistently found that models were using spurious correlations in the datasets, artifacts of data collection that correlated with the correct label but that had nothing to do with the linguistic understanding the benchmark was supposed to measure (13). The models had learned the benchmark distribution, not natural language understanding. This phenomenon is so well-documented and so frequently rediscovered that it is usually described as an instance of Goodhart's Law, which states that when a measure becomes a target, it ceases to be a good measure. Every time a new benchmark is introduced, systems eventually optimize for the benchmark's specific distribution, and every time that happens, the community either acknowledges the limitation and moves to a harder benchmark or, more commonly, continues to cite the benchmark performance as evidence of the capability it was supposed to measure. The trap resets and closes again around a new target.
What makes this particularly damaging to the public conversation about AI is that each benchmark milestone gets reported in the popular press as evidence of AI approaching or surpassing human intelligence in a specific domain, and those reports shape the expectations of the people who make decisions about AI policy, AI deployment, and AI investment. When a model achieves human-level performance on a medical licensing exam, headlines announce that AI is ready to be a doctor. When a model achieves expert-level performance on a bar exam, headlines announce that AI is ready to practice law. These headlines are not technically false in a narrow sense, but they are deeply misleading in a broader sense, because they interpret performance on a standardized test as evidence of the real-world capability that the test was designed to proxy, and the evidence consistently shows that the proxy relationship is weaker than it sounds. Research comparing AI system performance on medical licensing exams to AI system performance on actual clinical reasoning tasks has found that the high exam scores do not translate to reliable clinical reasoning, precisely because clinical reasoning requires adapting to novel patient presentations that differ from the training distribution, while exam performance can be largely achieved through pattern matching to the distribution of exam questions (14). The benchmark is the map. The capability is the territory. And when the map becomes the goal, the territory is forgotten.
I want to talk about what a benchmark for intelligence would actually need to look like, because I think it is possible to design better evaluations and I want to be constructive rather than only critical. A benchmark for genuine intelligence, not performance on a fixed distribution of tasks, would need several properties that current benchmarks lack. First, it would need to use genuinely novel tasks, tasks that were designed specifically to lie outside any plausible training distribution, so that pattern matching to seen examples is definitionally impossible. Second, it would need to test transfer: after exposing the system to a novel domain long enough to learn its basic structure, can the system apply that structure to new problems that were not part of the introduction, the way a genuinely intelligent person would? Third, it would need to test self-diagnosis: can the system accurately identify when it does not know something and appropriately signal uncertainty rather than producing confident-sounding outputs that happen to be wrong? Fourth, it would need to test systematic compositionality: can the system combine known concepts in genuinely novel configurations and correctly predict the properties of the combination without having seen that specific combination during training? ARC-AGI was attempting to measure some of these properties, which is why it is more interesting than most benchmarks, but the execution requirements have proven easier to satisfy through non-intelligent means than the designers hoped. Designing a truly contamination-proof evaluation of genuine reasoning ability is hard, and the field mostly responds to that difficulty by lowering the bar rather than by doing the hard design work.
The economic incentives behind benchmark culture deserve explicit attention, because they are the fuel that keeps the trap running. Companies that build AI systems have strong incentives to report benchmark performance because benchmark numbers are legible to investors, partners, and journalists in a way that nuanced capability descriptions are not. A statement that a model achieves 90 percent accuracy on a named benchmark communicates something that sounds precise and impressive and that can be compared directly to previous models, even if the benchmark is measuring something that is not actually what anyone cares about. The incentive to report on benchmarks and the incentive to optimize for benchmarks are the same incentive, operating at different stages of the development pipeline. The result is that the entire industry moves in the direction of benchmark optimization rather than in the direction of genuine capability development, because benchmark optimization produces numbers that look good in press releases, while genuine capability development produces systems that work reliably in real-world deployments and are harder to reduce to a single number. I have seen this same dynamic in every technology-driven industry I have worked in or alongside, and it always ends in the same hollowness: a gap between the reported capabilities and the actual performance that users eventually discover, that researchers document, and that the press eventually reports on, usually much later than the evidence warranted.
I want to be fair to the researchers who design these benchmarks, because most of them know their limitations and say so clearly in their papers. The problem is not that benchmark designers are dishonest. The problem is that the gap between what a benchmark can measure and what the field needs to know gets systematically collapsed in the translation from paper to press release to public understanding. A paper that introduces a new benchmark typically includes extensive caveats about what the benchmark does and does not measure, which aspects of the task are most likely to be achieved through pattern matching, and what the results should and should not be taken to imply. Those caveats are in the paper. They are not in the press release. They are certainly not in the Twitter thread or the LinkedIn post that most people in the public actually read about the result. The field has a responsibility to close that gap, and so far it has mostly chosen not to, because closing the gap would require saying things that are more complicated and less impressive than the simple narrative of steadily increasing AI capabilities measured by steadily improving benchmark performance. I am saying those complicated things here because somebody has to say them plainly, and because I think the people who read these posts deserve the plain version more than they deserve the comfortable one.
The right response to the benchmark trap is not to give up on evaluation. It is to design evaluations that measure what we actually care about, to interpret benchmark results with appropriate humility about their scope, and to resist the pressure to translate every number into a narrative about general AI capability that the number cannot support. This is easier said than done in an environment where every major AI lab is in competition for investor confidence and public attention. But the researchers who are doing this work honestly, who design their benchmarks carefully and interpret their results conservatively, are the ones whose work will look good in ten years, when the gap between the benchmark numbers and the real-world performance of the systems those numbers were supposed to represent has become impossible to ignore. I have enormous respect for that kind of scientific honesty because I know how hard it is to maintain when the surrounding culture rewards the opposite.
What Real Intelligence Would Look Like, If We Built It
I have spent several sections describing what intelligence is not, and I owe the reader something more than that. I owe a description of what real intelligence would look like if a machine actually had it, because the argument that current systems are not intelligent is only useful if there is a direction the field could go toward instead. I am not going to pretend I have a fully worked-out engineering plan for building a genuinely intelligent machine, because I do not, and anyone who claims to have such a plan should be viewed with extreme skepticism. What I do have is a set of properties that any system would need to have in order to earn the word intelligence without putting quotation marks around it, and I think describing those properties is valuable because it clarifies what the goal actually is and how far away we currently are from it.
The first property a genuinely intelligent system would need is a generative causal model of the domain it operates in. Not a collection of associations between inputs and outputs. Not a set of retrieved documents that happen to be relevant to the query. A structural model that encodes the causal relationships between variables in the domain: how this causes that, why changing X changes Y in this specific way, what would happen if you intervened on Z. I argued in Mathematical Equations are Multimodal by default that mathematical equations are the most compressed and honest form of such models for physical domains, and I stand by that argument. For non-physical domains, the structure of causal models is less obvious, but the principle is the same: a model that encodes mechanisms rather than associations is a model that can reason genuinely rather than retrieve approximately. Research on causal representation learning has made real progress toward systems that can learn causal structure from data, and I believe this line of work is more important for the future of real AI than all of the work on scaling language models combined (15). It is harder, it is slower, it is less impressive in demos, and it is the right direction.
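The difference between encoding an association and encoding a mechanism can be made concrete with a small simulation. This is a sketch under stated assumptions: the three-variable model (a hidden cause Z that drives both X and Y, with no X to Y edge) and all of its probabilities are invented for illustration and do not come from any cited study. It compares what you see when you observe X with what happens when you intervene on X.

```python
import random

def sample(rng, do_x=None):
    """One draw from a toy structural causal model: Z -> X, Z -> Y, no X -> Y edge."""
    z = rng.random() < 0.5
    # Without intervention X follows Z; with do_x, X is set from outside.
    x = (rng.random() < (0.9 if z else 0.1)) if do_x is None else do_x
    y = rng.random() < (0.8 if z else 0.2)  # Y depends on Z only
    return x, y

def mean_y(pairs, x_value):
    """Observed frequency of Y among draws where X equals x_value."""
    ys = [y for x, y in pairs if x == x_value]
    return sum(ys) / len(ys)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Observation: watch X and Y as they naturally occur.
observed = [sample(rng) for _ in range(n)]
assoc_gap = mean_y(observed, True) - mean_y(observed, False)

# Intervention: set X by hand, which cuts the Z -> X link.
forced_on = [sample(rng, do_x=True) for _ in range(n)]
forced_off = [sample(rng, do_x=False) for _ in range(n)]
causal_gap = mean_y(forced_on, True) - mean_y(forced_off, False)

print(f"observational gap: {assoc_gap:.2f}  interventional gap: {causal_gap:.2f}")
```

Under these assumed probabilities the observational gap is 0.48 in expectation, while the interventional gap is 0 in expectation, because changing X by hand does nothing to Y. A system that has only stored the association would predict that setting X changes Y. A system that holds the structural model would not.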
The second property is systematic compositionality, which I already described earlier in this post. A genuinely intelligent system should be able to take known concepts and combine them in novel ways to produce new thoughts, new predictions, and new understanding about situations that were never in its training data. This is a property that human intelligence has reliably, that current AI systems lack reliably, and that researchers have been trying to build into neural systems since the early days of connectionism. The fundamental problem is that the kind of representations needed for systematic compositionality are structured representations, where the parts maintain their identity when combined and where the combination respects the structure of both parts, but the kinds of representations that neural networks learn naturally are distributed representations, where information is spread across many neurons and where composition is achieved through learned associations rather than through structural operations. There are architectures designed to bridge this gap, including memory-augmented neural networks, slot-based representation systems, and neural symbolic hybrids, and the research on them is promising but not yet at the capability level needed to demonstrate genuinely systematic compositionality at scale (16). The problem is real and the research is real and the gap from current language models is real and large.
The third property is honest uncertainty representation. A genuinely intelligent system would not produce fluent, confident-sounding outputs when it does not have a good basis for confidence. It would signal uncertainty in proportion to the actual uncertainty in its knowledge, flag when a question is outside the scope of what it can reliably answer, and defer to other sources when those sources are more reliable. This is a property that humans have imperfectly but that we recognize as epistemically virtuous when we see it, and that we correctly identify as a fault when it is absent. Calibrated confidence is not just a nice-to-have property of intelligence. It is a fundamental epistemic capability, the ability to know what you know and to know what you do not know, and Socrates was identified as the wisest man in Athens specifically because he had this capacity when everyone else lacked it. Current AI systems are systematically overconfident in ways that are well-documented (17). They produce detailed, confident answers on topics where no confident answer is warranted, and they do so because their training objective rewards producing answers that look good, not answers that accurately represent the system's epistemic state. Fixing this requires either changing the training objective fundamentally, which is technically difficult, or acknowledging that the systems are not intelligent in the sense that includes self-knowledge, which is philosophically important.
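One common way to quantify the gap between stated confidence and actual reliability is expected calibration error. The sketch below is a minimal pure-Python version with equal-width bins. The function name, the bin count, and the toy inputs are my own illustrative choices, not figures from the cited work.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical toy data: ten answers, each stated with 0.9 confidence.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False] * 1)
print(f"overconfident ECE={overconfident:.2f}, calibrated ECE={calibrated:.2f}")
```

In the first toy case the system claims 90 percent confidence and is right half the time, so the error is about 0.4. In the second it claims 90 percent and is right nine times out of ten, so the error is about zero. The fluency of the two sets of answers could be identical; only the relationship between confidence and correctness differs.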
The fourth property is genuine novelty generation, which is the capacity to produce something that was not in any sense present in the training data and that represents a real advance beyond what was already known. This is the hardest property to evaluate and the one most likely to be confused with sophisticated interpolation. Language models can produce outputs that seem novel because they combine elements in ways that have not been seen before, but the combination is still a function of the training data in a way that true novelty is not. When a human scientist makes a genuinely novel discovery, they are not interpolating between known states of the training distribution. They are finding a pattern in the world that nobody had found before, using a model of the world that they built from scratch through years of engagement with the domain. That kind of genuine novelty generation is what I described in Announcing Kevin RS, where the goal of the project is to build systems that can discover things that are genuinely new rather than systems that can fluently describe things that are already known. The difference between those two goals is the difference between the past and the future of AI, and I do not think the field is taking that distinction seriously enough.
The fifth property, which might be the most uncomfortable to say out loud, is the capacity for genuine motivation. A genuinely intelligent system would need some internal representation of what it is trying to do and why, some sense of the difference between making progress and not making progress, some capacity to care about the quality of its own understanding rather than just the quality of its outputs. I am not making a claim about consciousness here, because I do not think consciousness is required for intelligence in the functional sense I am describing. I am making a claim about the distinction between a system that optimizes an external objective and a system that has internalized a goal structure that it pursues on its own terms. Current AI systems optimize external objectives, specifically the objectives encoded in their training losses and reinforcement signals. They do not have internalized goals that they pursue intelligently in their own right. That is why they are fundamentally tools rather than agents, even when they are called agents in the marketing materials. A tool does what it is pointed at. An agent does what it cares about. Intelligence lives on the agent side of that distinction, and current systems are on the tool side regardless of how agentic their packaging appears.
The Gap Between Capability and Understanding
The argument I have made in this post could sound like an academic debate about definitions, about what counts as intelligence versus capability versus understanding. I want to be very clear that it is not. The gap between genuine intelligence and the impressive-but-not-intelligent systems we are currently deploying at massive scale has consequences for real people in real situations, and I want to make those consequences concrete because they are the reason I am writing this post rather than keeping the argument inside the technical literature where it might be safely ignored by everyone who needs to hear it most. I described some of these consequences in Technology Has Destroyed My Livelihood, but here I want to go deeper into the specific ways that the conflation of capability with intelligence is causing harm that could have been avoided if the people deploying these systems had been honest about what they actually had.
The legal system is one of the most consequential domains in which AI systems are being deployed, and it is a domain in which the difference between genuine reasoning and sophisticated pattern matching is the difference between justice and injustice. Courts have already encountered cases where AI systems were used to assess recidivism risk in sentencing recommendations, and where the outputs of those systems, produced by pattern matching over historical criminal justice data that encodes decades of racially biased policing and prosecution practices, were treated as objective assessments of individual defendants' future behavior. The COMPAS system is the most documented example: journalists at ProPublica found that the system's errors fell unevenly, with Black defendants who did not go on to reoffend far more likely to have been labeled high risk than white defendants who did not reoffend (18). The system was not intelligent. It was pattern matching, and the patterns it learned from were discriminatory patterns encoded in historical data. The people who deployed it and the judges who used its outputs treated it as if it were providing intelligent assessment, and that misclassification of pattern matching as intelligence produced documented injustice. That is not an academic consequence. It is a consequence measured in years of people's lives.
The healthcare deployment of AI systems raises the same constellation of concerns at a scale that is only growing. AI-assisted diagnostic systems, clinical decision support tools, triage algorithms, and treatment recommendation engines are being deployed in hospitals and clinics around the world, and the evaluations presented in marketing materials typically measure performance on test sets drawn from the same distribution as the training data, which tells us almost nothing about how the systems perform on the actual patients who will be affected by their outputs. Patients who fall outside the demographic distribution of the training data, patients who present with atypical symptoms, patients whose conditions are rare or novel, these are exactly the patients for whom the difference between real reasoning and pattern matching matters most, and these are exactly the patients for whom well-documented AI diagnostic failures tend to be concentrated. Research has shown that AI diagnostic systems trained predominantly on images from lighter-skinned patients perform significantly worse on darker-skinned patients, not because of any intentional design choice but because skin tone correlates with diagnostic signal in ways that differ systematically across demographic groups, and pattern matching over a biased training distribution learns the biased pattern (19). A genuinely intelligent diagnostic system would reason from biological mechanisms rather than matching to training patterns, and its performance would not depend on whether the patient looks like the patients in the training set. No currently deployed system has that property.
The educational technology sector is deploying AI at an even larger scale, and the consequences here are subtler but no less real. When AI writing assistants and question-answering systems are used extensively in educational settings, and students come to rely on them for the kind of thinking that education is supposed to develop, the students absorb the pattern: retrieve from the AI, not think it through yourself. This might be acceptable if the AI were actually reasoning, because then the student would at least be learning from genuine reasoning. But if the AI is pattern-matching and the student is copying the pattern-match output while bypassing the reasoning process, then the student is not learning to reason. They are learning to use a tool that produces outputs that look like the results of reasoning, which is a very different thing. Education is supposed to develop the capacity for intelligence in students. AI tools that substitute for that development rather than scaffolding it are doing the opposite of what education is for. And the teachers who deploy these tools with good intentions, wanting to save students time or to differentiate instruction, are often not informed about the difference between a tool that reasons and a tool that pattern-matches, because the marketing materials for these tools do not make that distinction clearly, and because the research literature on the difference is not easily accessible to non-specialist practitioners.
The economic consequences of the intelligence gap play out along the lines I discussed in As Engineers, LLMs should pay us for tokens usage and in Training Is an Evil Concept. LMMs Eliminates it Altogether. When people and organizations are told that AI systems are intelligent, they make decisions about labor, skill development, and investment on the basis of that characterization. Jobs that require actual reasoning are restructured around AI systems that do not actually reason, leading to output quality that is worse than it would have been with a human expert but that is harder to audit because the AI-produced material is fluent and plausible enough to pass cursory review. The humans who were doing the reasoning are displaced. The AI systems do not have the intelligence to actually replace them. The organizations end up with lower-quality outputs and higher reputational risk when the failures become visible, and nobody is accountable because the decision to deploy was based on benchmark numbers interpreted as intelligence claims rather than on an honest assessment of what the systems could and could not actually do. This is not a hypothetical. It is already happening in software development, in legal services, in content production, in financial analysis, and in dozens of other domains where AI has been deployed on the basis of capability claims that conflated pattern matching with reasoning.
The most dangerous long-term consequence of the intelligence gap is epistemic, and it is the consequence that I think about most. If a large fraction of the information that people encounter is produced by systems that pattern-match rather than reason, and if people cannot easily distinguish AI-produced content from content produced by genuine reasoning, then the epistemic environment degrades in a specific and serious way. People learn to accept fluent, confident-sounding content as a proxy for reliable content, because that has always been a reasonable heuristic when fluent, confident-sounding content was mostly produced by people who had to know what they were talking about to produce it. AI changes the base rate: now fluent, confident-sounding content can be produced in unlimited quantities by systems that know nothing in the honest sense of the word, and the traditional heuristic fails. I wrote about this specific dynamic in LLMs destroyed the Internet. LMMs will make it alive., where I described how the mass deployment of content generation has already degraded the reliability of the web as an information environment. The intelligence gap is the root cause of that degradation, and it will not be fixed by better content moderation or by users becoming more skeptical. It will only be fixed by building systems that actually reason, so that fluency and confidence can once again be reasonable proxies for reliability.
Where We Go From Here, and What Honest Progress Would Look Like
I want to end this post by saying something constructive, because I am aware that I have spent a long time taking apart the current state of affairs, and taking things apart without pointing toward what a repaired version would look like is the habit of cynics, and I refuse to be a cynic even when cynicism is easy. I believe in the possibility of building genuinely intelligent machines. Not because current systems are almost there, but because I believe the scientific problems that need to be solved to build them are genuine problems that science can make progress on, and I believe the people working on those problems are making real progress, even if that progress is slower and less glamorous than the scaling-based progress that gets most of the attention. The right direction is not more scale applied to the current architecture. The right direction is fundamentally different architectures that prioritize the properties I described in the previous sections: causal generative models, systematic compositionality, calibrated uncertainty, genuine novelty generation, and something like internalized goals.
Causal machine learning is one of the most important and underinvested directions in AI research. The work of Judea Pearl on causal inference and do-calculus has given the field a rigorous mathematical framework for reasoning about causation rather than correlation, and researchers are beginning to extend this framework in ways that could eventually be realized in learned systems (15). Bernhard Schölkopf's group has been developing the theory of causal representation learning, which aims to learn the causal variables and their structural relationships from observational data rather than requiring explicit experimental intervention (20). These directions are hard. They require theoretical innovation at least as much as they require computational scale. They produce results that are harder to demonstrate impressively in a short demo than language model capabilities. And they are the right direction, because they are building toward systems that can genuinely understand rather than systems that can fluently retrieve. The lmm project I described in Training Is an Evil Concept. is one concrete implementation that moves in this direction, with symbolic regression for equation discovery, physics simulation, and explicit causal reasoning. It is a proof of concept, not a complete system, and I say that honestly, but proof of concept matters because it proves that the alternative exists and is engineerable.
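The difference between correlation and causation that Pearl's framework formalizes can be shown with a small worked example. What follows is a minimal sketch, not the lmm project's code and not any library's API: a hypothetical three-variable structural causal model in which a confounder Z drives both X and Y, and X has no causal effect on Y at all. All probabilities are invented for illustration. The sketch compares the observational quantity P(Y=1 | X=x) with the interventional quantity P(Y=1 | do(X=x)), the latter computed by backdoor adjustment over Z.

```python
# Hypothetical structural causal model (numbers invented for illustration):
#   Z -> X and Z -> Y, with NO arrow from X to Y.
P_Z = {0: 0.5, 1: 0.5}             # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}    # P(X=1 | Z=z)


def p_y1_given_xz(x, z):
    """P(Y=1 | X=x, Z=z). Y depends only on Z, so x is ignored."""
    return 0.9 if z == 1 else 0.1


def joint(x, y, z):
    """P(X=x, Y=y, Z=z) under the model's factorization."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py1 = p_y1_given_xz(x, z)
    py = py1 if y == 1 else 1 - py1
    return P_Z[z] * px * py


def observational(x):
    """P(Y=1 | X=x): what a pattern matcher sees in passively collected data."""
    numerator = sum(joint(x, 1, z) for z in (0, 1))
    denominator = sum(joint(x, y, z) for y in (0, 1) for z in (0, 1))
    return numerator / denominator


def interventional(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment: sum_z P(Y=1 | x, z) P(z)."""
    return sum(p_y1_given_xz(x, z) * P_Z[z] for z in (0, 1))


for x in (0, 1):
    print(f"x={x}  observed={observational(x):.2f}  do={interventional(x):.2f}")
```

In this toy model the observed probability of Y moves from about 0.26 to about 0.74 as X changes, while the interventional probability stays at 0.50 either way. A system that only fits the observational distribution would conclude that X matters; a system with the causal graph would conclude, correctly, that forcing X changes nothing.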
Neurosymbolic AI is another direction worth watching, because it attempts to combine the pattern recognition strengths of neural networks with the structural reasoning strengths of symbolic systems. The symbolic AI tradition, which dominated the field before the deep learning wave and which was prematurely written off as obsolete when deep learning achieved its first impressive results, has real strengths in exactly the areas where neural networks are weakest: systematic compositionality, explicit causal reasoning, logical inference, and honest uncertainty representation. The neurosymbolic research agenda is trying to build hybrid architectures that get the best of both worlds, and while the engineering challenges are significant, the theoretical case for the approach is strong. Vaishak Belle and Gary Marcus have argued that neurosymbolic integration is the most promising route to systems that combine the learning capabilities of neural networks with the reasoning capabilities of symbolic systems (16). The research is real, the progress is real, and the goal is architecturally appropriate in a way that pure scaling of language models is not.
Honest benchmark design is another place where the field could make genuine progress without waiting for fundamental architectural breakthroughs. If the research community committed to designing benchmarks that are genuinely contamination-proof, that test transfer rather than memorization, that test systematic compositionality rather than interpolation within the training distribution, and that include measures of calibration and uncertainty quantification alongside measures of accuracy, the field would at least have better information about how far current systems are from genuine intelligence, and that honest information is more valuable than inflated benchmark numbers in the long run. Some researchers are already moving in this direction: the BIG-Bench project, the HELM benchmark suite, and the various adversarial evaluation approaches that have been developed in recent years all represent genuine attempts to evaluate capability more honestly. These efforts deserve more support, more institutional recognition, and more attention in the press than they currently receive.
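As one example of what a calibration measure alongside accuracy could look like, here is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The function name, the bin count, and the toy inputs are my own illustrative assumptions; this is not the metric definition used by BIG-Bench or HELM, which have their own evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     1 if the corresponding answer was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(hit for _, hit in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy inputs: both systems get 6 of 10 answers right.
outcomes = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
overconfident = [0.95] * 10   # claims 95% confidence on everything
calibrated = [0.60] * 10      # claims 60% confidence on everything

print("overconfident ECE:", round(expected_calibration_error(overconfident, outcomes), 3))
print("calibrated ECE:   ", round(expected_calibration_error(calibrated, outcomes), 3))
```

Both toy systems have identical accuracy, so an accuracy-only leaderboard cannot tell them apart. The calibration measure can: the overconfident system has an ECE of about 0.35, and the calibrated one has an ECE of about zero.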
For me personally, the path forward is the same thing it has always been, which is writing what I believe honestly and building what I can concretely. I believe that knowledge is not intelligence. I believe that tools extend reach but do not create understanding. I believe that genuine intelligence requires causal models of the world, systematic compositional reasoning, calibrated uncertainty, and something like goals that go beyond optimizing an external loss function. I believe that the gap between what current AI systems have and what genuine intelligence requires is large, is structural rather than merely a matter of scale, and matters enormously for every domain in which these systems are being deployed. And I believe that saying these things plainly, in simple words, is more useful than softening them into a more comfortable form, because the comfortable form has been available for years and has not moved the conversation in the direction that needs to be moved.
The people who will build genuinely intelligent machines are not necessarily the people at the largest labs with the most compute. They are the people who are working on the right problems, building causal models rather than larger language models, developing a theory of compositionality rather than collecting more training data, designing honest evaluations rather than finding better benchmarks to optimize for. Those people exist. Their work is real. It is getting funded less than it should be. It receives less attention than it deserves. And it is more important for the actual future of AI than almost anything currently being discussed in the mainstream AI conversation. I have enormous respect for those researchers because I know what it is like to be working on something real in an environment that rewards something flashier, and I know that the determination required to keep working on the right thing when the rewards keep flowing to the wrong thing is a form of intelligence in itself.
The title of this post is a description of the actual situation, not a slogan. Every AI system that calls a function, searches a database, retrieves a document, or generates a response is doing something with knowledge and tools. None of them are doing the thing I have called intelligence in this post. That is where we are. The direction I have described is where we need to go. The gap between those two points is the most important unsolved problem in artificial intelligence, and honest acknowledgment of that gap is the necessary first step toward actually closing it. I do not know when it will be closed. I do not know whether I will be around to see it. But I know that it is worth working toward and worth being honest about, and those two things are enough to keep going.
Till next time 👋!
References
1. Bender, E. M. et al., On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, ACM FAccT 2021
2. Obermeyer, Z. et al., Dissecting racial bias in an algorithm used to manage the health of populations, Science, 2019
3. Kambhampati, S. et al., LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks, arXiv:2402.01817
4. Strubell, E. et al., Energy and Policy Considerations for Deep Learning in NLP, arXiv:1906.02243
5. Lake, B. M. et al., Building Machines That Learn and Think Like People, Behavioral and Brain Sciences, 2017
6. Dziri, N. et al., Faith and Fate: Limits of Transformers on Compositionality, arXiv:2305.18654
7. McCarthy, J. & Hayes, P. J., Some Philosophical Problems from the Standpoint of Artificial Intelligence, Machine Intelligence, 1969
8. Marcus, G., Deep Learning: A Critical Appraisal, arXiv:1801.00631
9. Damasio, A., Descartes' Error: Emotion, Reason, and the Human Brain, G. P. Putnam's Sons, 1994
10. McCoy, R. T. et al., Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, arXiv:1902.01007
11. Lanham, T. et al., Measuring Faithfulness in Chain-of-Thought Reasoning, arXiv:2307.13702
12. Cabitza, F. et al., Unintended consequences of machine learning in medicine, JAMA, 2017
13. Gururangan, S. et al., Annotation Artifacts in Natural Language Inference Data, arXiv:1803.02324
14. Kanjee, Z. et al., Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge, JAMA, 2023
15. Pearl, J., Causality: Models, Reasoning, and Inference, Cambridge University Press, 2009
16. Belle, V. & Marcus, G., The Future Is Neuro-Symbolic: Where Has It Been, and Where Is It Going?, Proceedings of the AAAI Conference on Artificial Intelligence, 2026
17. Kadavath, S. et al., Language Models (Mostly) Know What They Know, arXiv:2207.05221
18. Angwin, J. et al., Machine Bias, ProPublica, 2016
19. Groh, M. et al., Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset, arXiv:2104.09957
20. Schölkopf, B. et al., Toward Causal Representation Learning, arXiv:2102.11107