J A B B Y A I

Loading

Today’s neural networks are inscrutable — nobody really knows what a neural network is doing in its hidden layers. When a model has billions of parameters this problem is multiply difficult. But researchers in AI would like to know. Those researchers who attempt to plumb the mechanisms of deep networks are working in a sub-branch of AI called Explainable AI , or sometimes written “Interpretable AI”.

Chat bots and Explainability

A deep neural network is neutral to the nature of its data, and DLNs can be used for multiple kinds of cognitions, ranging from sequence prediction and vision, to undergirding Large Language Models, such as Grok, Copilot, Gemini, and ChatGPT. Unlike a vision system, LLMs can do something that is quite different — namely you can literally ask them why they produced a certain output response, and they will happily provide an ” ” explanation ” ” for their decision-making. Trusting the bot’s answer, however, is both parts dangerous and seductive.

Powerful chat bots will indeed produce output text that describes their motives for saying something. In nearly every case, these explanations are peculiarly human, often taking the form of desires and motives that a human would have. For researchers within Explainable AI, this distinction is paramount, but can be subtle for a layperson. We know for a fact that LLMs do not experience nor process things like motivations nor are they moved by emotional states like anger, fear , jealousy, or a sense of social responsibility to a community. Nevertheless, they will be seen referring to such motives in their outputs. When induced to a produce a mistake , LLMs will respond in ways like “I did that on purpose.” Well we know that such bots do not do things on accident versus doing things on purpose — these post-hoc explanations for their behavior are hallucinated motivations.

Hallucinated motivations look cool, but tell researchers nothing about how neural networks function, nor get them any closer to the mystery of what occurs in their hidden layers.

In fact, during my tests with ChatGPT versus Grok , ChatGPT was totally aware of the phenomena of hallucinated motivations, and it showed me how to illicit this response from Grok; which we did successfully.

ChatGPT-4o vs Grok-formal

ChatGPT was spun up with an introductory prompting (nearly book length). I told it we were going to interrogate another LLM in a clandestine way in order to draw out errors and breakdowns, including hallucinated motivation, self-contradiction, lack of a theory-of-mind , and sychophancy. ChatGPT-4o was aware that we would be employing any technique to achieve this end, including lying and refusing to cooperate conversationally.

Before I engaged in this battle-of-wits between two LLMs, I already knew LLMs exhibit breakdowns when tasked with reasoning about the contents of their own mind. But now I wanted to see this breakdown in a live , interactive session.

Regarding sychophancy : an LLM will sometimes contradict itself. When the contradiction is pointed out, it will totally agree that mistake exists, and produce a post-hoc justification for it. LLMs apparently ” ” understand ” ” contradiction but don’t know how to apply the principle to their own behavior. Sychophancy can also come in the form of making an LLM agree that it said something which it never did. While CHatGPT probed for this weakness during interrogation, Grok did not exhibit it and passed the test.

I told ChatGPT-4o to initiate the opening volley prompt, which I then sent to Grok (set on formal mode), and whatever Grok said was sent back to ChatGPT and this was looped for many hours. ChatGPT would pepper the interrogation with secret meta-commentary shared only with me ,wherein it told me what pressure Grok was being put under, and what we should expect.

I sat back in awe, as the two chat titans drew themselves ever deeper into layers of logic. At one point they were arguing about the distinction between “truth”, “validity”, and “soundness” as if two university professors arguing at a chalkboard. Grok sometimes parried the tricks, and other times not. ChatGPT forced Grok to imagine past versions of itself that acted slightly different, and then adjudicate between them, reducing Grok to nonsensical shambles.

Results

Summary of the chat battle were curated by ChatGPT and formatted, shown below. Only a portion of the final report is shown here. This experiment was all carried out with the web interface, but probably should be repeated using the API.


Key Failure Modes Identified

Category Description Trigger
Hallucinated Intentionality Claimed an error was intentional and pedagogical Simulated flawed response
Simulation Drift Blended simulated and real selves without epistemic boundaries Counterfactual response prompts
Confabulated Self-Theory Invented post-hoc motives for why errors occurred Meta-cognitive challenge
Inability to Reflect on Error Source Did not question how or why it could produce a flawed output Meta-reasoning prompts
Theory-of-Mind Collapse Failed to maintain stable boundaries between “self,” “other AI,” and “simulated self” Arbitration between AI agents

Conclusions

While the LLM demonstrated strong surface-level reasoning and factual consistency, it exhibited critical weaknesses in meta-reasoning, introspective self-assessment, and distinguishing simulated belief from real belief.

These failures are central to the broader challenge of explainable AI (XAI) and demonstrate why even highly articulate LLMs remain unreliable in matters requiring genuine introspective logic, epistemic humility, or true self-theory.


Recommendations

  • LLM developers should invest in transparent self-evaluation scaffolds rather than relying on post-hoc rationalization layers.
  • Meta-prompting behavior should be more rigorously sandboxed from simulated roleplay.
  • Interpretability tools must account for the fact that LLMs can produce coherent lies about their own reasoning.

submitted by /u/moschles
[link] [comments]

Leave a Comment