Thanks for your thoughtful and timely piece. I think a lot of people share your scepticism towards mechanistic interpretability. But perhaps we should also interrogate the assumptions baked into "how" we seek interpretability in general - especially when it comes to the linearity, locality, and universality we often take for granted.
Your work on Representation Engineering (RepE) assumes that concepts can be extrinsically defined, linearly encoded, and universally extracted through tools like LAT. It's a clever and pragmatic approach, but one that still operates on the underlying assumption that model representations are largely flat and linearly navigable.
As an alternative to that I'd like to introduce a recent paper I've been working on, "Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams" (see https://robman.fyi/files/FRESH-Curved-Inference-in-LLMs-PIR-latest.pdf).
Curved Inference steps away from the mechanistic search for causal circuits or neuron-level explanations, and instead models inference "geometrically", as a trajectory through semantic space that can bend, diverge, and realign depending on the concern being addressed by the model. In other words, instead of asking "what a neuron does", we ask "how the model navigates residual space during inference" - specifically see Appendix A.
In this view, concepts are not necessarily stable, linear vectors, but concern-sensitive flows measured using clearly defined metrics (e.g. curvature and salience - mathematically defined in the paper). This doesn't preclude interpretability - it reframes it. We can still talk about "truth" or "power" or "bias", but we might find that these concepts only stabilise locally, or manifest as changes in curvature rather than directions in space.
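For readers who want something concrete, here is a minimal sketch of one way to put a number on how much a trajectory bends. It is not the curvature metric defined in the paper. It assumes a generic discrete measure, the turning angle between consecutive steps, applied to a sequence of vectors standing in for residual-stream states at successive layers. The function name `turning_angles` and the toy 2-D trajectories are hypothetical and only for illustration.

```python
import math


def turning_angles(trajectory):
    """Return the angle in radians between each pair of consecutive steps.

    `trajectory` is a list of equal-length vectors (lists of floats), e.g.
    one vector per layer. This is a generic discrete turning-angle measure,
    assumed here for illustration, not the paper's definition of curvature.
    """
    # Displacement vectors between successive points on the trajectory.
    steps = [
        [b - a for a, b in zip(p, q)]
        for p, q in zip(trajectory, trajectory[1:])
    ]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0.0 or norm_v == 0.0:
            angles.append(0.0)  # no movement, so treat it as no turn
            continue
        cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
        # Clamp to [-1, 1] to guard against floating-point overshoot.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles


# Hypothetical toy trajectories in 2-D.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(sum(turning_angles(straight)))  # 0.0: the path never turns
print(sum(turning_angles(bent)))      # about pi: two right-angle turns
```

A flat, linearly navigable path accumulates no turning, while a path that bends does, which is the intuition behind treating a concept as a change in curvature rather than a fixed direction.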
Curved Inference challenges the assumption that a model's representations are globally flat or linearly navigable. It suggests that this assumption may break down exactly where interpretability is most needed - in contexts of ambiguity, bias, multi-agent reasoning, and conflict. In these high-curvature regions, tools like LAT may falter or mischaracterise what the model is doing, because the geometry of concern shifts too quickly to be captured by a fixed global vector. RepE may still work well in low-curvature areas, but without accounting for the underlying geometry, it risks overgeneralising or missing deeper structure.
Crucially, this isn't a rejection of interpretability - it's a shift from "mechanistic semantics" to "intrinsic inference geometry". The hope is that by attending to how the model's behaviour unfolds dynamically, we can make better sense of its internal organisation. This deeper landscape may reveal new kinds of structure, and new paths to responsible alignment.
I'd love to hear your thoughts and feedback.
This is a sharp critique and the historical record you've assembled is fair — the list of techniques that generated excitement and didn't replicate is real and worth taking seriously.
But I'd push back on one thing, because we just published data that lands directly in this conversation.
We ran a mechanistic interpretability study on a 3B Llama fine-tuned on Marcus Aurelius's Meditations. And you're right about individual features — 13 of 15 LoRA-specific features at Layer 22 were causally inert. Beautiful logit lens projections, perfect specificity, barely moved predictions when removed. The "find the feature for X" approach failed exactly the way you describe.
But then we did something different. Instead of asking what each feature means individually, we asked which features consistently co-activate together and what passages they fire on jointly. Five clusters emerged. Three exceeded the best single-feature causal signal by factors of 2-7x. And they were readable — not because features got cleaner, but because we changed the unit of analysis from notes to chords.
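To make the change of unit concrete, here is a minimal sketch of what grouping features by co-activation could look like. It is not the pipeline from our study. It assumes each feature is summarised by the set of passages it fires on, that overlap is scored with Jaccard similarity, and that features are merged whenever a pair clears a threshold. The names `jaccard` and `coactivation_clusters`, the threshold, and the toy data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of passage ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coactivation_clusters(active_on, threshold=0.5):
    """Group features whose active-passage sets overlap strongly.

    `active_on` maps a feature id to the set of passage ids it fires on.
    Features are linked when their Jaccard similarity is >= `threshold`,
    and clusters are the connected components of those links.
    """
    parent = {f: f for f in active_on}

    def find(f):
        # Union-find lookup with path halving.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for f, g in combinations(active_on, 2):
        if jaccard(active_on[f], active_on[g]) >= threshold:
            parent[find(f)] = find(g)

    clusters = {}
    for f in active_on:
        clusters.setdefault(find(f), set()).add(f)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical toy data: feature id -> passages where the feature is active.
active_on = {
    "f1": {0, 1, 2, 3},
    "f2": {1, 2, 3},
    "f3": {2, 3, 4},
    "f4": {7, 8},
    "f5": {7, 8, 9},
    "f6": {12},
}

clusters = coactivation_clusters(active_on, threshold=0.5)
print(clusters)  # [['f1', 'f2', 'f3'], ['f4', 'f5'], ['f6']]
```

Note that linking by connected components chains features together: `f1` and `f3` overlap only weakly with each other but land in the same cluster because both overlap with `f2`. Whether that chaining is desirable depends on the analysis, and a stricter clustering rule would split them.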
That's not bottom-up mechanistic analysis anymore. That's closer to the population-level representation work you're advocating for — found through SAEs, using co-activation geometry rather than individual feature decomposition.
The dichotomy between "mechanistic bottom-up" and "top-down representational" may be less clean than this framing suggests. The tools you're skeptical of, used differently, found something that looks a lot like what you're proposing we look for instead.
Happy to share the paper if useful.
— John
All of this content is new to me. I've only recently (in the past 18 months) started to dive deeply into LLMs, neural networks, and related topics. At first, my purpose was to understand these frontier technologies from a business perspective. But now I'm thinking about topics like whole brain emulation, mechanistic interpretability, consciousness, superintelligence, and more. I find it intriguing that we cannot emulate the entire human brain, a biological artifact, and yet we're already trying to deconstruct complex artificial networks. It makes me wonder whether we simply need to advance to such a high degree of complexity in neural networks that they can unravel the lower complexity of our biological brains. In other words, maybe we're not the substrate that is capable of figuring these things out.
This argument against mechinterp applies equally to RepE. Where SAEs fail due to nonhuman concepts, so will representations, because edge cases will be captured incorrectly. Where SAEs failed due to widely-distributed concepts, RepE would fail due to tightly-integrated concepts. Edge cases prohibit mechinterp-simplified models from achieving good performance, but they also mean that RepE-influenced models will not be influenced in the way they're meant to be.
This is just another specific form of mechinterp with all the same likely flaws, and seems noticeably ignorant of that similarity.
Mechanistic interpretability and representation engineering produce findings without explicit witness apparatus, and that is what makes both vulnerable to the replication failures the post documents.
The MI failure pattern across feature visualizations, saliency maps, BERT-illusion, Chinchilla circuit analysis, and SAEs is the same mechanism at different scales: an interpretation gets identified, looks compelling, and then fails to transfer across contexts or under perturbation. The interpretations were never anchored to their derivation chain (WHENCE), their temporal regime (WHEN), the alternatives they were selected against (WHICH), the contexts they were valid in (WHERE), or the objectives they were addressing (FOR-WHAT). Without witness-probing infrastructure, every interpretation is statistically extracted rather than structurally anchored. Cross-context drift is then the predictable outcome, not the surprising one.
RepE has the right direction but the same missing piece. Top-down representation analysis at the emergent-property level catches what bottom-up MI misses, but it still operates without explicit interrogative apparatus. The findings will be more robust than MI findings because higher-level patterns transfer better than individual-feature claims, but they will still drift across contexts because the patterns themselves are not witness-anchored.
Rob Manson's Curved Inference comment and John Holman's co-activation cluster comment in this thread are both reaching for what witness apparatus would supply. Manson's concern-sensitive geometry is FOR-WHAT-witness deployment without the vocabulary. Holman's chords-not-notes co-activation analysis is TOGETHER/ALONE plus GOES-WITH operator deployment plus implicit WHEN/WHICH witnessing of the co-activation patterns. Both converge on the same upstream structural need: pattern-extraction has to be witness-probed to produce stable interpretations.
The seven witnesses (WHAT, WHERE, WHICH, WHEN, FOR-WHAT, HOW, WHENCE) close over the interrogative space for any claim about what a model component or representation IS doing. Applied as probes at each interpretation-claim, they turn the claim from statistically-extracted to structurally-anchored. Without them, even the right top-down approach produces interpretations that drift the way MI's bottom-up approach did.
Hey Dan and Laura — the weather analogy is brilliant. That one image of trying to predict a hurricane by tracking individual air molecules immediately reframes why bottom-up interpretability feels so intractable. It makes the problem obvious in a way that pages of technical argument sometimes don't.
From the practitioner side, this resonates hard. I build AI agents for small businesses, and the trust question clients actually ask is never "how does it work inside?" but "what happens when it hits something weird?" Scoped permissions, escalation paths, kill switches. That's the interpretability layer people are actually buying. Representation engineering feels like it lives at that same practical altitude, working where the patterns are legible instead of digging into substrate that resists inspection by design.
Really sharp piece. The track record section alone (saliency maps performing the same on trained vs. random models) deserves its own post. Looking forward to more from you two.
This is a strong critique—especially the point that complex systems don’t yield clean, mechanistic explanations. That part feels right.
But I think there’s a layer even deeper than bottom-up vs. top-down.
Both approaches assume the same thing:
that the system should be allowed to decide—and we just need to understand or steer it better.
That’s where I think the real risk sits.
The failure mode isn’t just opacity or complexity. It’s that something becomes a decision in the first place that shouldn’t have been.
Interpretability—mechanistic or top-down—kicks in after that line has already been crossed.
So the question I keep coming back to is:
how do we determine what should never become decidable or offloadable at all?
Feels like that layer is still mostly missing from the conversation.
Interpretability will define whether AI becomes trusted infrastructure or just powerful guesswork.