The Misguided Quest for Mechanistic AI…

May 15

Despite years of effort, mechanistic interpretability has failed to provide insight into AI behavior — the result of a flawed foundational assumption.


1 Comment

Rob Manson

May 22

Thanks for your thoughtful and timely piece. I think a lot of people share your scepticism towards mechanistic interpretability. But perhaps we should also interrogate the assumptions baked into "how" we seek interpretability in general - especially when it comes to the linearity, locality, and universality we often take for granted.

Your work on Representation Engineering (RepE) assumes that concepts can be extrinsically defined, linearly encoded, and universally extracted through tools like LAT. It's a clever and pragmatic approach, but one that still operates on the underlying assumption that model representations are largely flat and linearly navigable.

As an alternative, I'd like to introduce a recent paper I've been working on, "Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams" (see https://robman.fyi/files/FRESH-Curved-Inference-in-LLMs-PIR-latest.pdf).

Curved Inference steps away from the mechanistic search for causal circuits or neuron-level explanations, and instead models inference "geometrically", as a trajectory through semantic space that can bend, diverge, and realign depending on the concern being addressed by the model. In other words, instead of asking "what a neuron does", we ask "how the model navigates residual space during inference" (see Appendix A in particular).

In this view, concepts are not necessarily stable, linear vectors, but concern-sensitive flows measured using clearly defined metrics (e.g. curvature and salience - mathematically defined in the paper). This doesn't preclude interpretability - it reframes it. We can still talk about "truth" or "power" or "bias", but we might find that these concepts only stabilise locally, or manifest as changes in curvature rather than directions in space.
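To make "changes in curvature rather than directions in space" a little more concrete, here is a minimal sketch of one generic discrete proxy for curvature: the turning angle between successive layer-to-layer steps of a single token's hidden state. This is an illustrative assumption, not the metric defined in the paper; the function name `turning_angles` and the `(n_layers, d_model)` input layout are hypothetical choices made for the example.

```python
import numpy as np

def turning_angles(trajectory):
    """Turning angle (in radians) at each interior point of a trajectory.

    trajectory: array of shape (n_layers, d_model), one hidden-state
    vector per layer for a single token (an assumed layout).
    NOTE: a generic discrete-curvature proxy, not the paper's definition.
    """
    traj = np.asarray(trajectory, dtype=float)
    steps = np.diff(traj, axis=0)                  # layer-to-layer displacements
    norms = np.linalg.norm(steps, axis=1)
    dots = np.einsum("ij,ij->i", steps[:-1], steps[1:])
    denom = norms[:-1] * norms[1:]
    # Treat zero-length steps as "no turn" (cosine = 1) to avoid dividing by zero.
    cos = np.divide(dots, denom, out=np.ones_like(dots), where=denom > 0)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# A path that moves in a constant direction has (near) zero turning angle.
straight = np.outer(np.arange(5), np.ones(3))
# A path that makes a single right-angle turn has one angle of pi/2.
bent = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

print(turning_angles(straight))
print(turning_angles(bent))
```

Under this proxy, a concept that behaves like a fixed direction would show up as a nearly straight path, while a "concern-sensitive" shift would show up as a spike in the turning angle at particular layers.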

Curved Inference challenges the assumption that a model's representations are globally flat or linearly navigable. It suggests that this assumption may break down exactly where interpretability is most needed - in contexts of ambiguity, bias, multi-agent reasoning, and conflict. In these high-curvature regions, tools like LAT may falter or mischaracterise what the model is doing, because the geometry of concern shifts too quickly to be captured by a fixed global vector. RepE may still work well in low-curvature areas, but without accounting for the underlying geometry, it risks overgeneralising or missing deeper structure.

Crucially, this isn't a rejection of interpretability - it's a shift from "mechanistic semantics" to "intrinsic inference geometry". The hope is that by attending to how the model's behaviour unfolds dynamically, we can make better sense of its internal organisation. This deeper landscape may reveal new kinds of structure, and new paths to responsible alignment.

I'd love to hear your thoughts and feedback.
