Discussion about this post

Rob Manson

Thanks for your thoughtful and timely piece. I think a lot of people share your scepticism towards mechanistic interpretability. But perhaps we should also interrogate the assumptions baked into "how" we seek interpretability in general - especially when it comes to the linearity, locality, and universality we often take for granted.

Your work on Representation Engineering (RepE) assumes that concepts can be extrinsically defined, linearly encoded, and universally extracted through tools like LAT. It's a clever and pragmatic approach, but one that still operates on the underlying assumption that model representations are largely flat and linearly navigable.

As an alternative, I'd like to introduce a recent paper I've been working on, "Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams" (see https://robman.fyi/files/FRESH-Curved-Inference-in-LLMs-PIR-latest.pdf).

Curved Inference steps away from the mechanistic search for causal circuits or neuron-level explanations, and instead models inference "geometrically", as a trajectory through semantic space that can bend, diverge, and realign depending on the concern being addressed by the model. In other words, instead of asking "what does a neuron do?", we ask "how does the model navigate residual space during inference?" (see Appendix A in particular).

In this view, concepts are not necessarily stable, linear vectors, but concern-sensitive flows measured using clearly defined metrics (e.g. curvature and salience - mathematically defined in the paper). This doesn't preclude interpretability - it reframes it. We can still talk about "truth" or "power" or "bias", but we might find that these concepts only stabilise locally, or manifest as changes in curvature rather than directions in space.
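As a rough illustration of what a trajectory "bending" could look like numerically, here is a minimal Python sketch. It is not the curvature metric defined in the paper. It assumes a simplified proxy: the turning angle between successive layer-to-layer steps of a residual-stream trajectory. The function name `turning_angles` and the input layout (one row per layer) are hypothetical choices made for this example.

```python
import numpy as np

def turning_angles(trajectory):
    """Angle in radians between consecutive steps of a trajectory.

    trajectory: array of shape (n_layers, d_model), one residual-stream
    vector per layer for a single token position (assumed layout).
    Returns an array of length n_layers - 2.
    """
    traj = np.asarray(trajectory, dtype=float)
    steps = np.diff(traj, axis=0)              # layer-to-layer displacement
    norms = np.linalg.norm(steps, axis=1)
    dots = np.sum(steps[:-1] * steps[1:], axis=1)
    denom = norms[:-1] * norms[1:]
    # Treat zero-length steps as "no turn" (cosine of 1) to avoid dividing by zero.
    cos = np.divide(dots, denom, out=np.ones_like(dots), where=denom > 0)
    return np.arccos(np.clip(cos, -1.0, 1.0))

# A straight path has no turning; a right-angle bend turns by pi/2.
straight = np.outer(np.arange(5), np.ones(3))
bent = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
straight_angles = turning_angles(straight)
bent_angles = turning_angles(bent)
```

Under this proxy, a "flat" region would show turning angles near zero across layers, while a high-curvature region would show larger angles. A fixed global direction is a better summary in the first case than in the second.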

Curved Inference challenges the assumption that a model's representations are globally flat or linearly navigable. It suggests that this assumption may break down exactly where interpretability is most needed - in contexts of ambiguity, bias, multi-agent reasoning, and conflict. In these high-curvature regions, tools like LAT may falter or mischaracterise what the model is doing, because the geometry of concern shifts too quickly to be captured by a fixed global vector. RepE may still work well in low-curvature areas, but without accounting for the underlying geometry, it risks overgeneralising or missing deeper structure.

Crucially, this isn't a rejection of interpretability - it's a shift from "mechanistic semantics" to "intrinsic inference geometry". The hope is that by attending to how the model's behaviour unfolds dynamically, we can make better sense of its internal organisation. This deeper landscape may reveal new kinds of structure, and new paths to responsible alignment.

I'd love to hear your thoughts and feedback.

John Holman

This is a sharp critique and the historical record you've assembled is fair — the list of techniques that generated excitement and didn't replicate is real and worth taking seriously.

But I'd push back on one thing, because we just published data that lands directly in this conversation.

We ran a mechanistic interpretability study on a 3B Llama fine-tuned on Marcus Aurelius's Meditations. And you're right about individual features — 13 of 15 LoRA-specific features at Layer 22 were causally inert. Beautiful logit lens projections, perfect specificity, barely moved predictions when removed. The "find the feature for X" approach failed exactly the way you describe.

But then we did something different. Instead of asking what each feature means individually, we asked which features consistently co-activate together and what passages they fire on jointly. Five clusters emerged. Three exceeded the best single-feature causal signal by factors of 2-7x. And they were readable — not because features got cleaner, but because we changed the unit of analysis from notes to chords.
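To show the general shape of that step, here is a minimal Python sketch of grouping features by co-activation. It is not the pipeline used in the study. It assumes a matrix of feature activations per passage, links features whose activations correlate above a threshold, and takes connected components as clusters. The function name `coactivation_clusters` and the `threshold` value are hypothetical.

```python
import numpy as np

def coactivation_clusters(acts, threshold=0.5):
    """Group features whose activations correlate above `threshold`.

    acts: array of shape (n_passages, n_features), with n_features >= 2.
    Returns a sorted list of clusters, each a list of feature indices.
    """
    acts = np.asarray(acts, dtype=float)
    n = acts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.corrcoef(acts, rowvar=False)
    corr = np.nan_to_num(corr)      # constant features yield NaN; treat as unlinked
    linked = corr > threshold

    parent = list(range(n))         # union-find over feature indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if linked[i, j]:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy data: features 0 and 1 fire together, as do features 2 and 3.
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])
clusters = coactivation_clusters(toy)
```

Each resulting cluster could then be ablated as a unit and compared against single-feature ablations, which is the kind of comparison described above.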

That's not bottom-up mechanistic analysis anymore. That's closer to the population-level representation work you're advocating for — found through SAEs, using co-activation geometry rather than individual feature decomposition.

The dichotomy between "mechanistic bottom-up" and "top-down representational" may be less clean than this framing suggests. The tools you're skeptical of, used differently, found something that looks a lot like what you're proposing we look for instead.

Happy to share the paper if useful.

— John
