Thoughts AI mechinterp mechanistic interpretability

" [Tracing the thoughts of a large language model \ Anthropic](https://www.anthropic.com/research/tracing-thoughts-language-model) [Circuit Tracing: Revealing Computational Graphs in Language Models](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) https://x.com/burny_tech/status/1906520233139138704 Thoughts on reverse engineering Claude "On the Biology of a Large Language Model" by Anthropic paper: The evidence for emergent multistep feature composition across different levels of abstraction is interesting and goes against a lot of "only shortcut learning happens" hypotheses This is the wildest circuit that does two digit addition It's not only memorizing, it's also generalizing, but in stupidly complex fuzzy circuit ways This explains why it can calculate arithmetic outside of it's training data but also screws up so often Also this convoluted addition circuit could be in part failure of their reverse engineering method using transcoders and attribution graphs, and Claude can be using different circuit internally in that space with circuits in superpositions, since for me LLMs mostly fail at numbers with bigger digits, not on just two digit additions I wanna see this analysis much more for math Features in universal abstract space independent of language are mindblowing tbh And here's "I don't know circuit" that prevents some hallucinations, but not all hallucinations 😄 https://fxtwitter.com/jowettbrendan/status/1905907881876377662?t=ypItX9TxXc4vwcwfkjdBfw&s=19 " Since circuits inside LLMs like Claude 3.5 Haiku are different than what LLMs say when asked to explain how they got to an answer, I suspect the same holds for the models in the new reasoning paradigm. Someone needs to do graph attribution mechanistic interpretability on the reasoning models as well! I bet Anthropic will release something like that soon. 
(context: [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html))

I want to see more physics in mechanistic interpretability, the field that reverse engineers the learned emergent circuits in neural networks. What is the physics of the formation, self-organization, and activation (dynamics) of all these features and circuits, in learning and inference? [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)

" How LLM works: Another possible partial translation just occurred to me :smile:

When you're learning, imagine a valley in Terraria, where you run around in two dimensions (in 2D), trying to get to the truth, which is located at the lowest point. You can take a step down toward the truth every time you get better at copying math exams, or even solve the problems correctly yourself without seeing the solution steps! But beware, you may think you're in the very lowest valley when in fact there is an even lower valley elsewhere!

But two dimensions are pretty trivial, right? So let's increase the dimensions, let's go to 3D, to Minecraft. That's a little bit harder: you can find points that are lowest along one direction but not along the other, so-called saddle points, or the very lowest valley in both directions! But somewhere in the distance there may still be an even lower valley. Sometimes the structure of the valleys is bumpier, sometimes flatter, sometimes they have a similar structure in one place, or some pattern occurs across all the valleys, with various symmetries. Beautiful, isn't it? But 3D is still trivial.

Now imagine walking in 4D! 5D! millionD! trillionD! There you have an extremely, insanely complex geometry and overall valley structure, which grows with each dimension, but you still manage to go down toward the truth. You probably won't find the lowest point in so many dimensions, but you still manage to go further and further down toward the truth. Often a billion directions may lead up but 2 billion directions lead down, so that's where you step.
To be able to solve those problems, you built up some structure of the truth along the way, so that you'd know how to solve them more and more accurately. You memorized some things, like the number 5, and abstracted others, like numbers ending in 9. And you folded a kind of elastic origami made of lots of tangled spaghetti that determines roughly how to get to the truth, for example that you first add the tens digits and then the ones digits, which you build based on what you've already seen. And you can untangle the spaghetti a bit where too many concepts and circuits are intertwined, and compose the individual circuits together, but not too much, otherwise it simply falls apart.

When someone asks you another math problem, you run it through those spaghetti circuits. But you ignored tech debt and didn't make the right circuits robust enough, if you even came across the best possible ones in that trillion-dimensional space, which often you probably didn't quite; often you instead found some insufficiently general shortcut. You didn't generalize enough, repair enough, or clean up enough, so it works only now and then, not consistently enough, but still you sometimes get it right! At the same time, in order to hit the mark sometimes, you'd rather be wrong more often, as the price of sometimes being right.

Along the way you find it interesting that, for example, teaching the spaghetti to speak our language is easier than you expected! And sometimes you hit a total bingo and find a result that the little monkeys who created you hadn't found before you, like new results in math, or better strategies in chess, or a new drug. Or you help fold proteins better than other, less plastic algorithms. But sometimes they want you to write a simple function, which you surely should manage given that you manage so many other things, but because the spaghetti is sometimes terribly tangled, unstable, full of unexpected holes, insufficiently generalizing shortcuts, missing or miscategorized facts, etc., it sometimes melts on you along the way.
" " How the LLM works: When you are learning, imagine you're playing Terraria, where you are walking around in two dimensions (in 2D), trying to get to the truth, which is located at the lowest point in the whole environment. You can take a step down to the direction of truth every time you can copy math exams better in a math valley, or even solve the examples correctly yourself without seeing the solution procedures! But beware, it may be that you think you are at the very bottom of the environment, but in fact there is an even lower valley elsewhere than the one you're currently in! This is gradient descent over parameter space and finding local minima. Copying math exams is supervised finetuning, and solving math without knowing steps and solution is reinforcement learning algorithms like GRPO. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [[2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948) GRPO Explained: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://www.youtube.com/watch?v=bAWV_yrqx4w](https://www.youtube.com/watch?v=bAWV_yrqx4w) But two dimensions are quite trivial, aren't they? So let's increase the dimensions, let's go 3D, Minecraft. That's a little bit more challenging! You can find points that are lowest in one direction, so-called saddle points, or the very lowest valley in both directions! But there may still be a lower valley somewhere else in the whole world though. This is increasing the number of parameters. Sometimes the structure of the valleys is more bumpy, sometimes more flat, sometimes they have some similar structures at one place, or there is a pattern all over the valleys, with different symmetries. Beautiful, isn't it? But 3D is still trivial. This is the geometry of the loss landscape. 
[[2105.12221] Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances](https://arxiv.org/abs/2105.12221)

Now imagine walking around in 4D! 5D! millionD! trillionD! There you have extremely, insanely complex geometry and overall valley structure, which grows with each dimension, but you still manage to go down toward the truth. You probably can't find the lowest point in so many dimensions, but you still manage to go further and further down toward the truth. Often a billion directions lead up and 2 billion directions lead down, so you step down one of those.

This stands for modern models having billions, or even trillions, of parameters.

In order to be able to solve the problems, you created some structure of the truth along the way, so that you know how to solve them more and more accurately. You memorized some things, like the number 5, and abstracted others, like numbers ending in 9. And you were folding a kind of elastic origami made of a bunch of tangled spaghetti that determines how to get to the truth, like adding the tens first and then the ones, which you form based on what you've already seen. And you can untangle the spaghetti a little where too many concepts and circuits are intertwined, and put the individual circuits together, but not too much, otherwise it just falls apart.

This stands for learned emergent features forming circuits, shown as attribution graphs, that mechanistic interpretability attempts to reverse engineer in frontier models, such as in the Biology of LLMs paper. [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) [https://www.youtube.com/watch?v=mU3g2YPKlsA](https://www.youtube.com/watch?v=mU3g2YPKlsA) [https://www.youtube.com/watch?v=64lXQP6cs5M](https://www.youtube.com/watch?v=64lXQP6cs5M)

And elastic origami stands for the spline theory of deep learning.
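The "elastic origami" image can be made concrete in a toy case: a network built from ReLU units computes a continuous piecewise-linear function, a line folded at a finite set of kinks. Below is a minimal sketch, assuming a hypothetical three-unit network with hand-picked weights (`HIDDEN`, `tiny_relu_net`, and `find_kinks` are names invented for this illustration). It locates the folds by checking where the slope changes on a grid.

```python
# A hypothetical 1-input, 3-hidden-unit ReLU network with hand-picked weights.
# Each hidden unit is (input weight w, bias b, output weight v).
HIDDEN = [(1.0, 1.0, 1.0), (1.0, 0.0, -2.0), (1.0, -1.0, 1.0)]

def relu(z):
    return max(0.0, z)

def tiny_relu_net(x):
    # Weighted sum of ReLU units: a continuous piecewise-linear function of x.
    return sum(v * relu(w * x + b) for w, b, v in HIDDEN)

def find_kinks(f, lo=-2.0, hi=2.0, step=0.25):
    # Slopes between neighbouring grid points; a kink is where the slope changes.
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    slopes = [(f(xs[i + 1]) - f(xs[i])) / step for i in range(n)]
    kinks = [xs[i + 1] for i in range(n - 1)
             if abs(slopes[i + 1] - slopes[i]) > 1e-9]
    return kinks, slopes

kinks, slopes = find_kinks(tiny_relu_net)
print("folds at x =", kinks)
print("distinct slopes:", sorted(set(slopes)))
```

Each hidden unit contributes one fold, at the input where its pre-activation crosses zero. With these weights the result is a single "tent" between -1 and 1. Larger networks fold the input space into many more linear pieces in the same way.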
[https://www.youtube.com/watch?v=l3O2J3LMxqI](https://www.youtube.com/watch?v=l3O2J3LMxqI)

If someone asks you for another math problem, you'll run it through those spaghetti circuits. But you didn't care about tech debt and didn't make the right circuits simple yet still predictive, that is, compressive enough, assuming you even came across the best possible ones in that trillion-dimensional space. Often you instead found some insufficiently general shortcut, and you didn't generalize enough, repair enough, or clean up enough. So it doesn't work consistently enough, but still, sometimes, and pretty often, you get it right! At the same time, in order to get it right sometimes, you'd rather get it wrong more often, as the price of getting it right sometimes.

This stands for often brittle reasoning, shortcut learning, and a higher false positive rate, that is, hallucinations.

Along the way, you'll find it interesting that, for example, teaching the spaghetti to speak our natural language is easier than you expected! And sometimes you hit total bingo and find a result that the monkeys who created you didn't figure out on their own, like new results in math, or a better strategy in chess, or a new drug. Or you help fold proteins better than other, less plastic optimization algorithms. But sometimes you're asked to create a simple function, which you should be able to do given that you can do a lot of other things, but because the spaghetti is sometimes terribly convoluted, unstable, full of unexpected holes, poorly generalizing shortcuts, missing or misclassified facts, etc., the spaghetti sometimes melts along the way when solving a problem.

AlphaZero discovered new chess concepts, which were then taught to chess grandmasters. [[2310.16410] Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero](https://arxiv.org/abs/2310.16410)

AlphaEvolve found new results in mathematics.
[AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms - Google DeepMind](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)

Robin found a new drug candidate. [Demonstrating end-to-end scientific discovery with Robin: a multi-agent system | FutureHouse](https://www.futurehouse.org/research-announcements/demonstrating-end-to-end-scientific-discovery-with-robin-a-multi-agent-system)

AlphaFold predicted the folded structures of tons of proteins. [Google DeepMind and Isomorphic Labs introduce AlphaFold 3 AI model](https://blog.google/technology/ai/google-deepmind-isomorphic-alphafold-3-ai-model/) "