- [GitHub - ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models: This repository collects all relevant resources about interpretability in LLMs](https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models)
- [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
- ARENA: sparse autoencoders, activation patching
- [GitHub - ai-safety-foundation/sparse_autoencoder: Sparse Autoencoder for Mechanistic Interpretability](https://github.com/ai-safety-foundation/sparse_autoencoder)
- [GitHub - open-thought/system-2-research: System 2 Reasoning Link Collection](https://github.com/open-thought/system-2-research)
- [GitHub - lrnzgiusti/awesome-topological-deep-learning: A curated list of topological deep learning (TDL) resources and links.](https://github.com/lrnzgiusti/awesome-topological-deep-learning)
- [[2311.02462] Levels of AGI for Operationalizing Progress on the Path to AGI](https://arxiv.org/abs/2311.02462)
- [GitHub - jbloomAus/SAELens: Training Sparse Autoencoders on Language Models](https://github.com/jbloomAus/SAELens)
- [My best guess at the important tricks for training 1L SAEs — LessWrong](https://www.lesswrong.com/posts/fifPCos6ddsmJYahD/my-best-guess-at-the-important-tricks-for-training-1l-saes) (minimal SAE sketch after this list)
- [GitHub - ai-safety-foundation/sparse_autoencoder: Sparse Autoencoder for Mechanistic Interpretability](https://github.com/ai-safety-foundation/sparse_autoencoder/tree/main)
- [sparse_autoencoder/sparse_autoencoder/train/pipeline.py at main · ai-safety-foundation/sparse_autoencoder · GitHub](https://github.com/ai-safety-foundation/sparse_autoencoder/blob/main/sparse_autoencoder/train/pipeline.py)
- [Sparse Autoencoder](https://ai-safety-foundation.github.io/sparse_autoencoder/)
- [GitHub - EleutherAI/sparsify: Sparsify transformers with SAEs and transcoders](https://github.com/EleutherAI/sae)
- [LLMs develop their own understanding of reality as their language abilities improve | MIT News | Massachusetts Institute of Technology](https://news.mit.edu/2024/llms-develop-own-understanding-of-reality-as-language-abilities-improve-0814)
- [Emergent Representations of Program Semantics in Language Models Trained on Programs | OpenReview](https://openreview.net/forum?id=8PTx4CpNoT&referrer=%5Bthe%20profile%20of%20Charles%20Jin%5D(%2Fprofile%3Fid%3D~Charles_Jin1))
- ML for Good, mech interp, Camlab
- https://www.anthropic.com/research/evaluating-feature-steering
- I see hope in superintelligence alignment using mechanistic interpretability and weak-to-strong generalization.
- Mechanistic interpretability: [An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 — AI Alignment Forum](https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite-1)
- Weak-to-strong generalization: [Weak-to-strong generalization | Semantic Scholar](https://www.semanticscholar.org/search?q=Weak-to-strong%20generalization&sort=relevance)
- Singular learning theory: [Timaeus / devinterp projects](https://devinterp.com/projects) (redirects to https://timaeus.co/projects)
- [[2305.11169] Emergent Representations of Program Semantics in Language Models Trained on Programs](https://arxiv.org/abs/2305.11169)
- Neel Nanda: [An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 — AI Alignment Forum](https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite-1)
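The SAE links above (SAELens, ai-safety-foundation/sparse_autoencoder, the 1L-SAE tricks post) all revolve around the same object. A minimal sketch of a vanilla L1-penalized sparse autoencoder over cached residual-stream activations, in PyTorch with made-up dimensions (not any particular repo's API):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: overcomplete dictionary + L1 sparsity penalty.
    Dimensions are hypothetical; real configs live in the linked repos."""
    def __init__(self, d_model: int = 512, d_sae: int = 4096, l1_coeff: float = 1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # acts: [batch, d_model] residual-stream (or MLP) activations
        f = torch.relu(self.enc(acts))           # sparse feature activations
        recon = self.dec(f)                      # reconstruction of the activations
        recon_loss = (recon - acts).pow(2).mean()
        sparsity_loss = self.l1_coeff * f.abs().sum(dim=-1).mean()
        return recon, f, recon_loss + sparsity_loss

# usage sketch on stand-in activations
sae = SparseAutoencoder()
acts = torch.randn(32, 512)
_, features, loss = sae(acts)
loss.backward()
```

The linked libraries layer extra tricks on top of this skeleton (dead-feature resampling, decoder-norm constraints, and the other tips from the 1L-SAE post).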
- Mech interp practical review: [[2407.02646] A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models](https://arxiv.org/abs/2407.02646)
- [GitHub - Dakingrai/awesome-mechanistic-interpretability-lm-papers](https://github.com/Dakingrai/awesome-mechanistic-interpretability-lm-papers)
- [[2404.14082v1] Mechanistic Interpretability for AI Safety -- A Review](https://arxiv.org/abs/2404.14082v1)
- Topological lens on generalization in LLMs: [[2407.08723] Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms](https://arxiv.org/abs/2407.08723) https://x.com/tolga_birdal/status/1811717390813581747?t=bhwCAOkjA9nlyKGoMCwxcg&s=19
- Tomography from DeepMind: reverse engineering AlphaZero, finding chess knowledge, [https://youtu.be/dCkQQYwPxdM?si=BgBydqvy5rS0cOh2](https://youtu.be/dCkQQYwPxdM?si=BgBydqvy5rS0cOh2) at 16:40
- Accelerating grokking: [[2405.20233] Grokfast: Accelerated Grokking by Amplifying Slow Gradients](https://arxiv.org/abs/2405.20233) (EMA-gradient sketch after this list)
- Mech interp: integrated gradients, saliency maps, and attributions, at 19:40 in [https://youtu.be/dCkQQYwPxdM?si=65PtPB35kzxV6ZxF](https://youtu.be/dCkQQYwPxdM?si=65PtPB35kzxV6ZxF) (attribution sketch after this list)
- [[2407.09468] Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures](https://www.arxiv.org/abs/2407.09468)
- Sorry tokens: [[2407.09121] Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training](https://arxiv.org/abs/2407.09121)
- IOI (Kevin Wang, Alexandre Variengien)
- ORION: look before you leap (Alexandre Variengien)
- Physics of Language Models 3.1: CFGs (Zeyuan Allen-Zhu)
- Neuron to graph: [[2305.19911] Neuron to Graph: Interpreting Language Model Neurons at Scale](https://arxiv.org/abs/2305.19911)
- Sparse feature circuits: [[2403.19647] Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://arxiv.org/abs/2403.19647)
- [GitHub - saprmarks/dictionary_learning](https://github.com/saprmarks/dictionary_learning)
- [GitHub - HoagyC/sparse_coding: Using sparse coding to find distributed representations used by neural networks.](https://github.com/HoagyC/sparse_coding)
- Towards transparent AI, survey 2022: [[2207.13243] Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks](https://arxiv.org/abs/2207.13243)
- Modular addition grokking, multiple algorithms: [[2306.17844] The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks](https://arxiv.org/abs/2306.17844)
- 11 technical proposals for safe AI: [[2012.07532] An overview of 11 proposals for building safe advanced AI](https://arxiv.org/abs/2012.07532)
- LLM critics help catch LLM bugs: [[2407.00215] LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)
- Eliciting latent knowledge
- Theories of impact for mech interp: [A Longlist of Theories of Impact for Interpretability — LessWrong](https://www.lesswrong.com/posts/uK6sQCNMw8WKzJeCQ/a-longlist-of-theories-of-impact-for-interpretability)
- Transformer memory editing: [[2210.07229] Mass-Editing Memory in a Transformer](https://arxiv.org/abs/2210.07229)
- RASP: [[2310.16028] What Algorithms can Transformers Learn? A Study in Length Generalization](https://arxiv.org/abs/2310.16028)
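On the Grokfast item above: as I read the paper, the trick is to low-pass filter each parameter's gradient with an exponential moving average and add a scaled copy of that slow component back to the raw gradient before the optimizer step. A hedged sketch (the names `alpha` and `lam` are mine, not the paper's exact API):

```python
import torch

def grokfast_ema_step(model, ema_grads, alpha=0.98, lam=2.0):
    """Amplify the slow (EMA) component of gradients, in the spirit of
    Grokfast (arXiv:2405.20233). Sketch only, not the reference code."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        ema = ema_grads.get(name)
        # running EMA of the raw gradient (the "slow" component)
        ema = p.grad.detach().clone() if ema is None else alpha * ema + (1 - alpha) * p.grad
        ema_grads[name] = ema
        # g_hat = g + lambda * EMA(g): boost the slow direction
        p.grad = p.grad + lam * ema
    return ema_grads
```

Typical use: keep `ema_grads = {}` across training steps and call this between `loss.backward()` and `optimizer.step()`.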
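And for the integrated-gradients note: the attribution of input feature i is (x_i - baseline_i) times the average gradient along the straight path from a baseline to x, usually approximated with a Riemann sum. A generic sketch for any differentiable model `f` with a scalar output per example (function and argument names are illustrative):

```python
import torch

def integrated_gradients(f, x, baseline=None, steps=64):
    """Riemann-sum approximation of integrated gradients for a scalar-output model f."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)        # interpolated inputs, shape [steps, *x.shape]
    path.requires_grad_(True)
    grads = torch.autograd.grad(f(path).sum(), path)[0]
    # attribution per input feature: (x - baseline) * average path gradient
    return (x - baseline) * grads.mean(dim=0)
```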
- [[1611.03530] Understanding deep learning requires rethinking generalization](https://arxiv.org/abs/1611.03530)
- [Just Ask for Generalization | Eric Jang](https://evjang.com/2021/10/23/generalization.html)
- [[1703.04933] Sharp Minima Can Generalize For Deep Nets](https://arxiv.org/abs/1703.04933)
- [Understanding “Deep Double Descent” — AI Alignment Forum](https://www.alignmentforum.org/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent)
- [[1805.08522] Deep learning generalizes because the parameter-function map is biased towards simple functions](https://arxiv.org/abs/1805.08522)
- https://towardsdatascience.com/why-more-is-more-in-deep-learning-b28d7cedc9f5
- [[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization](https://arxiv.org/abs/2405.15071)
- https://arxiv.org/abs/1911.01547 (On the Measure of Intelligence)
- https://openai.com/index/prover-verifier-games-improve-legibility/
- Gray Swan AI, aligned LLMs: https://x.com/GraySwanAI/status/1813638794232406061?t=zYlku9xsJGwWp5mIsChezg&s=19
- [[2406.19501] Monitoring Latent World States in Language Models with Propositional Probes](https://arxiv.org/abs/2406.19501)
- Representing the truth in LLMs
- Vision multi-attacks (Fort): [[2308.03792] Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels](https://arxiv.org/abs/2308.03792)
- [Neuronpedia](https://www.neuronpedia.org/)
- https://www.anthropic.com/research/influence-functions
- AI deception survey: https://www.cell.com/patterns/fulltext/S2666-3899%2824%2900103-X?utm_source=perplexity
- [Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers — LessWrong](https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/towards-multimodal-interpretability-learning-sparse-2)
- [I found >800 orthogonal "write code" steering vectors — LessWrong](https://www.lesswrong.com/posts/CbSEZSpjdpnvBcEvc/i-found-greater-than-800-orthogonal-write-code-steering?fbclid=IwZXh0bgNhZW0CMTEAAR1vDqfaQmobNGu8fr8wsUocHXr4Z7-G3iXMf_I2dItSIX-OYS6qaNmbTJM_aem_w8Ncb1VlRy4Xm_H_-b1aaQ)
- Scalable oversight
- Research Agenda for Sociotechnical Approaches to AI Safety: [https://static1.squarespace.com/static/6086fb0cbf366f6273c435e5/t/66218dc6d387d41957835396/1713475015219/Research_Agenda_for_Sociotechnical_Approaches_to_AI_Safety.pdf](https://static1.squarespace.com/static/6086fb0cbf366f6273c435e5/t/66218dc6d387d41957835396/1713475015219/Research_Agenda_for_Sociotechnical_Approaches_to_AI_Safety.pdf)
- Eliciting latent knowledge; Stuart Russell; AI formal verification
- [BatchTopK: A Simple Improvement for TopK-SAEs — AI Alignment Forum](https://www.alignmentforum.org/posts/Nkx6yWZNbAsfvic98/batchtopk-a-simple-improvement-for-topk-saes) (activation sketch after this list)
- [[2407.11969] Does Refusal Training in LLMs Generalize to the Past Tense?](https://arxiv.org/abs/2407.11969)
- Emergence is not real: [[2406.04391] Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?](https://arxiv.org/abs/2406.04391) [Emergent abilities and grokking: Fundamental, Mirage, or both? – Windows On Theory](https://windowsontheory.org/2023/12/22/emergent-abilities-and-grokking-fundamental-mirage-or-both/)
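On the BatchTopK post above: instead of keeping the top-k feature pre-activations per example, the idea (as I understand the post) is to keep the k × batch_size largest pre-activations across the whole batch, so sparsity averages k per example but can vary between examples. A hedged sketch of the activation function alone:

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """BatchTopK activation (sketch): keep the k * batch_size largest
    pre-activations across the whole batch, zero everything else."""
    batch, d_sae = pre_acts.shape
    n_keep = k * batch
    threshold = pre_acts.flatten().topk(n_keep).values.min()   # smallest kept value
    return torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
```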
- [[2304.15004] Are Emergent Abilities of Large Language Models a Mirage?](https://arxiv.org/abs/2304.15004)
- Truth and lying: https://x.com/sebkrier/status/1814765954217488884?t=D3J1CrU8Jp1B5AmXIkRftQ&s=19 [[2407.12831] Truth is Universal: Robust Detection of Lies in LLMs](https://arxiv.org/abs/2407.12831)
- [https://www.youtube.com/watch?v=y9_QFUma8Fo](https://www.youtube.com/watch?v=y9_QFUma8Fo)
- [The Platonic Representation Hypothesis](https://phillipi.github.io/prh/)
- How are memories stored in neural networks? | The Hopfield Network: [https://www.youtube.com/watch?v=piF6D6CQxUw&t=699s](https://www.youtube.com/watch?v=piF6D6CQxUw&t=699s) (update rule sketched after this list)
- Rational Animations: What Do Neural Networks Really Learn? Exploring the Brain of an AI Model, [https://www.youtube.com/watch?v=jGCvY4gNnA8&t=589s](https://www.youtube.com/watch?v=jGCvY4gNnA8&t=589s)
- LLMs vs brains: [[2311.09308] Divergences between Language Models and Human Brains](https://arxiv.org/abs/2311.09308)
- Friston, collective intelligence: [[2212.01354] Designing Ecosystems of Intelligence from First Principles](https://arxiv.org/abs/2212.01354)
- [Towards Developmental Interpretability — LessWrong](https://www.lesswrong.com/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability)
- Alignment methods survey: [[2407.16216] A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More](https://arxiv.org/abs/2407.16216)
- 100 questions in AI governance: https://x.com/AnkaReuel/status/1815778704616038880?t=U344ThjlHGrk6qpm7zOniA&s=19
- Physics of intelligence: https://x.com/Hidenori8Tanaka/status/1816006446019953015?t=QkP3qsKfMfK_60HHZF_01Q&s=19
- [How I think about LLM prompt engineering](https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering)
- Generalization:
  - [[1703.04933] Sharp Minima Can Generalize For Deep Nets](https://arxiv.org/abs/1703.04933)
  - [[1805.08522] Deep learning generalizes because the parameter-function map is biased towards simple functions](https://arxiv.org/abs/1805.08522)
  - [[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization](https://arxiv.org/abs/2405.15071)
  - [Neural networks generalize because of this one weird trick — AI Alignment Forum](https://www.alignmentforum.org/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick)
  - [[1911.01547] On the Measure of Intelligence](https://arxiv.org/abs/1911.01547)
  - [[1611.03530] Understanding deep learning requires rethinking generalization](https://arxiv.org/abs/1611.03530)
  - [Understanding “Deep Double Descent” — AI Alignment Forum](https://www.alignmentforum.org/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent)
  - [Just Ask for Generalization | Eric Jang](https://evjang.com/2021/10/23/generalization.html)
  - Progress measures for grokking via mechanistic interpretability: reverse-engineering transformers trained on modular addition, which learn an emergent, generalizing trigonometric-function circuit: [[2301.05217] Progress measures for grokking via mechanistic interpretability](https://arxiv.org/abs/2301.05217) (the learned identity is sketched just below)
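On that last item, the reverse-engineered modular-addition circuit: the grokked network represents a and b as sines and cosines at a few key frequencies, composes them with angle-addition identities, and reads off logits that peak exactly at c ≡ a + b (mod p). Roughly, in my paraphrase of the paper's equations:

```latex
% "Clock" circuit for (a+b) mod p at key frequencies \omega_k = 2\pi k / p (sketch)
\cos\big(\omega_k(a+b)\big) = \cos(\omega_k a)\cos(\omega_k b) - \sin(\omega_k a)\sin(\omega_k b)
\sin\big(\omega_k(a+b)\big) = \sin(\omega_k a)\cos(\omega_k b) + \cos(\omega_k a)\sin(\omega_k b)
\mathrm{logit}(c) \propto \sum_{k \in K} \cos\big(\omega_k (a + b - c)\big)
```

The logit sum is maximized exactly when c ≡ a + b (mod p), and tracking how much of the computation is explained by these sinusoidal components is one of the progress measures showing the generalizing circuit forming before test loss drops.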
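For the Hopfield-network video a few items up: the classical model stores binary patterns ξ^μ in Hebbian weights and retrieves them by asynchronous sign updates that never increase an energy function (standard textbook form, stated here for reference):

```latex
% Classical Hopfield network: Hebbian storage and retrieval dynamics
W_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^{\mu} \xi_j^{\mu} \quad (W_{ii} = 0)
s_i \leftarrow \mathrm{sign}\Big( \sum_{j} W_{ij} \, s_j \Big)
E(s) = -\tfrac{1}{2} \sum_{i \neq j} W_{ij} \, s_i s_j
```

Stored patterns sit at local minima of E, which is the sense in which "memories" are attractors of the dynamics (up to the classic capacity of roughly 0.14N patterns).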
- [[2306.17844] The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks](https://arxiv.org/abs/2306.17844)
- [[2302.03025] A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations](https://arxiv.org/abs/2302.03025)
- Automated interpretability agent: [MAIA](https://multimodal-interpretability.csail.mit.edu/maia/)
- Emergent planning in RL: https://x.com/farairesearch/status/1816766065050853509
- LLMs are doing linearized subgraph matching: https://x.com/alexisgallagher/status/1816940203585704289 https://x.com/jeremyphoward/status/1816945275195523291
- On the Planning Abilities of Large Language Models: A Critical Investigation: [https://arxiv.org/pdf/2305.15771](https://arxiv.org/pdf/2305.15771)
- Chain of Thoughtlessness? An Analysis of CoT in Planning: [https://arxiv.org/pdf/2405.04776](https://arxiv.org/pdf/2405.04776)
- On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks: [https://arxiv.org/pdf/2402.08115](https://arxiv.org/pdf/2402.08115)
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks: [https://arxiv.org/pdf/2402.01817](https://arxiv.org/pdf/2402.01817)
- The Embers of Autoregression paper was also damning: [https://arxiv.org/pdf/2309.13638](https://arxiv.org/pdf/2309.13638)
- [Circuits Updates - July 2024](https://transformer-circuits.pub/2024/july-update/index.html)
- Sparse autoencoders on board games: https://x.com/a_karvonen/status/1819399813441663042 [[2408.00113] Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models](https://arxiv.org/abs/2408.00113)
- Open-endedness: https://x.com/MLStreetTalk/status/1819432227375084005?t=OnK0cOF3hIh5aQPai_xo-w&s=19 [[2406.04268] Open-Endedness is Essential for Artificial Superhuman Intelligence](https://arxiv.org/abs/2406.04268)
- Why Machines Learn (book)
- Unified view on MLPs, KANs, kernel SVMs, and probabilistic graphical models: [[2407.04819] RPN: Reconciled Polynomial Network Towards Unifying PGMs, Kernel SVMs, MLP and KAN](https://arxiv.org/abs/2407.04819)
- Grokking, theoretical model explanation: [[2309.02390] Explaining grokking through circuit efficiency](https://arxiv.org/abs/2309.02390)
- [[2410.05603] Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition](https://arxiv.org/abs/2410.05603)
- Automatic mech interp: [[2410.13928] Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928)
- Learning-theoretic agenda lectures: [YouTube playlist](https://www.youtube.com/playlist?list=PLsJ9q3OrsguSWfFd1gO1OY64eKahKMFc2) [Video lectures on the learning-theoretic agenda — AI Alignment Forum](https://www.alignmentforum.org/posts/NWKk2eQwfuGzRXusJ/video-lectures-on-the-learning-theoretic-agenda)
- LLM-learned concepts: 1) they form brain-like "lobes", 2) they form "semantic crystals" that are much more precise than they first seem, and 3) the concept cloud is more fractal than round: https://x.com/tegmark/status/1851288315867041903?t=kXQoZyj51CMQ9S4Un4APbw&s=19 [[2410.19750] The Geometry of Concepts: Sparse Autoencoder Feature Structure](https://arxiv.org/abs/2410.19750)
- Statistical mechanics of generalization: https://x.com/CalcCon/status/1851423638035292328
- [Algebraic Geometry and Statistical Learning Theory](https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A)
- Reverse engineering protein foundation ML models using mechanistic interpretability: https://fxtwitter.com/liambai21/status/1852765669080879108
- [[1905.11027] A Geometric Modeling of Occam's Razor in Deep Learning](https://arxiv.org/abs/1905.11027)
- Provably Safe Systems: The Only Path to Controllable AGI: [https://www.youtube.com/watch?v=nUrYCUkTFE4](https://www.youtube.com/watch?v=nUrYCUkTFE4)
- Open-Endedness and General Intelligence - Tim Rocktäschel (Google DeepMind & UCL): [https://www.youtube.com/watch?v=Ums_VKKf_s4&t=624s](https://www.youtube.com/watch?v=Ums_VKKf_s4&t=624s)
- AI theory resources: [r/MachineLearning thread](https://www.reddit.com/r/MachineLearning/s/JsVdoJn8lb)
- Forcefully Control AI's ideology is... not working (yet?): [https://www.youtube.com/watch?v=qHv1YLdwgRk](https://www.youtube.com/watch?v=qHv1YLdwgRk)
- Great Anthropic LLM debiasing research: https://www.anthropic.com/research/evaluating-feature-steering To debias an LLM, you want to steer the "multiple perspectives" and "neutrality" features found by sparse autoencoders. But not too much: steering various bias features too strongly, in any direction, makes the model dumber. And steering some biases sometimes unexpectedly steers many other biases as well, sometimes even more strongly than steering those other features on their own. It's complex, as always. https://x.com/burny_tech/status/1860875226789122524 (A generic steering sketch follows this list.) But in the context of this, I'm already enjoying Google's AIs not telling you about anything unethical that Google has done, or Chinese AIs not knowing what happened in April 1989 on Tiananmen Square.
- Generalization definitions: [[2411.15626] Aligning Generalisation Between Humans and Machines](https://arxiv.org/abs/2411.15626)
- [[1911.01547] On the Measure of Intelligence](https://arxiv.org/abs/1911.01547)
- [[1909.11522] Neural networks are a priori biased towards Boolean functions with low entropy](https://arxiv.org/abs/1909.11522)
- [[1805.08522] Deep learning generalizes because the parameter-function map is biased towards simple functions](https://arxiv.org/abs/1805.08522)
- [[2209.01610] Generalization in Neural Networks: A Broad Survey](https://arxiv.org/abs/2209.01610)
- [[2210.14891] Broken Neural Scaling Laws](https://arxiv.org/abs/2210.14891)
- https://towardsdatascience.com/what-can-flatness-teach-us-understanding-generalisation-in-deep-neural-networks-a7d66f69cb5c
- [https://www.youtube.com/watch?v=jIm2T7h_a0M](https://www.youtube.com/watch?v=jIm2T7h_a0M)
- [[2411.12580] Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models](https://arxiv.org/abs/2411.12580)
- [[2410.21272] Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics](https://arxiv.org/abs/2410.21272)
- [How close is AI to human-level intelligence?](https://www.nature.com/articles/d41586-024-03905-1)
- [Are Video Generation Models World Simulators? · Artificial Cognition](https://artificialcognition.net/posts/video-generation-world-simulators/)
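The Anthropic feature-steering / debiasing notes above amount to adding an SAE feature's decoder direction to the residual stream with a tunable coefficient. A generic hedged sketch (the hook placement, layer index, and coefficient are illustrative assumptions, not Anthropic's setup):

```python
import torch

def steer_with_sae_feature(resid: torch.Tensor,
                           decoder_dir: torch.Tensor,
                           coeff: float = 5.0) -> torch.Tensor:
    """Add `coeff` times a unit-norm SAE feature direction (e.g. a 'neutrality'
    or 'multiple perspectives' feature) to residual-stream activations.
    Sketch only; overly large |coeff| degrades the model, per the research above."""
    direction = decoder_dir / decoder_dir.norm()
    return resid + coeff * direction

# usage with a hypothetical forward hook on one transformer layer:
# def hook(module, inputs, output):
#     return steer_with_sae_feature(output, neutrality_dir, coeff=5.0)
# handle = model.layers[20].register_forward_hook(hook)
```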
- [[2402.15555] Deep Networks Always Grok and Here is Why](https://arxiv.org/abs/2402.15555)
- [https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf](https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf)
- [[2301.09554] Deep Learning Meets Sparse Regularization: A Signal Processing Perspective](https://arxiv.org/abs/2301.09554)
- [Shallow review of technical AI safety, 2024 — LessWrong](https://www.lesswrong.com/posts/fAW6RXLKTLHC3WXkS/shallow-review-of-technical-ai-safety-2024)
- Reverse engineering of DeepSeek's reasoning models is emerging! They apparently found a backtracking vector "that when applied, caused the chain of thought to backtrack much more often, and when suppressed caused it to be a linear and much shorter CoT"! And they think that sparse autoencoders will find similar features, general "functions" the model has learned for reasoning that you can then explicitly steer, manipulate, edit, etc., like backtracking, forking, reflection, self-correction, or "attention sinks" that cause it to focus more on something! https://fxtwitter.com/chrisbarber/status/1885047105741611507
- Skip transcoders instead of SAEs: https://x.com/norabelrose/status/1887972442104316302 "We introduce skip transcoders, and find they are a Pareto improvement over SAEs: better interpretability, and better fidelity to the model 🧵" (architecture sketch after this list)
- https://x.com/rohanpaul_ai/status/1891047293446422823?t=bin6ZFeX0RcTJhMJx9BY0g&s=19
- [[2412.18624] How to explain grokking](https://arxiv.org/abs/2412.18624)
- Rethinking generalization in deep learning: [[2503.02113] Deep Learning is Not So Mysterious or Different](https://arxiv.org/abs/2503.02113) [@andrewgwils.bsky.social on Bluesky](https://bsky.app/profile/andrewgwils.bsky.social/post/3ljncqatngc2w)
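On the skip-transcoder item above: my reading of the announcement is that a sparse transcoder maps an MLP's input to its output through a sparse latent layer, plus a learned linear skip term, so interpretability comes from the sparse features while the skip path preserves fidelity. A hedged architecture sketch (dimensions and training details are my assumptions, not EleutherAI's code):

```python
import torch
import torch.nn as nn

class SkipTranscoder(nn.Module):
    """Sparse transcoder with a linear skip connection: approximates an MLP's
    input->output map as decode(sparse_encode(x)) + skip(x). Sketch only."""
    def __init__(self, d_model: int = 512, d_hidden: int = 8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)
        self.skip = nn.Linear(d_model, d_model, bias=False)

    def forward(self, mlp_in: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.enc(mlp_in))     # sparse latent features (the interpretable units)
        return self.dec(f) + self.skip(mlp_in)

# trained to minimize || SkipTranscoder(mlp_in) - mlp_out ||^2, plus a sparsity penalty on f
```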