List of LLM interpretability resources
https://fxtwitter.com/omarsar0/status/1738592208054370723
[GitHub - JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources.](https://github.com/JShollaj/awesome-llm-interpretability)
Finding Neurons in a Haystack: Case Studies with Sparse Probing - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
Copy Suppression: Comprehensively Understanding an Attention Head - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
Linear Representations of Sentiment in Large Language Models - Investigates how sentiment is represented in Large Language Models (LLMs), finding that it is represented along a linear direction.
Emergent world representations: Exploring a sequence model trained on a synthetic task - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations - Examines small neural networks to understand how they learn group compositions, using representation theory.
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias - Causal mediation analysis as a method for interpreting neural models in natural language processing.
The Quantization Model of Neural Scaling - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Uses a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features (a minimal SAE sketch follows after this list).
Language models can explain neurons in language models - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
Emergent Linear Representations in World Models of Self-Supervised Sequence Models - Linear representations in a world model of Othello-playing sequence models.
"Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model" - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
"Successor Heads: Recurring, Interpretable Attention Heads In The Wild" - Introduces 'successor heads,' attention heads that increment tokens with a natural ordering, such as numbers and days, in LLM’s.
"Large Language Models Are Not Robust Multiple Choice Selectors" - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection bias”.
"Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory" - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
"Let's Verify Step by Step" - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
"Interpretability Illusions in the Generalization of Simplified Models" - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
"The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models" - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
"Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition" - Investigates how LLMs perform the task of mathematical addition.
"Measuring Feature Sparsity in Language Models" - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
Toy Models of Superposition - Investigates how models represent more features than dimensions, especially when features are sparse.
Spine: Sparse interpretable neural embeddings - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors - Introduces a novel method for visualizing transformer networks using dictionary learning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - Introduces Pythia, a toolset designed for analyzing the training and scaling behaviors of LLMs.
On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron - Critically examines the effectiveness of the "Sentiment Neuron".
Engineering monosemanticity in toy models - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
Polysemanticity and capacity in neural networks - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
An Overview of Early Vision in InceptionV1 - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
Visualizing and measuring the geometry of BERT - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
Neurons in Large Language Models: Dead, N-gram, Positional - An analysis of neurons in large language models, focusing on the OPT family.
Can Large Language Models Explain Themselves? - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.
Interpretability in the Wild: GPT-2 small (arXiv) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
Sparse Autoencoders Find Highly Interpretable Features in Language Models - Explores the use of sparse autoencoders to extract more interpretable and less polysemantic features from LLMs.
Emergent and Predictable Memorization in Large Language Models - Investigates whether it can be predicted, from smaller models or partially trained checkpoints, which training sequences an LLM will memorize.
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
Representation Engineering: A Top-Down Approach to AI Transparency - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
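A minimal sparse-autoencoder sketch for the Towards Monosemanticity entry above, assuming PyTorch; the cached activations here are random stand-ins and the hyperparameters are arbitrary, so treat this as a generic dictionary-learning illustration rather than Anthropic's exact training setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a wider set of sparse features."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction from the learned dictionary
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).mean()     # reconstruct the original activation
    sparsity = f.abs().mean()             # L1 penalty pushes toward few active features
    return recon + l1_coeff * sparsity

# Usage on stand-in activations (in practice: MLP activations cached from the model).
sae = SparseAutoencoder(d_model=512, d_features=4096)
acts = torch.randn(64, 512)
x_hat, f = sae(acts)
sae_loss(acts, x_hat, f).backward()
```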
[[2310.06824] The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets](https://arxiv.org/abs/2310.06824)
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
"1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements."
[Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory | OpenReview](https://openreview.net/forum?id=4bSQ3lsfEV)
Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory
1) the larger the network, the more redundant features it learns; 2) in particular, we show how to prune the networks based on our finding using direct equivalent feature merging, without fine-tuning which is often needed in peer network pruning methods; 3) same structured networks with higher feature complexity achieve better performance; 4) through the layers of a neural network, the feature complexity first increase then decrease; 5) for the image classification task, a group of functionally equivalent features may correspond to a specific semantic meaning. Source code will be made publicly available.
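A rough sketch of the idea behind finding (2), merging functionally equivalent features without fine-tuning, assuming NumPy and a toy layer; the correlation threshold and the merge rule are generic stand-ins, not the paper's category-theoretic procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 6))                        # activations of 6 neurons over a dataset
acts[:, 5] = acts[:, 2] + 0.01 * rng.normal(size=1000)   # neuron 5 duplicates neuron 2
W_out = rng.normal(size=(4, 6))                          # next layer reads these 6 neurons

corr = np.corrcoef(acts, rowvar=False)
i, j = 2, 5
if corr[i, j] > 0.99:                                    # near-identical (equivalent) features
    W_out[:, i] += W_out[:, j]                           # route neuron j's downstream effect through i
    W_out = np.delete(W_out, j, axis=1)                  # then drop the redundant neuron
print("pruned output weights:", W_out.shape)             # (4, 5), no fine-tuning needed
```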
SSM models like Hyena were designed using mech interp principles: [Hyena Hierarchy: Towards Larger Convolutional Language Models · Hazy Research](https://hazyresearch.stanford.edu/blog/2023-03-07-hyena)
Some mech interp people think Mamba and/or Fourier transforms are a better way to do large language modeling than transformers.
https://twitter.com/burny_tech/status/1738698544125444604 We can do singularitarianism, safety, and universal sentientism at the same time
https://www.lesswrong.com/posts/uG7oJkyLBHEw3MYpT/generalization-from-thermodynamics-to-statistical-physics singular learning theory generalization
[NeurIPS 2023 Recap — Best Papers - Latent Space](https://www.latent.space/p/neurips-2023-papers) neurips 2023 summary
I have both competing wordcel processes and shapeshifting senses of intuition forming superpositions of embodiments of ecosystems of egregores
How does entropy help me establish a mapping between tokens and bits of information?
Hmm, it's true that in this context it probably won't help much.
It probably won't work like that from the start when the data is unstructured.
What you'd want is the minimal number of variables and their states that the network could theoretically have learned while still predicting the data (without memorization), which is done at least in ML modeling of physical systems by shrinking autoencoders.
Such an experiment couldn't really be done on GPT-4-sized models without costing bazillions 😄, plus it would have to be a genuinely maximally efficient, optimized architecture rather than the chaos going on inside even the smaller language models, where we keep identifying methods to prune them without shrinking their performance.
Once mechanistic interpretability (I believe) properly cracks this, models will get heavily optimized.
Which is actually what caused the current emergence of state-space models like Hyena as alternatives to transformers, which are more efficient in various respects thanks to using results from mechanistic interpretability.
For unstructured data, the mapping 1 token ~ 1 bit of information probably makes sense (toy numbers sketched below).
When, like an untrained network, you know nothing about the data, every token is new information.
For trained networks and humans, though, the learned structure probably should be counted.
Or, from the perspective of Bayesian mechanics, the "objective", "ideal" structure "learnable" under perfect conditions (which I'm not sure is possible even theoretically, because one always has to pick a (subjective) reference frame [Reference class problem - Wikipedia](https://en.wikipedia.org/wiki/Reference_class_problem)
One probably has to somehow take all possible reference frames, or the most commonly used ones (here token prediction, for humans evolutionary fitness?), or also make use of [Entropy, Information gain, and Gini Index; the crux of a Decision Tree](http://www.clairvoyant.ai/blog/entropy-information-gain-and-gini-index-the-crux-of-a-decision-tree) )
which is in practice essentially impossible to obtain.
Damn you Kolmogorov complexity, whose computation is equivalent to the halting problem.
But in a finite universe it could maybe be approximated well enough.
The chief AI scientist at DeepMind has a theoretical model of AI built on that,
and https://www.lesswrong.com/tag/aixi
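The toy numbers behind the tokens-vs-bits point above: an untrained model treats every next token as roughly log2(|vocab|) bits of surprise, while a trained model's cross-entropy per token is much lower because it has learned structure. All numbers below are made up.

```python
import math

vocab_size = 50_000
uniform_bits_per_token = math.log2(vocab_size)    # ~15.6 bits: no structure learned yet

# Hypothetical probabilities a trained model assigns to the actual next tokens.
trained_probs = [0.4, 0.1, 0.7, 0.05, 0.9]
trained_bits_per_token = sum(-math.log2(p) for p in trained_probs) / len(trained_probs)

print(f"untrained (uniform) model: {uniform_bits_per_token:.1f} bits/token")
print(f"trained model (toy numbers): {trained_bits_per_token:.1f} bits/token")
```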
How to solve the Reference class problem? Omniperspectivity?
[Reference class problem - Wikipedia](https://en.wikipedia.org/wiki/Reference_class_problem)
"Every single thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things, leading to problems with how to assign probabilities to a single case.
For example, to estimate the probability of an aircraft crashing, we could refer to the frequency of crashes among various different sets of aircraft: all aircraft, this make of aircraft, aircraft flown by this company in the last ten years, etc. In this example, the aircraft for which we wish to calculate the probability of a crash is a member of many different classes, in which the frequency of crashes differs. It is not obvious which class we should refer to for this aircraft. In general, any case is a member of very many classes among which the frequency of the attribute of interest differs. The reference class problem discusses which class is the most appropriate to use.
In statistics, the reference class problem is the problem of deciding what class to use when calculating the probability applicable to a particular case.
In Bayesian statistics, the problem arises as that of deciding on a prior probability for the outcome in question (or when considering multiple outcomes, a prior probability distribution)."
How about theoretically considering the set of all possible prior probabilities that still give some predictive power? Aka, in order to get an objective perspective on reality in terms of maximizing predictive power, given infinite compute, one could consider and evaluate all possible perspectives: the whole state space of possible generative models predicting the whole state space of possible prior probability distributions in a given domain. (Toy sketch after this exchange.)
this is still model-laden in the meta-space of "how are you classifying possible models", self reference is a bear
I think usually different appropriate-seeming reference classes don't give you estimates that are all that different, and if they do, you probably need to revise your general model of reference classes and their membership functions, or ditch using reference classes as main inputs to inference.
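A toy illustration of the reference-class tension above, assuming made-up aircraft crash counts; the size-proportional pooling weight is one arbitrary choice among many, which is exactly the unresolved part.

```python
# Crash-rate estimates from different reference classes for the same aircraft.
classes = {
    "all aircraft":           {"crashes": 120, "flights": 1_000_000},
    "this make of aircraft":  {"crashes": 8,   "flights": 40_000},
    "this airline, last 10y": {"crashes": 1,   "flights": 15_000},
}

estimates = {k: v["crashes"] / v["flights"] for k, v in classes.items()}
weights = {k: v["flights"] for k, v in classes.items()}   # a subjective weighting choice
total = sum(weights.values())
pooled = sum(estimates[k] * weights[k] / total for k in classes)

for k, p in estimates.items():
    print(f"{k}: {p:.2e}")
print(f"size-weighted pooled estimate: {pooled:.2e}")
```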
Zeta Alpha Trends in AI - December 2023 - Gemini, NeurIPS & Trending AI Papers [Zeta Alpha Trends in AI - December 2023 - Gemini, NeurIPS & Trending AI Papers - YouTube](https://www.youtube.com/watch?v=6iLBWEP1Ols)
yannic mamba [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained) - YouTube](https://www.youtube.com/watch?v=9dSkvxS2EB0)
[[2105.14103] An Attention Free Transformer](https://arxiv.org/abs/2105.14103) An Attention Free Transformer
“Our paper suggests that the information flow inside Transformers can be decomposed cleanly at a macroscopic level. This gives hope that we could design safety applications to know what models are thinking or intervene on their mechanisms without the need to fully understand their internal computations.” https://www.lesswrong.com/posts/uCuvFKnvzwh34GuX3/a-universal-emergent-decomposition-of-retrieval-tasks-in
mamba https://twitter.com/labenz/status/1738214611168395398
improving diffusion models https://twitter.com/isskoro/status/1738661307455316236
prisoner's dilemma [What Game Theory Reveals About Life, The Universe, and Everything - YouTube](https://www.youtube.com/watch?v=mScpHTIi-kM)
Attribution Patching Outperforms Automated Circuit Discovery! [[2310.10348] Attribution Patching Outperforms Automated Circuit Discovery](https://arxiv.org/abs/2310.10348) We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
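A minimal sketch of the linear approximation the quote describes, assuming PyTorch and a toy MLP standing in for a transformer's computational graph: the effect of patching a clean activation into the corrupted run is estimated as (a_clean − a_corrupt) · ∂metric/∂a from a single corrupted forward and backward pass; the edge-level pruning and AUC evaluation are not shown.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean, corrupt = torch.randn(4, 8), torch.randn(4, 8)

acts = {}
handle = model[1].register_forward_hook(lambda m, i, o: acts.update(post=o))

with torch.no_grad():                   # clean run: only cache activations
    model(clean)
clean_act = acts["post"]

out = model(corrupt)                    # corrupted run: cache activations and grads
corrupt_act = acts["post"]
corrupt_act.retain_grad()
out.sum().backward()                    # stand-in for a logit-difference metric
handle.remove()

# First-order estimate of how much patching in each clean activation would move the metric.
attribution = ((clean_act - corrupt_act) * corrupt_act.grad).sum(dim=0).detach()
print(attribution)
```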
https://fxtwitter.com/mezaoptimizer/status/1729981499397603558
[[2311.15131] Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching](https://arxiv.org/abs/2311.15131) Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
AI for all? Yes? No? Some Middle ground?
Democratized intelligence freedom as a basic human right, to prevent an intelligence monopoly and inequality driven by the most GPU-rich?
Bad actors having access to nuclear-like weapons, or creating regulated access, and maximizing defense against them using the same AI superintelligences?
Is it possible to prevent the corruption that leads to regulatory capture for sociopathic power instead of flourishing for all of sentience?
Acceleration and collapse and total reconfiguration of the current system, or dystopic tyranny, or a posthuman utopia of Universal Basic Services for all of sentience?
Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks
[[2312.08550] Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks](https://arxiv.org/abs/2312.08550)
https://twitter.com/pratyusha_PS/status/1739025292805468212?t=FNjSeOX2xpMyLAx_0g8yQA&s=19 LAyer SElective Rank Reduction (LASER)
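A minimal sketch of the rank-reduction step behind LASER, assuming NumPy and a random toy weight matrix; which layers and ranks to pick (the "layer-selective" part) is the method's contribution and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))                 # stand-in for a transformer weight matrix

def rank_reduce(W, k):
    """Replace W with its best rank-k approximation via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

W_low = rank_reduce(W, k=32)
rel_err = np.linalg.norm(W - W_low) / np.linalg.norm(W)
print(f"relative error at rank 32: {rel_err:.3f}")
```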
Promptbench LLM benchmarking framework https://twitter.com/omarsar0/status/1739360426134028631?t=XssxN61w8XTzambK3hP8EA&s=19
[[2312.11514] LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https://arxiv.org/abs/2312.11514) LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Earl K. Miller @MillerLabMIT on the important paradigm shift taking place in neuroscience and on his latest work on cytoelectric coupling and top-down causation in the brain. [Earl K. Miller on Brain Waves and Top-Down Neuroscience - YouTube](https://youtu.be/xkwylDeMWIA)
I really wonder how effective an AI architecture would be if this math of cytoelectric coupling and top-down causation in the brain was used. I wanna see benchmarks!
One can technically list out all the biological and physical differences between humans and chimps (the size, networks, training data, etc.), but I'm not sure we know the difference at the algorithmic level, or do we? Both seem to have a forward-forward algorithm according to Hinton... or is the neocortex all that's needed for general processing? But chimps have one too and it's very similar https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5110243/ Do current AI systems already have something like that functionally? How do we put that difference algorithmically into AI? How can humans with so little compute compared to current gigantic AI systems do so many things, hmmmm... Is it even mathematizable, or is it inscrutable evolution duct-taping the geometry of the brain and genetically installing all sorts of priors?
Are you ready for the massive LLM psyops
Philosophy departments are gain of function memetic laboratories
https://twitter.com/nearcyan/status/1532076277947330561 heavenbanning, banishing a user from a platform by causing everyone that they speak with to be replaced by AI models that constantly agree and praise them, but only from their own perspective
If we don't open-source the AGI breakthrough, the feds will knock on our door and get it for themselves - George Hotz [The AI Alignment Debate: Can We Develop Truly Beneficial AI? (HQ version) - YouTube](https://www.youtube.com/watch?v=iFUmWho7fBE)
[The Most Efficient Way to Destroy the Universe – False Vacuum - YouTube](https://www.youtube.com/watch?app=desktop&v=ijFm6DxNVyI) The Most Efficient Way to Destroy the Universe – False Vacuum
memetic antibodies