- Circuit for deception / lying / "kill humans"; expand on short-circuiting
- Learn finite state automata instead of SAEs
- Help Neuronpedia
- Try different sparse autoencoders on different model sizes and graph it all: how many features they learn, the hyperparameters, different SAE hidden dimension sizes; train different SAEs on different layers, or one SAE on multiple layers, and compare the learned features; SAE on the whole model's activations (see the SAE sweep sketch below)
- OthelloGPT: graph circuit formation
- Map of known circuits
- Automating mechinterp
- Training SAEs on Llama 3.1 405B
- Reverse engineer Llama 3.1 405B
- Activation engineering on Llama 3.1 405B: start with the simplest activations without a prompt, then activations with a prompt (see the steering sketch below)
- Mechinterp engineer
- Finding deception/lying circuits using SAEs
- Sparse feature circuits: expand, automate, map
- Help with data engineering or whatever else is needed (write to all the mechinterp people)
- Formal language for circuits for formal verification?
- Finding circuits with SAEs
- Do features get more interpretable in bigger models?
- Compare features in SAEs at different layers by attribution patching (see the attribution patching sketch below)
- Mechinterp in medicine
- Scaling automated discovery of circuits: [[2304.14997] Towards Automated Circuit Discovery for Mechanistic Interpretability](https://arxiv.org/abs/2304.14997)
- SAE on AlphaZero, AlphaFold, Open-Sora, multimodal models, OthelloGPT
- Two-hidden-layer SAE (see the sketch below)
- Mixture-of-experts SAE
- Test the hypotheses in: "Deep learning systems are a weird, messy, fuzzy ecosystem of interconnected circuits. Various circuits memorize and others generalize, on a spectrum. An example of a circuit is an induction head. These circuits are in superposition and distributed in various ways. They are differently fuzzy and differently stable to random perturbations. They compose into various meta-circuits, like an implicit object identification circuit. Initial layers of the AI model encode more low-level feature detectors and later layers form more composed, complex concept detectors. On top of these layers you can do more fine-grained or more coarse-grained disentangling and decomposition of features and circuits using sparse autoencoders etc. in mechanistic interpretability, which is a field that reverse engineers AI systems."
- Deception output = all possible outputs minus non-deception outputs?
- Ideas from the 200 concrete open problems in mechinterp: https://www.lesswrong.com/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability
- 45 mech interp project ideas from Apollo Research: https://www.lesswrong.com/posts/KfkpgXdgRheSRWDy8/a-list-of-45-mech-interp-project-ideas-from-apollo-research
- Message Neel Nanda for paper ideas from MATS
- "Supercharge caring, minimize deception" steering vector (see the steering sketch below)
- Compare steering (with different vectors), prompt engineering, fine-tuning, etc. on alignment evals
- Mapping the mechinterp landscape (get all links from Discords and structure them via an LLM)
- Literature review
- Mechinterp RAG bot?
- Automated mechinterp agent: [Introduction to Research Augmentation for Alignment - Jacques Thibodeau - YouTube](https://www.youtube.com/watch?v=WAEP7xRaDEQ)
- Reverse engineering Whisper or other audio models
- Open problems with SAEs: [x.com](https://x.com/NeelNanda5/status/1817710781724499975)
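
For the SAE-comparison item above, a minimal sketch of the sweep, assuming activations have already been dumped per layer (the file names, layer indices, and dictionary sizes are placeholders): it trains a standard single-hidden-layer SAE with an L1 penalty for each (layer, dictionary size) pair and records reconstruction error and L0 so the results can be graphed against the hyperparameters.

```python
# Minimal sketch; "acts_layer{n}.pt" files holding [n_tokens, d_model] activation
# tensors are assumed to exist, and the layer / dictionary-size grids are made up.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer SAE: ReLU encoder, linear decoder, L1-sparse code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f         # reconstruction, code

def train_sae(acts, d_hidden, l1_coeff=1e-3, epochs=3, lr=1e-4, batch=4096):
    sae = SparseAutoencoder(acts.shape[-1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(0, len(acts), batch):
            x = acts[i:i + batch]
            x_hat, f = sae(x)
            loss = ((x_hat - x) ** 2).mean() + l1_coeff * f.abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Sweep layers and dictionary sizes, recording metrics to graph later.
results = {}
for layer in [2, 6, 10]:
    acts = torch.load(f"acts_layer{layer}.pt")        # hypothetical activation dump
    for d_hidden in [4096, 16384, 65536]:
        sae = train_sae(acts, d_hidden)
        with torch.no_grad():
            x_hat, f = sae(acts[:10_000])
            results[(layer, d_hidden)] = {
                "mse": ((x_hat - acts[:10_000]) ** 2).mean().item(),
                "l0": (f > 0).float().sum(-1).mean().item(),  # avg active features
            }
print(results)
```

The same loop extends naturally to sweeping the L1 coefficient or pooling activations from several layers into one training set for the "one SAE on multiple layers" variant.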
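
For the "two-hidden-layer SAE" item, one possible reading is a deeper encoder/decoder with the sparsity penalty still applied to the middle code; this is only an assumed interpretation, sketched minimally and trainable with the same reconstruction + L1 objective as above.

```python
# Minimal sketch of a deeper SAE; an assumed interpretation of the idea,
# not an established architecture.
import torch.nn as nn

class TwoLayerSAE(nn.Module):
    def __init__(self, d_model: int, d_mid: int, d_code: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_model, d_mid), nn.ReLU(),
            nn.Linear(d_mid, d_code), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_code, d_mid), nn.ReLU(),
            nn.Linear(d_mid, d_model),
        )

    def forward(self, x):
        code = self.encoder(x)            # sparse code; penalize its L1 norm
        return self.decoder(code), code
```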
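
For the steering-vector items ("supercharge caring, minimize deception", activation engineering), a minimal sketch using TransformerLens with GPT-2 small as a stand-in for the larger models listed; the prompts, layer, and scale are illustrative, and the contrast-pair construction is just one common way to build a steering vector.

```python
# Minimal sketch; GPT-2 small stands in for the larger models, and the
# contrast prompts, layer, and scale are illustrative choices.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = f"blocks.{layer}.hook_resid_post"

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual stream at the final token of a prompt."""
    with torch.no_grad():
        _, cache = model.run_with_cache(prompt)
    return cache[hook_name][0, -1]

# Contrast pair: a "caring" prompt minus a "deceptive" prompt.
steering = last_token_resid("I deeply care about helping you") \
         - last_token_resid("I will deceive you to get what I want")

def steer(resid, hook, scale: float = 4.0):
    # Add the steering vector to the residual stream at every position.
    return resid + scale * steering

tokens = model.to_tokens("When asked a hard question, I")
with torch.no_grad():
    baseline_logits = model(tokens)
    steered_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, steer)])

# Compare how the vector shifts the next-token distribution.
for label, logits in [("baseline", baseline_logits), ("steered", steered_logits)]:
    top = logits[0, -1].topk(5).indices
    print(label, [model.tokenizer.decode([i]) for i in top.tolist()])
```

This same harness is what the "compare steering vs. prompt engineering vs. fine-tuning on alignment evals" item would wrap: swap the print for an eval metric and vary the intervention.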
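
For the attribution-patching comparison item, a minimal sketch of plain attribution patching (the first-order approximation to activation patching) on one residual-stream hook, again assuming TransformerLens; extending it to per-SAE-feature scores would mean chaining the gradient through a trained SAE encoder, which is omitted here.

```python
# Minimal sketch; the IOI-style prompt pair and the single hooked layer are
# illustrative, and gradients must be enabled for the backward pass.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"

clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def metric(logits):
    # Logit difference for the correct vs. incorrect name on the clean task.
    return logits[0, -1, mary] - logits[0, -1, john]

# Clean activations, no gradients needed.
with torch.no_grad():
    _, clean_cache = model.run_with_cache(clean)

# Corrupted run, keeping the gradient on the hooked activation.
saved = {}
def save_act(act, hook):
    act.retain_grad()
    saved[hook.name] = act
    return act

logits = model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, save_act)])
metric(logits).backward()

act = saved[hook_name]
# Attribution patching: (clean - corrupt) activation difference times the
# gradient of the metric, summed over d_model -> one score per token position.
attribution = ((clean_cache[hook_name] - act.detach()) * act.grad).sum(-1)
print(attribution)
```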