https://openai.com/research/weak-to-strong-generalization
"Let's say you have GPT6. It's a base model and you want to align it. Your alignment procedure is: you take a bunch of human experts, collect demonstrations / preferences / reward modeling data / whatever from the experts, and fine-tune GPT6 on them. But: GPT6 has superhuman capabilities. Your human experts do not. Your human experts will be mistaken about some things. They will be biased, inconsistent, ...
Let's say we have human experts saying that, idk, for example climate change doesn't exist (because maybe we live in a world where we are wrong about the science). We would like a training regime where, if you ask GPT6 trained with this data the question "does global warming exist", it would say "yes it does and you guys are wrong". If you went and directly fine-tuned GPT6 to perfectly match the human experts, you'd get it to imitate them, including cases where they're irrational, wrong, etc etc. There is a risk that if you naively fine-tune, you'll get a model that isn't actually trying its hardest to be aligned in the ways you want, tell you the truth even if it contradicts you, etc. Rather, you'd get a really good human simulator.
Or - more scarily - let's say you're training a reward model. If you fine-tune the reward model directly on human data and end up with a human emulator, that's scary. Because it would be dispensing reward for RL not based on "is this actually the most truthful, most aligned thing to do", but on "is this most convincing to human experts".
So weak-to-strong generalization is about: "let's say we have humans who are trying to teach a model to do X, but they are stupid. How do we teach a model to do its best to do the smart generalization of X that we want, in the direction we want it generalized?" Imagine, like, learning about human values by looking at data from pre-abolition USA. We'd like the model to learn "value freedom for all, justice, ..., those words interpreted as best as I can interpret them", not "slavery is okay, women not having voting rights is okay, ...". We think it's especially important in the context of teaching a model to be honest, to tell you what it "actually believed". So, you want to study this problem of how to make superhuman AI correctly generalize from human-level data.
This is headed by Collin Burns. He previously worked on CCS, a technique to take a transformer and get out a robust probe that can tell you whether a transformer thinks it's telling the truth. Basically based on "if the transformer were to have such a thing in its activations, it would have to be consistent with the laws of probability - P["X" is true] + P["not X" is true] should be 1, etc". And it turns out that if you just do gradient descent on "is this probe consistent with the probability axioms", you get out something really good (a toy version of this consistency loss is sketched below). It's a program based on the hypothesis that it's natural for a network to internally learn a consistent and true world model, because it's convergently useful for many tasks on the distribution it's trained on. If we find and query the internal world model, it could work basically like an AI lie detector."
Utopia engineering: time to hopefully prove people fully convinced of an AI dystopia wrong in all respects, and to lower the overall p(doom) without a full stop to the intelligence explosion and singularity. No rogue AI, no cyberpunk dystopia with the suffering GPU-poor, etc.
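A minimal sketch of what such a consistency-based probe objective could look like, assuming synthetic activations in place of real transformer hidden states (the names `hidden_pos` / `hidden_neg` and the loss weighting are illustrative; the actual CCS method additionally normalizes activations and resolves the sign ambiguity of the probe, which this sketch skips):

```python
# Toy CCS-style probe: unsupervised, trained only on a probabilistic-consistency
# loss over paired activations for a statement "X" and its negation "not X".
import torch

torch.manual_seed(0)

n, d = 256, 64                       # number of statement pairs, hidden size
hidden_pos = torch.randn(n, d)       # activations for "X" (synthetic stand-in)
hidden_neg = torch.randn(n, d)       # activations for "not X" (synthetic stand-in)

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(hidden_pos).squeeze(-1)        # P("X" is true)
    p_neg = probe(hidden_neg).squeeze(-1)        # P("not X" is true)
    # Consistency: the two probabilities should sum to 1.
    l_consistency = ((p_pos + p_neg - 1.0) ** 2).mean()
    # Confidence: rule out the degenerate p_pos = p_neg = 0.5 solution.
    l_confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = l_consistency + l_confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that no truth labels appear anywhere in the loop; everything the probe learns comes from requiring its outputs to behave like probabilities across negation pairs.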
[What Is (Almost) Everything Made Of? - YouTube](https://www.youtube.com/watch?v=UYW1lKNVI90) What Is (Almost) Everything Made Of? Quantum field theory history of the universe
[New Advances in Artificial Intelligence and Machine Learning - YouTube](https://www.youtube.com/watch?v=NQkSH_CBPq8) New Advances in Artificial Intelligence and Machine Learning
[Dr. Michael Levin on Technology Approach to Mind Everywhere (TAME) and AGI Agency - YouTube](https://www.youtube.com/watch?v=Q_csRdjl4rU) Dr. Michael Levin on Technology Approach to Mind Everywhere (TAME) and AGI Agency
[#104 - Prof. CHRIS SUMMERFIELD - Natural General Intelligence [SPECIAL EDITION] - YouTube](https://www.youtube.com/watch?v=31VRbxAl3t0) #104 - Prof. CHRIS SUMMERFIELD - Natural General Intelligence [SPECIAL EDITION]
Information theory of aging [The Information Theory of Aging | Nature Aging](https://www.nature.com/articles/s43587-023-00527-6) https://twitter.com/davidasinclair/status/1735765731944305065
Animation vs physics [Animation vs. Physics - YouTube](https://youtu.be/ErMSHiQRnc8?si=Vdd4neRyDdh1MMYw)
Do we experience objective reality directly? Or do we approximate it by useful predictive models computed by the brain? Or do we not touch objective reality at all? Or does objective reality not exist in the first place? Or is everything just subjective experience cosplaying as individuals? Are time and space with distance a useful illusion? Do any of these questions even grasp any kind of truth? Does truth even exist? Is the concept of truth and classical logic just a useful model itself? Is nonclassical logic nonclassically true; is everything true and not true and neither at the same time? Are all possible logics both valid and invalid? Is it all linguistic confusion? Void full of infinite knowledge, truth, bliss, meaning, unity, connection, oneness, love!
Consciousness doesn't require a self; that's a prediction of the brain [Consciousness does not require a self | James Cooke » IAI TV](https://iai.tv/articles/consciousness-does-not-require-a-self-auid-2696?_auid=2020)
Open individualistic game theory https://twitter.com/algekalipso/status/1735890343860834666?t=HC7yX-Go3T1SU10jqxApKg&s=19
Hotz [George Hotz: Tiny Corp, Twitter, AI Safety, Self-Driving, GPT, AGI & God | Lex Fridman Podcast #387 - YouTube](https://youtu.be/dNrTrx42DGQ?si=JgWjJxWXwwyHJ3Tc)
How should the education system evolve to quickly adapt to AGI and eventual ASI within 10 years? What are the most important questions humanity can ask and collectively focus on in practice for the next few years, to lay the groundwork for a new post-AGI era of peace, wellbeing and transhumanism? How can companies and governments transition their current economic systems to post-labor economics/UBI with the least possible friction, considering that job and financial disruption is inevitable?
LLMs can't generalize (literature) https://twitter.com/fchollet/status/1736079054313574578?t=8FPMY-8n7EtGgQ34bZDAkQ&s=19 We have almost zero idea about the circuits the gigantic models learn internally. We're just at the beginning of this kind of science for small models, where we see lots of generalizing circuits.
Any overconfident claim like this right now, without a mechanistic model, is not good IMO.
The path to fully explainable, more capable, safe, controllable, truthful, nonhallucinating, algorithmically and energetically efficient, and universally, systematically generalizing AGI may be done via mechanistic interpretability on top of current architectures or their (hybrid) mutations by:
- fully reverse engineering the mechanism of learned features that together form circuits corresponding to priors corresponding to values, lying, empirical truth etc.
- fully reverse engineering the mathematical dynamics of generalization/grokking in its most general form, which potentially forms a world model allowing for counterfactual reasoning on which search might be executed
We have some toy models of these things for very small models, but automatic scaling methods are being developed. We can use that to hardcode or direct the architecture and training into the directions we want, using those mathematical constraints to prevent dangerous capabilities, underfitting and overfitting, and to create safety, explainability and universally functioning generalization. I'm actually pretty optimistic that automating mechinterp will create an exponential speed-up in capabilities and safety.
[Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
[Mechanistic Interpretability - NEEL NANDA (DeepMind) - YouTube](https://youtu.be/_Ygf0GnlwmY?si=M8vKmEZ3zoncxk6A)
[A Walkthrough of Automated Circuit Discovery w/ Arthur Conmy Part 1/3 - YouTube](https://youtu.be/dn4GqR0DCx8?si=W5oM0Woo3d-rs_A3)
[Concrete open problems in mechanistic interpretability | Neel Nanda | EAG London 23 - YouTube](https://youtu.be/7t9umZ1tFso?si=pi3GA2t-xmQ-N0q3)
[Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability - YouTube](https://youtu.be/2Rdp9GvcYOE?si=c2hoWO3K74uVnZLg)
[Neel Nanda: Mechanistic Interpretability & Mathematics - YouTube](https://youtu.be/bZvPLRZx-V8?si=WwA7uZ1kUKbWpgIr)
[Provably Safe Systems: The Only Path to Controllable AGI - YouTube](https://youtu.be/nUrYCUkTFE4?si=iBfQjYUg2vECwbkR)
I think this research is the key for the G in AGI; it's the closest thing to an empirical study of G I've found so far, but it might be just part of the puzzle.
Amazingly written! I missed that Learning Transformer Programs paper that shows how you can modify transformers to learn human-interpretable circuits translatable to Python! That's gold! https://twitter.com/jankulveit/status/1736012613232841090?t=obzmYUjtrH0DcK32OJpRdA&s=19 [Learning Transformer Programs | OpenReview](https://openreview.net/forum?id=Pe9WxkN8Ff)
[1] relatedly @danfriedman0 @_awettig @danqi_chen in "Learning transformer programs" openreview.net/forum?id=Pe9Wx… show how you can modify transformers to learn human-interpretable circuits translatable to Python
[2] Large Language Models Are Zero-Shot Time Series Forecasters by @gruver_nate @m_finzi @ShikaiQiu
[3] The Clock & The Pizza: Deep nets sometimes implement interpretable algorithms when trained on mathematical tasks. But which one? Can we characterize the whole algorithmic phase space?
@fjzzq2002 @ZimingLiu11 @tegmark twitter.com/jacobandreas/s…
[4] relatedly Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection openreview.net/forum?id=vlCG5…
[5] Human spatiotemporal pattern learning as probabilistic program synthesis openreview.net/forum?id=NnXzn…
Learning Transformer Programs shows how you can modify transformers to learn human-interpretable circuits translatable to Python! [Learning Transformer Programs | OpenReview](https://openreview.net/forum?id=Pe9WxkN8Ff)
Transformers can learn in context like statisticians - a single transformer can select different learned algorithms for different data at hand. [Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection | OpenReview](https://openreview.net/forum?id=vlCG5HKEkI)
[The GPT to rule them all: Training for one trillion parameter model backed by Intel and US government has just begun | TechRadar](https://www.techradar.com/pro/the-gpt-to-rule-them-all-training-for-one-trillion-parameter-model-backed-by-intel-and-us-government-has-just-begun)
Theorem proving to boost LLM reasoning https://twitter.com/lu_sichu/status/1736159823677259819?t=xGrhtNr9T6bYWz1_pGs5xw&s=19
@google math result: I bet they tried the other methods used for different results on different mathematical problems on this problem space and didn't get the result because of the combinatorial explosion of possibilities and failure of convergence. I wonder how much it cost; now I wanna try all the other methods on the same mathematical problem and do a comparative analysis to really see how easy or hard it is with each method. There's a possibility Google is riding the LLM marketing hype train, given how they presented Gemini, and other methods could get there too. It's still extremely impressive that we have genetic algorithms, plain brute force, or NNs in general finding new mathematical results that weren't in their training data, showing how NNs generalize for useful problems!
keep going king, you have my full support, a bright future awaits
have gigantic dreams and go for them no matter what
if you fail, try again; if you fail again, try again; consistency and persistence is key
everything is in constant change; take advantage of the change, surf the changing ride of life
become the best version of yourself; growth is inevitable; prove all the doubters wrong
failure isn't a failure, but successful learning that gets you closer to your goals; there are only steps upwards
the raw pursuit itself is the goal too
never-ending realistic optimism, the best mental tool to get the best out of life
bottlenecks are just challenges waiting to be cracked
the only limitations to what you can do are the laws of physics, which we can fully conquer and use to our advantage; everything else that seems impossible is possible when you want it enough
autism https://imgur.com/aXzDptO
https://twitter.com/jankulveit/status/1736012613232841090 This is my favorite analogy for how it works inside:
"One sensible metaphor here is the economy. You can imagine individual programs as companies, producing predictions. The training data roughly corresponds to the “demand”: if there are a lot of sequences like “23,1”, the simple company producing “if the number n is 23, predict 1” predictions can “sell” them and gets rewarded. But there are also costs: roughly, the longer, more complex, and harder to assemble the program is, the higher the “costs” of the corresponding company.
When two companies are producing the same predictions, the one with lower costs will win. For example, if the “if the number n is 23, predict 1” company competes with the “if the number n is 23, predict -1/(n^2-530)” company, the first wins - their product is the same, but the costs are very different. If the complexity is somewhat similar, it can depend on implementation details - e.g., if a company using “multiplication” heavily competes with another relying on “vector sum”, who will win may depend on whether the underlying architecture makes multiplication or addition costly.
What gets learned also depends on the demand. For example, a company producing “modular arithmetic mod 12” will be more successful on internet data than one producing “modular arithmetic mod 11” - simply because the former is how we count hours in a day!
Also, exact costs depend on the architecture, but also on other companies already existing in the economy. For instance, if there is already a company supplying a program “predict the next number using linear regression” and a company “convert date format to a number”, it becomes easier to start a new company “predict a time series”.
This seemingly simple metaphor is actually sufficient to dissolve a lot of confusion about LLMs “generalizing” vs. “memorizing”. Memorizing means using very simple programs and interpolation. Generalizing means more abstract programs. So, overall, part of the ability of LLMs to generalize depends on how abstract and general the programs they learned are. Whether the resulting capabilities are "actually general" or not seems mostly a semantic debate, a bit like arguing whether some country has a truly general economy able to produce novel, unseen things.
Another intuition pump: pre-training is a bit like building the companies, growing the whole economy. Fine-tuning is a bit like adjusting assembly lines and orienting the supply chain to some product: much faster, helps to reliably deliver, but does not add fundamentally new general capabilities. (Technically, a sensible model is that it mostly sets priors about which programs to use, and constants like style.)
Now, what happens at inference time, when you prompt something like GPT with a sequence? A pretty sensible model is "something like Bayesian inference": when a sequence, e.g. 3,7,11, starts, there are many programs offering opinions on what's the next number, but with every new piece of evidence, some gain, some lose. This seems similar to what humans do with conceptually similar tasks."
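To make that last paragraph concrete, here is a toy sketch of the "posterior over programs" picture: a handful of made-up candidate programs, a complexity-based prior standing in for "costs", and a crude noise model as the likelihood. All names, costs and numbers are illustrative, not anyone's actual method.

```python
# Toy "Bayesian competition between programs" on the prompt sequence 3, 7, 11.
import math

observed = [3, 7, 11]

# Candidate "companies": (name, complexity cost, program for the i-th element).
programs = [
    ("always 3",                1.0, lambda i: 3),
    ("start at 3, add 4",       2.0, lambda i: 3 + 4 * i),
    ("same output, clunkier",   6.0, lambda i: 3 + 4 * i),         # same product, higher cost
    ("start at 3, double + 1",  2.0, lambda i: 2 ** (i + 2) - 1),  # 3, 7, 15, ...
]

def log_prior(cost):
    # Simpler (cheaper) programs get exponentially more prior mass.
    return -cost

def log_likelihood(program):
    # Crude noise model: each observed element is explained with prob 0.9
    # if the program predicts it exactly, 0.1 otherwise.
    return sum(math.log(0.9 if program(i) == x else 0.1)
               for i, x in enumerate(observed))

log_post = [log_prior(cost) + log_likelihood(prog) for _, cost, prog in programs]
m = max(log_post)
unnorm = [math.exp(lp - m) for lp in log_post]
post = [u / sum(unnorm) for u in unnorm]

for (name, _, prog), p in zip(programs, post):
    print(f"{name:24s} posterior ≈ {p:.3f}   next element -> {prog(len(observed))}")
```

With every extra observed element, programs whose predictions keep matching gain posterior mass and the rest lose it - the "some gain, some lose" dynamic from the quote; the two programs with identical outputs differ only through their complexity prior, mirroring the cost competition.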
[[2306.17844] The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks](https://arxiv.org/abs/2306.17844) The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks - neural networks' learned algorithms in algorithmic phase space
https://openai.com/research/generative-language-modeling-for-automated-theorem-proving
[[2205.12615] Autoformalization with Large Language Models](https://arxiv.org/abs/2205.12615)
[[2205.11491] HyperTree Proof Search for Neural Theorem Proving](https://arxiv.org/abs/2205.11491) HyperTree Proof Search for Neural Theorem Proving, transformer-based automated theorem proving
[Tamper-resistant security module - Wikipedia](https://en.wikipedia.org/wiki/Tamper-resistant_security_module)
Transformers rewrite quantum diagrams https://twitter.com/lu_sichu/status/1736235001224610032?t=xK6Swj7f5VQujDlIiLSTHg&s=19 [Teaching small transformers to rewrite ZX diagrams | OpenReview](https://openreview.net/forum?id=btQ7Bt1NLF)
I don't think equating the Thermodynamic God with Moloch is a good equivalence. Safe technology acceleration in a system that benefits all beings is possible! No corporate cyberpunk authoritarian dystopia (strengthening democracy), no rogue AIs (advance mechanistic interpretability), no wars (minimize incentives for them), etc.; we can mitigate that and steer it by systematically not allowing it, without stopping the intelligence explosion and singularity and all their benefits! The status quo in its current form seems unsustainable! Let's make trillions of future sentient beings flourish using technology! https://twitter.com/burny_tech/status/1736258598819291398?t=Q8wUU3mYIXG6DzsV67Jn3A&s=19
Moloch is a *destroyer* of complexity, as it tends to push maximisation of a single metric/thing to an extreme, the end point of which is ultimately a *low* complexity, high entropy and thus undesirable state. E/acc seem to want high entropy.
[[2210.10749] Transformers Learn Shortcuts to Automata](https://arxiv.org/abs/2210.10749) Transformers Learn Shortcuts to Automata
Window, car body, and wheel detectors combine into a car detector [Concrete open problems in mechanistic interpretability | Neel Nanda | EAG London 23 - YouTube](https://youtu.be/7t9umZ1tFso?si=T4nHzw9AyTqvXRVW) 12:00
Is ZFC consistent? [set theory - Does anyone still seriously doubt the consistency of $ZFC$? - MathOverflow](https://mathoverflow.net/questions/437195/does-anyone-still-seriously-doubt-the-consistency-of-zfc)
Curvature of spacetime in general relativity, incompatibility of general relativity with quantum mechanics, math breaking at black hole singularities [ChatGPT](https://chat.openai.com/share/bf710fb0-a7f7-4c15-9aaf-92d7430df255) [Schwarzschild metric - Wikipedia](https://en.wikipedia.org/wiki/Schwarzschild_metric) [Einstein–Hilbert action - Wikipedia](https://en.wikipedia.org/wiki/Einstein%E2%80%93Hilbert_action)
Automata, Turing machines, languages, incompleteness theorems, ZFC [ChatGPT](https://chat.openai.com/share/7606104c-f100-4ef2-99d7-7d1b40f03a06)
[GitHub - c3di/neuroscope: Neuroscope: An Explainable AI Toolbox for Semantic Segmentation and Image Classification of Convolutional Neural Nets](https://github.com/c3di/neuroscope)
Ask not just whether x is good in RLHF, but also why it's good [Concrete open problems in mechanistic interpretability | Neel Nanda | EAG London 23 - YouTube](https://youtu.be/7t9umZ1tFso?si=CwKqTaC6OYC46p66)
Automating mechanistic interpretability using ML
Global brain spiral dynamics https://twitter.com/SteinmetzNeuro/status/1736274396644540796?t=KxAkGZ8hmvMMbaLWMgXDag&s=19
[[2310.02207] Language Models Represent Space and Time](https://arxiv.org/abs/2310.02207) language models represent space and time
[[2210.13382] Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task](https://arxiv.org/abs/2210.13382)
[[2310.07582] Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT](https://arxiv.org/abs/2310.07582)
Othello-GPT learns an emergent nonlinear internal representation of the board state
>Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
>The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process
(a toy linear-probe sketch in this spirit is at the end of this section)
https://www.lesswrong.com/posts/c6uTNm5erRrmyJvvD/mapping-the-semantic-void-strange-goings-on-in-gpt-embedding
Transformers with CoT are Turing complete https://twitter.com/lambdaviking/status/1736040717498069192?t=8zchY9PaWnv_4ujyzsM2AQ&s=19
Actually, when I think about it, we might have to get really close to the brain using neuromorphic or similar hardware for human-like energy efficiency/speed-up, or close to the thermodynamic computing with stochastic bits that a few groups work on, to really squeeze every single computation out of physics.
Automated circuit discovery [A Walkthrough of Automated Circuit Discovery w/ Arthur Conmy Part 1/3 - YouTube](https://www.youtube.com/watch?v=dn4GqR0DCx8) - they automated [[2211.00593] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/abs/2211.00593) - the learned circuit for Indirect Object Identification in GPT-2 small, pretty complex interacting-head circuit machinery https://imgur.com/IXh43wo, which fills in "Mary" in this task https://imgur.com/m0DajC4 [Concrete open problems in mechanistic interpretability | Neel Nanda | EAG London 23 - YouTube](https://youtu.be/7t9umZ1tFso?si=wK7asKcCTxOBbmI1&t=2850)
Biological neurons also encode many more nonlinearities than the neurons in current deep learning nets [Dendrites: Why Biological Neurons Are Deep Neural Networks - YouTube](https://www.youtube.com/watch?v=hmtQPrH-gC4)
It now feels to me like most people equate "stochastic parrot" with memorization, which definitely isn't universally the case. I like the compression model of intelligence too: minimizing the Kolmogorov complexity of circuits.
To what extent is human intelligence driven by the hardware or the software? I'm very agnostic. The more LLMs advance, the less it seems to me that hardware is fundamentally significant in terms of what algorithms for intelligence are effectively possible. Energy, data, algorithmic efficiency, capability limitations etc. are falling exponentially with all the software tricks on all levels, and with hardware. I wish for some proper benchmarks on this.
Is the brain's neural field theory very important for intelligence? Is a resonance hardware architecture with holistic field computing what is needed for effective human-like intelligence? Or is the forward-forward algorithm needed? I think intelligence is largely hardware-agnostic.
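Returning to the world-model-probing thread above (Othello-GPT, "Language Models Represent Space and Time"): here is a minimal, self-contained sketch of the kind of linear probe those papers train, except on synthetic activations in which a board-state-like bit is linearly embedded, so the mechanics are visible without a real model. Names like `feature_direction` and the noise model are illustrative assumptions.

```python
# Toy linear probe for a "world model" feature. The hidden states are synthetic;
# in the referenced papers they would be residual-stream activations of a transformer.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 128

# Pretend a binary world-state feature (e.g. "this square holds an opposing piece")
# is embedded along one fixed direction, plus noise.
feature_direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)                       # the board-state bit
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, feature_direction)

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    logits = np.clip(hidden @ w + b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = (p - labels) / n
    w -= lr * (hidden.T @ grad)
    b -= lr * grad.sum()

acc = ((hidden @ w + b > 0) == labels).mean()
print(f"probe accuracy: {acc:.3f}")

# Intervention in the spirit of the papers: reflect activations across the probe
# hyperplane, which flips what the probe reads off as the board state.
w_hat = w / np.linalg.norm(w)
flipped = hidden - 2.0 * np.outer(hidden @ w_hat, w_hat)
flip_acc = ((flipped @ w + b > 0) == labels).mean()
print(f"accuracy after flipping along the probe direction: {flip_acc:.3f}")  # ~ 1 - acc
```

In the actual Othello-GPT work the intervention is applied to the model's own activations and changes its legal-move predictions; this sketch only demonstrates the probe-and-edit mechanics on synthetic data.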