"
Could a major opportunity to improve representation in deep learning be hiding in plain sight? Check out our new position paper: Questioning Representational Optimism in Deep Learning
I wonder if a differently set up deep learning architecture, training algorithm, and pipeline could arrive at similarly beautiful representations
https://fxtwitter.com/NickEMoran/status/1924888905523900892?t=AH_UBS0KbzFHD5amvp7JjQ&s=19
https://fxtwitter.com/kenneth0stanley/status/1924650134299939082?t=3WQ9qlaxJ_fuueRl57UE8A&s=19
And I also wonder whether it's better to frame each type of representation as having its own advantages and disadvantages: both unified factored representations and entangled representations in superposition.
'
- features trained on a single image suck,
- using a method with implicit heavy regularization on the features makes them smoother,
Summarized that for you
'
And this guy might have an interesting point:
'
I read the intro and I'm not convinced by some of the fracturing arguments. The paragraphs seem to contradict each other, e.g. a unified factored independent representation is just "an if statement for every case".
>change hair color might also cause the foliage in the background to change as well
That isn't a fractured representation; it's a unified representation of coloring, just a wrong one that should have been fractured more.
Creativity is a huge use case for fractured representations. You want to mix independent ideas. Steve Jobs did LSD to boost his creativity.
Repeating neuron representations is generally good for learning (e.g. dropout). Redundant circuits boost robustness. And redundant circuits early in training can specialize later on.
And there's the biological argument that the brain is very fractured: information is duplicated and stored throughout the entire brain, there are many redundant circuits that do the same thing, and there is no unified representation.
I have an intuition about this.
Evolutionary processes are stochastic, so if you have really entangled representations, they're very delicate and will be heavily disrupted by the introduction of any noise; as a result, the representations simplify and map more closely (or more directly) to the data.
I guess. Sorry, I'm not explaining it super well but it makes sense in my head.
But if my intuition is correct, any stochasticity will encourage this
I forget which paper it was, but there was a paper recently that noted the performance of modern LLMs (per parameter used) was closely tied to the very entanglement of representations that this paper argues against
And they argued that we still have orders of magnitude more performance per parameter available if we can encourage the entanglement of concepts
So... depending on who you ask, either
A) Stop using dropout
B) Use a ton of dropout
Yeah, superposition, sorry, that's what I meant
'
"We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions." Superposition Yields Robust Neural Scaling: [[2505.10465] Superposition Yields Robust Neural Scaling](https://arxiv.org/abs/2505.10465)
'
But yeah, they both note that stochasticity in the training process seems to encourage simpler representations, or at least ones that are more robust to perturbation. Presumably, having a fairly linear relationship to the data helps with that
'
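As a purely illustrative aside (my example, not from either paper), here is what "stochasticity in the training process" looks like in its most mundane form: standard dropout in PyTorch, i.e. random masks during training and deterministic behavior at evaluation.

```python
import torch
import torch.nn as nn

# Tiny MLP with dropout: during training, random units are zeroed on each
# forward pass, so the network cannot lean on any single fragile pathway.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # p controls how much noise gets injected
    nn.Linear(64, 10),
)

x = torch.randn(8, 32)

model.train()                   # dropout active: forward passes are stochastic
y1, y2 = model(x), model(x)     # differ because of independent random masks
print(torch.allclose(y1, y2))   # False (almost surely)

model.eval()                    # dropout disabled: deterministic inference
y3, y4 = model(x), model(x)
print(torch.allclose(y3, y4))   # True
```

Whether you want less of this noise or "a ton" of it is exactly the disagreement above.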
On the other hand:
https://fxtwitter.com/fabmilo/status/1924656105340469595?t=C5ot2mUyNmOkRrUaYzyS6w&s=19
Fabrizio Milo (@fabmilo)
@kenneth0stanley @hardmaru The messy network reminds me of so many software architecture frameworks and stacks. Stratified evolution can’t be optimized globally it seems.
https://fxtwitter.com/kenneth0stanley/status/1924656383259246972?t=7HwEpBdsIByzfYa2eX03xQ&s=19
Kenneth Stanley (@kenneth0stanley)
@fabmilo @hardmaru Yes good observation! In fact, we note in the paper that this kind of fractured entangled representation is like poorly written code. It's a good metaphor.
I would be curious how he sees this paper about superposition yielding robust neural scaling in relation to his work
Maybe these software architectures reflect our cognition
I have a feeling that there is some sweet spot that maximizes the advantages and minimizes the disadvantages of both unified factored representations and entangled representations in superposition, yielding more robust, generalizing circuits that could be studied using methods from mechanistic interpretability
"
If it can be mathematically specified as a reward, then there's a chance reinforcement learning can pull it off
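A minimal, hypothetical sketch of what "specifying it mathematically as a reward" can mean in practice: a verifiable reward function (my toy example, not tied to any particular RL setup) that a policy-optimization method could in principle be trained against.

```python
# Toy verifiable reward: 1.0 if the proposed answer matches the true value of a
# trusted arithmetic expression, 0.0 otherwise. (eval() is only acceptable here
# because the problems come from us, not from the model.)
def reward(problem: str, answer: str) -> float:
    try:
        return 1.0 if float(answer) == float(eval(problem)) else 0.0
    except Exception:
        return 0.0

print(reward("2 + 2 * 3", "8"))   # 1.0
print(reward("2 + 2 * 3", "10"))  # 0.0
```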
The scientific progress in AI thanks to this AI boom has been huge so far
Sure, it's terribly concentrated on one type of AI system, but even so there's still some diversity in the background that isn't very visible, and which is also being supported thanks to that boom
Thanks to it, attempts at formalizing intelligence are emerging
Thanks to it, attempts at applying all of these technologies in physics, biology, and healthcare are emerging
That's what interests me the most, and it is happening too; so much is happening
Also, the fact that we can encode such an insane amount of knowledge relatively well into a latent space with trillions of dimensions, and then keep working with it, is absolutely mind-blowing
They discuss how reinforcement learning helps zoom in on the more likely correct solutions, and at the same time, the more RL scales, the more novel patterns can emerge, like the new patterns that emerged in AlphaZero (superhuman chess strategies)
[https://youtu.be/64lXQP6cs5M?si=a7-Ly7xdd9MGyoXl](https://youtu.be/64lXQP6cs5M?si=a7-Ly7xdd9MGyoXl)
Here they also cover similar things in the context of reinforcement learning: one thing reinforcement learning helps with is zooming in on the more likely correct options, and at the same time, the more RL is scaled, the more novel patterns can emerge, like the new patterns that emerged in AlphaZero (superhuman chess strategies) 9:20
Extremely high-quality data and the best reinforcement learning setups are currently the biggest moats. That's why Google started winning: they have the best history with, and access to, both.
Will scaling inference-time training be the next bitter lesson? [https://arxiv.org/abs/2401.11504](https://arxiv.org/abs/2401.11504)
The future is multi-agent reinforcement learning
I wanna see more mechanistic interpretability for models doing math.