New OpenAI o1 model. "We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math." [Introducing OpenAI o1 | OpenAI](https://openai.com/index/introducing-openai-o1-preview/) [Learning to Reason with LLMs | OpenAI](https://openai.com/index/learning-to-reason-with-llms/) [o1 System Card | OpenAI](https://openai.com/index/openai-o1-system-card/) [OpenAI Strawberry Livestream - Metaprompting, Cognitive Architecture, Multi-Agent, Finetuning - YouTube](https://www.youtube.com/live/AO7mXa8BUWk) [GPT-o1 - by Zvi Mowshowitz - Don't Worry About the Vase](https://thezvi.substack.com/p/gpt-4o1) [OpenAI announces o1 : r/singularity](https://www.reddit.com/r/singularity/comments/1ff7mod/openai_announces_o1/) Trained with reinforcement learning, chain of thought, self-correction, sampling with a scoring function, etc. My thoughts: [OpenAI o1 Strawberry Q* AI reasoning LLM model destroys Claude 3.5 Sonnet on reasoning, mathematics! - YouTube](https://www.youtube.com/watch?v=MBxcKY6he1c) https://x.com/burny_tech/status/1834650814419550368 https://x.com/burny_tech/status/1834651772637384712 Benchmarks: https://x.com/burny_tech/status/1834283752346005926 Dominating basically every benchmark, like LiveBench [LiveBench](https://livebench.ai/) https://x.com/polynoamial/status/1834280155730043108 IQ https://x.com/DaveShapi/status/1835117569432224005 https://x.com/maximlott/status/1835043371339202639 AI Explained has great out-of-distribution benchmarks [ChatGPT o1 - In-Depth Analysis and Reaction (o1-preview) - YouTube](https://www.youtube.com/watch?v=7J44j6Fw8NM) https://x.com/kimmonismus/status/1834296216009552341 AidanBench https://x.com/aidan_mclau/status/1835023308238340460 https://x.com/burny_tech/status/1835091020276437138 The OlympicArena reasoning benchmark for o1-preview goes hard https://x.com/sytelus/status/1834352532585676859 https://x.com/polynoamial/status/1835086680266883205 The AI field desperately needs harder evals that take continued fast progress into account. https://x.com/burny_tech/status/1834716200586084485 But ARC has not fallen yet; I wonder how much o1 would score on ARC after AlphaZero-like RL with self-correcting CoT finetuning on ARC. And I wonder how valid this internal Devin coding benchmark of theirs is; the exponential https://fxtwitter.com/cognition_labs/status/1834292718174077014 The fact that the new OpenAI o1 model (just a preview for now) still struggles to reason out of distribution (the ARC benchmark, SimpleBench, still problems with some famous puzzles, etc.) makes me think that we will get much better AI models once we figure out much more robust (hardcoded or emergent) first-principles reasoning (in hierarchies, graphs, and so on), instead of retrieving and synthesizing sometimes brittle, weakly generalizing reasoning-program chunks from the training data stored in the latent space. Maybe scale, better training data, and training hacks will cause the emergence of a reasoning engine general enough, robust enough, and all-encompassing enough that it eventually phase-shifts into a first-principles-reasoning metastable configuration of weights.
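My loose mental model of the "sampling with a scoring function" ingredient, as a minimal best-of-N sketch; `generate_cot` and `score` are hypothetical stand-ins for the policy LLM and a learned verifier, nothing here is confirmed about o1's actual internals:

```python
import random

# Hypothetical stand-ins: in a real system these would be an LLM sampling
# a chain of thought at some temperature, and a learned verifier / reward
# model scoring the result.
def generate_cot(prompt: str, temperature: float) -> str:
    return f"draft reasoning (t={temperature}) -> answer {random.randint(0, 9)}"

def score(prompt: str, cot: str) -> float:
    return random.random()  # a trained scoring function would go here

def best_of_n(prompt: str, n: int = 16, temperature: float = 0.8) -> str:
    """Sample n independent chains of thought, keep the highest-scoring one."""
    candidates = [generate_cot(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("What is 17 * 24?"))
```

The same scorer can also act as the reward signal during RL finetuning, which is one plausible reading of the "RL + CoT + scoring function" recipe.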
Public narrative about AI is shifting now https://x.com/burny_tech/status/1835091400985096337 "o1 makes it abundantly clear that only OpenAI's internal models will be truly useful for civilizational-level change. Their internal capabilities now far exceed their publicly shipped products and that gap will continue to grow." https://x.com/SmokeAwayyy/status/1835012208587423788 If they released the full o1 model... maybe in a month though? https://fxtwitter.com/main_horse/status/1834333269128872365 [[AINews] o1: OpenAI's new general reasoning models • Buttondown](https://buttondown.com/ainews/archive/ainews-o1-openais-new-general-reasoning-models/) Authors: https://x.com/markchen90/status/1834343908035178594 Crushing the mathematics and informatics olympiads https://x.com/burny_tech/status/1834327275946361099 https://x.com/burny_tech/status/1834321466105770184 OpenAI observed interesting instances of reward hacking in their new model 🤔 https://x.com/burny_tech/status/1834324288402243655 Well, there goes the "AI agent unexpectedly and successfully exploits a configuration bug in its training environment as the path of least resistance during cyberattack capability evaluations" milestone. https://x.com/davidad/status/1834454815092449299 (Even though I'm very excited about the new AI model, I think certain risks are very real and that we should also accelerate research on reverse engineering and mechanistic interpretability of these AI systems, so that we can steer them properly in contexts where we need to steer them!) https://x.com/tensor_fusion/status/1834561918712856603 https://x.com/ShakeelHashim/status/1834292287087485425 How far can inference-time compute go? New scaling laws; the Bitter Lesson makes a comeback https://x.com/burny_tech/status/1834289214776565844 https://x.com/DaveShapi/status/1835117776920334703 In another 6 months we will possibly have o1 (full), Orion/GPT-5, Claude 3.5 Opus, Gemini 2 (maybe with AlphaProof and AlphaCode integrated), Grok 3, possibly Llama 4 [In another 6 months we will possibly have o1 (full) : r/singularity](https://www.reddit.com/r/singularity/comments/1fgnfdu/in_another_6_months_we_will_possibly_have_o1_full/) This is gonna be the hottest winter on record. https://x.com/DaveShapi/status/1834986252359049397 https://x.com/slow_developer/status/1834958266998157547 A couple of PRs to the OpenAI codebase were already authored solely by o1! https://x.com/lukasz_kondr/status/1834643103397167326 Technically, initial recursive self-improvement from the new OpenAI o1 model: it made nontrivial contributions to frontier AI research and development. https://x.com/burny_tech/status/1834735949101600770 https://x.com/huybery/status/1834291444540194966 Step change in coding, math, physics! My feed is full of people praising o1 for being much better at math than previous models! I didn't believe LLMs could get so much better at math! I was wrong, once again! Do not underestimate the bitter God of Scale and AlphaZero-like RL! And we have not reached the peak of inference-time compute scaling laws! The future will be interesting! Looking forward to more progress in AI x math! https://x.com/burny_tech/status/1834748815913398462 THERE IS NO PLATEAUING! WE'RE JUST GETTING STARTED WITH O1, ALPHAPROOF AND SIMILAR NEW AI SYSTEMS!
[AI achieves silver-medal standard solving International Mathematical Olympiad problems - Google DeepMind](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/) My feed is full of: the OpenAI o1 model is a step change in mathematics, a step change in programming, a step change in physics, etc. Big. https://x.com/burny_tech/status/1834957917033730188 https://x.com/IntuitMachine/status/1835240555028136092 https://x.com/teortaxesTex/status/1834725127029932081 [Terence Tao: "I have played a little bit with OpenAI's new iter…" - Mathstodon](https://mathstodon.xyz/@tao/113132502735585408) Terence Tao: a step change from a completely incompetent graduate student to a mediocre, but not completely incompetent, graduate student; the first time a model identified and used Cramér's theorem https://x.com/omarsar0/status/1834315401812910195 Simple math https://x.com/burny_tech/status/1834350997256187971 [Learning to Reason with LLMs | Hacker News](https://news.ycombinator.com/item?id=41523070) Fixing a Bluetooth protocol https://x.com/anderssandberg/status/1834536105527398717 Math https://x.com/robertghrist/status/1834564488751731158 A new mathematical proof with o1?! https://x.com/QiaochuYuan/status/1834341057099948170 "first LLM i've tested that can compute the fundamental group of the circle" https://x.com/emollick/status/1835342797722767592 [ChatGPT o1 preview + mini Wrote My PhD Code in 1 Hour* - What Took Me ~1 Year - YouTube](https://youtu.be/M9YOO7N5jF8) Astrophysics https://x.com/realGeorgeHotz/status/1835228364837470398 Programming: it's "a mediocre, but not completely incompetent, software engineer" https://x.com/scottastevenson/status/1834408343395258700 Law https://x.com/DeryaTR_/status/1834630356286558336 Medical stuff https://x.com/aj_dev_smith/status/1835521394659983477 Music https://x.com/holdenmatt/status/1835031749706785258 Happy mathematician [I used o1-mini everyday for coding against Claude Sonnet 3.5 so you don't have to - my thoughts : r/ClaudeAI](https://www.reddit.com/r/ClaudeAI/comments/1fhjgcr/i_used_o1mini_everyday_for_coding_against_claude/) Coding https://x.com/AravSrinivas/status/1834786331194802407 "prompts where you feel o1-preview outperformed sonnet-3.5 that's not a puzzle or a coding competition problem but your daily usage prompts 🧵" Implementation details of o1: [GitHub - hijkzzz/Awesome-LLM-Strawberry: A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 and reasoning techniques.](https://github.com/hijkzzz/Awesome-LLM-Strawberry) Democratization of o1 is happening. [Learning to Reason with LLMs | OpenAI](https://openai.com/index/learning-to-reason-with-llms/) Trained with reinforcement learning, chain of thought, self-correction, sampling with a scoring function, etc. Inference-time compute, maybe hardwired into the architecture, maybe looping of tokens through the layers multiple times, search, graph of thought, self-correction, etc. https://x.com/DrJimFan/status/1834279865933332752 https://x.com/polynoamial/status/1834280155730043108 Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. [[2407.21787v1] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling](https://arxiv.org/abs/2407.21787v1)
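The core metric in this repeated-sampling line of work is coverage, i.e. pass@k. A minimal sketch of the standard unbiased estimator (the formula from Chen et al. 2021, reused by the Large Language Monkeys paper); the example numbers are illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions of which c are
    correct, the probability that at least one of k drawn samples is correct,
    i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-draw with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples, 11 correct: how often does a best-of-10 draw contain a hit?
print(pass_at_k(n=200, c=11, k=10))  # ~0.43
```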
[[2408.03314] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](https://arxiv.org/abs/2408.03314) https://x.com/rohanpaul_ai/status/1835443326205517910 https://x.com/terryyuezhuo/status/1834286548571095299 ReFT: Reasoning with Reinforced Fine-Tuning [[2401.08967] ReFT: Reasoning with Reinforced Fine-Tuning](https://arxiv.org/abs/2401.08967) Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning [[2402.05808] Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning](https://arxiv.org/abs/2402.05808) https://x.com/iamgingertrash/status/1834297595486675052 Tree search distillation + RL post-training! https://x.com/rm_rafailov/status/1834291016192360743 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [[2403.09629] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking](https://arxiv.org/abs/2403.09629) https://x.com/laion_ai/status/1834564564601729421 Let's Verify Step by Step [[2305.20050] Let's Verify Step by Step](https://arxiv.org/abs/2305.20050) ['Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon) - YouTube](https://www.youtube.com/watch?v=hZTZYffRsKI) [Reverse engineering o1 architecture with a little... : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1fgr244/reverse_engineering_o1_architecture_with_a_little/) [Reverse engineering OpenAI's o1 - by Nathan Lambert](https://www.interconnects.ai/p/reverse-engineering-openai-o1) When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [[2406.01297] When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs](https://arxiv.org/abs/2406.01297) https://x.com/_xjdr/status/1835352391648158189 https://x.com/sytelus/status/1835433363882270922 o1 developer AMA summary. Trying my favorite prompts: "You are the most knowledgeable polymath multidisciplinary scientist that is a perfect generalist and specializes in everything and knows how everything works. Write a gigantic article about all of science from first principles" https://x.com/burny_tech/status/1834334382888218937 Quantum gravity https://x.com/burny_tech/status/1834333794477973821 Riemann hypothesis https://x.com/burny_tech/status/1834332769129726038 Maps https://x.com/burny_tech/status/1834768077717680135 Will it live up to its hype or be the biggest collective blueball in the history of collective blueballs? It lived up to its hype https://x.com/burny_tech/status/1834279980744016180 Quote: "Many tasks don't need reasoning" Absolutely cooking some jobs lmao https://x.com/burny_tech/status/1834288198731645422 "OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots" https://x.com/polynoamial/status/1834280969786065278 How far can scaling go?
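A minimal sketch of the process-reward reranking idea from "Let's Verify Step by Step": score each reasoning step with a process reward model (PRM) and pick the sampled chain whose steps the PRM trusts most. `prm_step_score` is a hypothetical stub for a trained PRM, not the paper's actual model:

```python
import random
from typing import List

# Hypothetical stub for a trained process reward model (PRM): estimates
# P(this reasoning step is correct | problem, steps so far).
def prm_step_score(problem: str, prior_steps: List[str], step: str) -> float:
    return random.uniform(0.5, 1.0)  # a real PRM is a learned model

def chain_score(problem: str, steps: List[str]) -> float:
    # Aggregate per-step scores; the paper reranks sampled solutions by
    # per-step correctness (product here; minimum is another common choice).
    score = 1.0
    for i, step in enumerate(steps):
        score *= prm_step_score(problem, steps[:i], step)
    return score

def rerank(problem: str, chains: List[List[str]]) -> List[str]:
    """Pick the sampled chain of thought the PRM trusts most."""
    return max(chains, key=lambda chain: chain_score(problem, chain))

chains = [["expand the square", "cancel terms", "conclude"],
          ["guess the answer", "assert it is right"]]
print(rerank("prove (a+b)^2 = a^2 + 2ab + b^2", chains))
```

The paper's headline result is that this kind of process supervision beats outcome-only reward models on MATH, which is one reason people suspect a PRM-like component behind o1.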
[Imgur: The magic of the Internet](https://imgur.com/3tDPyT4) Jailbreaking got harder. You're gonna have to step up your game @elder_plinius https://x.com/burny_tech/status/1834324442123497924 Jailbroken, no model is immune to Pliny https://x.com/elder_plinius/status/1834381507978280989 https://x.com/DrJimFan/status/1834284702494327197 https://x.com/burny_tech/status/1834291805690503424 https://x.com/burny_tech/status/1834311367454519515 https://x.com/jam3scampbell/status/1834285523546058973 Level 2 Reasoners are here. Next up: Level 3 Agents. https://x.com/SmokeAwayyy/status/1834327038561587279 [Something New: On OpenAI's "Strawberry" and Reasoning](https://www.oneusefulthing.org/p/something-new-on-openais-strawberry) "only 4 years ago the best language model in the world was gpt-2 xl. can you imagine where we might be 4 years from now?" https://x.com/willdepue/status/1834309302598971834 The advantage of OpenAI having unthrottled, internal access to o1 cannot be overstated. https://x.com/BenjaminDEKR/status/1834322459354337519 https://x.com/burny_tech/status/1834664364990673017 Cursor, Claude 3.5 Sonnet, and Replit get less trendy https://x.com/dkardonsky_/status/1834281667512746468 Integration with Cursor https://x.com/mckaywrigley/status/1834311328045170862 https://x.com/cursor_ai/status/1834665828308205661 "the new o1 model looks amazing but luckily it has a phd level intelligence so our jobs are safe for now" https://x.com/netcapgirl/status/1834290758930600069 No more patience, Jimmy https://x.com/sama/status/1834276403270857021 You all know what this means: the demand for *fast inference compute* is about to explode. https://x.com/tunguz/status/1834366040437895257 Which lab/team will be the next to release a reasoning AI model? https://x.com/tunguz/status/1834363884326490204 Scaling works, Situational Awareness was right, Leopold Aschenbrenner was right. Just look at the fucking line! https://x.com/jackgwhitaker/status/1834284617165316434 https://x.com/burny_tech/status/1835365831661740398 "get back to work", "ai is thinking!" https://x.com/yonashav/status/1834325806509949077 The goalposts shall keep moving until the Kardashev scale improves https://x.com/BasedBeffJezos/status/1834292166924943457 Stochastic parrots can fly so high https://x.com/8teAPi/status/1834321503992869177 Future models will think for weeks. Don't die. https://x.com/iruletheworldmo/status/1834330060205294007 "We're only beginning to understand this new paradigm of CoT-LLMs. There're so many new phenomena to study, research on it will be very exciting. You know it's a start of something good when your first model (with extra tuning) gets 93% on AIME'24 and does IOI-level coding :)" https://x.com/lukaszkaiser/status/1834283634888724563 How many startups were wrecked today? https://x.com/tunguz/status/1834324723250970802 oh husbant, you asked gpt-o1-preview model on api to solve the p vs np problem and it thought about it for a week. our api bill shows a usage of $10K USD and now we are homeress https://x.com/dejavucoder/status/1834316507058168091 Hope you guys have strapped your seatbelts.
https://x.com/tunguz/status/1834301242656297138 The year is 2027 and OpenAI just dropped AGI, but no one noticed because it was called gpt-5.5-o3-large2-preview-2027-09-06 https://x.com/tylertracy321/status/1834286741202894985 haha gpus go bitterrr https://x.com/burny_tech/status/1834616064178602321 it's so over bros, it has been 12 hours since openai announced o1 and it has so far failed to solve - Riemann hypothesis - Quantum Gravity - FTL (Faster Than Light travel) - P=NP - Grand Unified Theory - Cure for cancer clearly this shows ai has hit a wall and openai is about to go bankrupt https://x.com/basedjensen/status/1834462070395601094 New paradigm https://x.com/willdepue/status/1834294935497179633 This is what Ilya saw, the path to AGI https://x.com/WilliamBryk/status/1834614138955526440 It may be that today's large neural networks have enough test-time compute to be slightly conscious https://x.com/markchen90/status/1834623248610521523 Intelligence is thermodynamics. https://x.com/BasedBeffJezos/status/1834486894836470199 Everything is thermodynamics. https://x.com/burny_tech/status/1835424059116675470 The holy reasoning war of nerds https://x.com/burny_tech/status/1834721690271858729 May the God of Scaling be on our side https://x.com/burny_tech/status/1834726584927813827 OpenAI research engineer interview coding questions destroyed by o1 https://x.com/burny_tech/status/1834738026691375381 The naming is absurd [Imgur: The magic of the Internet](https://imgur.com/4tauXCt) Deep learning is hitting a wall! (But it's a bit more neurosymbolic, so props to you, Gary!) Falsified Gary Marcus's prediction of no step change this year https://x.com/GaryMarcus/status/1766871625075409381 https://x.com/mealreplacer/status/1834292016462610507 Machine Learning Street Talk says it's not true reasoning as they define it precisely (must be Turing complete, must acquire and generate new knowledge) https://x.com/MLStreetTalk/status/1834286363476476391 https://x.com/MLStreetTalk/status/1834293397936394726 [Is o1-preview reasoning? - YouTube](https://www.youtube.com/watch?v=nO6sDk6vO0g) Machine qualia will soon become relevant [Imgur: The magic of the Internet](https://imgur.com/3VVGf2H) Adding web, math engines, and other symbolic engines, or engines for physics, would be even more powerful: Perplexity on steroids, AlphaProof on steroids [AI achieves silver-medal standard solving International Mathematical Olympiad problems - Google DeepMind](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/) So what is the road to AGI now?
1) Learn all human knowledge and reasoning patterns in distribution, overfit the whole world? With more modalities. That would already be AGI and more, because no single human possesses the sum of all human knowledge and reasoning patterns that you could then retrieve from.
2) For more superhuman reasoning performance, more RL methods similar to AlphaZero that require little or zero human input via self-play, which the new OpenAI o1 model partly used via its reward network.
3) Implement more graph-of-thought iterative reasoning in both training and test-time compute (see the search sketch after this list).
4) Synthetic data. Automatic labeling. Massive parallel training in simulations, like Nvidia does.
5) More scaling.
6) More neurosymbolic approaches like AlphaProof.
...it will be compute *and* algorithms *and* data
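For item 3, a minimal sketch of what graph/tree-of-thought style test-time search could look like: best-first search over partial chains of thought under a fixed compute budget. `propose_steps`, `value`, and `is_solution` are hypothetical stubs for an LLM step proposer, a value model, and a termination check; this is one plausible shape of the idea, not o1's confirmed mechanism:

```python
import heapq
from typing import List

# Hypothetical stubs standing in for an LLM that proposes candidate next
# reasoning steps and a value model that scores partial chains.
def propose_steps(problem: str, chain: List[str], breadth: int) -> List[str]:
    return [f"step {len(chain) + 1}.{i}" for i in range(breadth)]

def value(problem: str, chain: List[str]) -> float:
    return 1.0 / (1.0 + len(chain))  # placeholder heuristic

def is_solution(chain: List[str]) -> bool:
    return len(chain) >= 3  # placeholder termination test

def tree_of_thought(problem: str, breadth: int = 3, budget: int = 50) -> List[str]:
    """Repeatedly expand the highest-value partial chain of thought,
    spending at most `budget` expansions of test-time compute."""
    frontier = [(-value(problem, []), [])]  # max-heap via negated scores
    chain: List[str] = []
    for _ in range(budget):
        if not frontier:
            break
        neg_score, chain = heapq.heappop(frontier)
        if is_solution(chain):
            return chain
        for step in propose_steps(problem, chain, breadth):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-value(problem, new_chain), new_chain))
    return chain  # best effort if the budget is exhausted

print(tree_of_thought("prove the lemma"))
```

Swapping the best-first frontier for MCTS-style rollouts, or distilling the found chains back into the model, gives the "tree search distillation + RL post-training" recipe speculated about above.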