GLM-5.2 moves from GRPO (Group Relative Policy Optimization) to PPO (Proximal Policy Optimization)

On X: [Burny - Effective Curiosity on X](https://x.com/burny_tech/status/2068395760509460768)

On Substack: [GLM-5.2 moves from GRPO (Group Relative Policy Optimization) to PPO (Proximal Policy Optimization)](https://substack.com/@burny/p-204898029)

GLM-5.2 was just released, and it's currently the best open-weight LLM. In long-horizon agentic reinforcement learning for LLMs, GLM-5.2 moves from GRPO (Group Relative Policy Optimization) to PPO (Proximal Policy Optimization).

In their original GRPO paper, DeepSeek showed that GRPO can do pretty much the same as PPO without the critic, and GRPO then became very popular in LLM RLVR (reinforcement learning with verifiable rewards) broadly. But the advantage of GRPO over PPO, or of critic-free over critic-based methods, doesn't seem to be universal; it depends on the task distribution you're in, among other things.

GLM-5.2 is doing compaction-heavy, long-horizon agentic reinforcement learning with a variable number of trainable sub-traces per prompt, where it's harder to do group-wise comparisons. PPO seems to be a more natural fit here. So GRPO and PPO (or critic-free and critic-based methods) are differently shaped hammers for differently shaped nails. But the boundary is fuzzy.

In GLM-5.2's report (https://z.ai/blog/glm-5.2), they describe a long-horizon agentic RL setting where "long-horizon tasks produce substantially longer execution traces" and "once a super-long trajectory is split by compaction into multiple sub-traces, different rollouts under the same prompt yield different numbers of trainable traces with highly variable lengths."
Because of this, they "move from group-wise optimization (GRPO) to a critic-based PPO formulation that learns from individual rollouts," relying on "a critic to estimate token-level advantages rather than group-relative comparisons." In this regime, instead of a clean fixed-size group of comparable whole responses, they have compacted sub-traces whose number and length can vary across rollouts from the same prompt. They say "this single-rollout formulation fits compaction naturally" because it "places no constraint on how many traces a prompt produces or on their relative lengths." They also "bring compaction into training by including all compacted sub-traces as trainable trajectories," and apply "a token-level loss to address their length imbalance."

![[Images/862775d5b0dece803806ad10de16811b_MD5.jpg]]
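
To make the contrast concrete, here is a minimal Python sketch of the two ways of getting advantages, plus the effect of a token-level loss. It is my own illustration, not code from the GLM-5.2 report. The function names are hypothetical, the GRPO normalization uses the population standard deviation as an assumption, the critic-based version uses standard GAE with the value after the last token taken as 0, and "token-level loss" is interpreted here as averaging over all tokens pooled across sub-traces.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: one scalar per rollout of the same prompt.

    Needs a group of comparable rollouts; each reward is standardized
    against the group mean and (population) standard deviation.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Token-level advantages for a single trace via GAE.

    rewards[t] is the reward at token t, values[t] is the critic's
    estimate at token t. No other rollouts are needed, so the number
    and length of traces per prompt is unconstrained.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Assumption: the value after the final token is 0.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def token_level_mean(per_trace_losses):
    """Average over all tokens pooled across traces (every token weighs the same)."""
    flat = [x for trace in per_trace_losses for x in trace]
    return sum(flat) / len(flat)


def trace_level_mean(per_trace_losses):
    """Average of per-trace means (every trace weighs the same, whatever its length)."""
    return mean(mean(trace) for trace in per_trace_losses)


if __name__ == "__main__":
    # GRPO: needs several rollouts of one prompt to compare against each other.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    # PPO-style: one sub-trace plus a critic is enough.
    print(gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5]))
    # A long and a short sub-trace: the two averaging schemes weigh them differently.
    losses = [[1.0, 1.0, 1.0, 1.0], [0.0]]
    print(token_level_mean(losses), trace_level_mean(losses))
```

The point of the sketch is the shape of the inputs: `grpo_advantages` only makes sense when you have a fixed set of comparable rollouts, while `gae_advantages` works on any single sub-trace, which is why a critic-based setup doesn't care how many sub-traces compaction produces or how long they are.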