Against RL: The Case for System 2 Learning
People are excited about DeepSeek-R1 (and o1 and o3). It seems that RL proper, not just the weak approximation that is RLHF, might finally be starting to work for LLMs.
But for building safe superintelligent systems, RL is at best an intermediate step, and at worst an incredibly unsafe