Owen Oertell, Wenhao Zhan*, Gokul Swamy, Zhiwei Steven Wu, Kiante Brantley, Jason Lee, Wen Sun*
<aside>
Recent work has shown that for particular combinations of base model and training algorithm, reinforcement learning with random rewards (RLRR) improves the performance of LLMs on certain math reasoning benchmarks. This result is surprising as the (expected) policy gradient is exactly zero for RLRR, since all policies look the same under a random reward function. In response, we use RLRR as a diagnostic task for evaluating how well different classes of RL algorithms follow this true policy gradient. First, we demonstrate that algorithms that follow the (natural) policy gradient (e.g. RLoo (Kool et al., 2019) or REBEL (Gao et al., 2024)) produce the expected behavior: performance stays flat with random rewards and only increases when provided with ground-truth rewards. Second, we show that rather than holding steady, heuristic policy gradients like PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024) can either increase or decrease the reasoning performance of the model considerably. Third, we demonstrate that on a didactic bandit problem (a problem that has nothing to do with LLMs or reasoning), GRPO exhibits a bias towards choices that were more likely under the base policy, while the vanilla REINFORCE policy gradient (Williams, 1992) has no such tendencies. Taken together, our results underscore the importance of the choice of RL algorithm when making claims about LLM reasoning and beyond.
Code: https://github.com/Owen-Oertell/heuristics_harmful
W&B Logs: https://wandb.ai/cornell-npg/random-rewards-reasoning
</aside>
Recent work from Shao et al. (2025) shows that for certain specific base models and settings (e.g., Qwen base models with a chat template), the RL algorithm GRPO (Shao et al., 2024) can improve the performance of the policy on certain mathematical reasoning tasks, even when provided with random rewards. This observation is surprising because, by construction, the reward function provides no information about the quality of the actual generations. A bit more formally, when performing RL with random rewards (RLRR), all policies look equally good. Thus, the true policy gradient is exactly zero in expectation, as there is no direction in which one could change the policy that would change performance (Sutton et al., 1999). One should therefore expect a flat training curve averaged over seeds, rather than a clear performance change.
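To make the zero-gradient claim concrete, consider a reward that is drawn independently of the prompt-generation pair, with mean $\bar{r}$ (we introduce the symbol $\bar{r}$ only for this sanity check). The standard score-function identity then gives

$\mathbb{E}_{y \sim \pi_{\theta}(\cdot|x)}\left[\nabla_{\theta} \log \pi_{\theta}(y|x)\, \bar{r}\right] = \bar{r} \sum_{y} \pi_{\theta}(y|x)\, \nabla_{\theta} \log \pi_{\theta}(y|x) = \bar{r}\, \nabla_{\theta} \sum_{y} \pi_{\theta}(y|x) = \bar{r}\, \nabla_{\theta} 1 = 0$

i.e. the expected update direction vanishes for every policy $\pi_{\theta}$.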
In response, we ablate the dependence of this phenomenon on the choice of RL algorithm used for the policy update, using RLRR as a diagnostic task to elucidate the behavior of different kinds of RL algorithms. Specifically, we compare two classes of algorithms. The first, (natural) policy gradients (e.g. RLoo (Kool et al., 2019), REBEL (Gao et al., 2024)), can be interpreted as following an unbiased estimate of the gradient of the expected performance of the policy with respect to the policy parameters. The second, heuristic policy gradients (e.g. PPO (Schulman et al., 2017), GRPO (Shao et al., 2024)), are not always unbiased estimates of the true gradient due to heuristics like clipping or normalization. We provide evidence that the inclusion of these heuristics is what leads to the unexpected changes in policy performance on certain tasks.
We conduct experiments on the MATH dataset using Qwen2.5 MATH 7B models. We find that, as expected, RLoo / REBEL only change performance with the ground-truth reward and preserve policy performance with random rewards. In contrast, we find that the heuristic policy gradients PPO and GRPO exhibit the unexpected behavior of dramatically changing policy performance with random rewards. For example, GRPO improves policy performance under one random seed but decreases it under another of the seeds we ran. Our results indicate that these heuristic policy gradients are not following the true policy gradient, and therefore should be used with caution when making and evaluating claims about LLM reasoning.
We then attempt to further isolate the core phenomenon of interest by performing experiments on a didactic bandit problem, i.e. a problem that has nothing to do with LLMs or reasoning. Specifically, we construct a two-armed bandit problem where both arms are equally good in expectation. While the standard policy gradient REINFORCE (Williams, 1992) behaves as expected, we surprisingly see GRPO exhibit a bias towards whichever action was more likely under the base policy, echoing some findings in the appendix of Shao et al. (2025). These results imply that heuristic PGs can cause unexpected behavior, even on the simplest of RL problems.
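For concreteness, below is a minimal sketch of this kind of bandit diagnostic. It assumes Bernoulli rewards with identical means for both arms, a softmax policy over two logits initialized to favor arm 0, and illustrative hyperparameters (group size, learning rate, clip range, inner epochs); it is a sketch of the setup rather than the exact configuration we ran, contrasting an RLoo-style REINFORCE update with a GRPO-style update (standard-deviation-normalized advantages plus clipping over multiple inner epochs).

```python
# Sketch of a two-armed bandit diagnostic: both arms are equally good in
# expectation, so in expectation no update rule should move the policy.
import torch

def run(update="reinforce", steps=500, k=8, lr=0.05, inner_epochs=4,
        clip_eps=0.2, seed=0):
    torch.manual_seed(seed)
    logits = torch.tensor([1.0, 0.0], requires_grad=True)  # base policy favors arm 0
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            probs_old = torch.softmax(logits, dim=0)
            actions = torch.multinomial(probs_old, k, replacement=True)
            rewards = torch.bernoulli(torch.full((k,), 0.5))  # both arms equally good
            logp_old = torch.log(probs_old[actions])
        if update == "reinforce":
            # RLoo-style leave-one-out baseline, single on-policy gradient step.
            baseline = (rewards.sum() - rewards) / (k - 1)
            adv = rewards - baseline
            logp = torch.log_softmax(logits, dim=0)[actions]
            loss = -(logp * adv).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        else:
            # GRPO-style: std-normalized advantages + clipping, multiple inner epochs.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
            for _ in range(inner_epochs):
                logp = torch.log_softmax(logits, dim=0)[actions]
                ratio = torch.exp(logp - logp_old)
                clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
                loss = -torch.min(ratio * adv, clipped).mean()
                opt.zero_grad(); loss.backward(); opt.step()
    return torch.softmax(logits, dim=0)[0].item()  # final probability of arm 0

print("REINFORCE  P(arm 0):", run("reinforce"))
print("GRPO-style P(arm 0):", run("grpo"))
```

Tracking the returned probability of arm 0 across seeds is what lets one check whether an update rule drifts towards the initially preferred action under uninformative rewards.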
Taken together, our results underscore the importance of the choice of RL algorithm when making or evaluating potentially surprising claims about LLM reasoning and beyond.
We compare four policy gradient algorithms in this note. The first, RLoo (Kool et al., 2019), is the REINFORCE policy gradient with a leave-one-out estimator for the baseline:
$\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \mathbb{E}_{y \sim \pi_{\theta}(\cdot|x)}[r(x, y)] \approx \frac{1}{k} \sum_{i=1}^{k} \nabla_{\theta} \log \pi_{\theta}(y_i| x)\left(r(x, y_i)-\frac{1}{k-1}\sum_{j \neq i}r(x, y_j)\right)$
Observe that this is an unbiased estimate of the vanilla / true policy gradient.
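As a concrete reference, here is a minimal sketch of the corresponding loss for a single prompt, assuming the per-sample log-probabilities $\log \pi_{\theta}(y_i|x)$ and scalar rewards are already available as tensors (the function and argument names are ours, and details such as KL regularization are omitted):

```python
import torch

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs, rewards: shape (k,), one entry per sampled response y_i for prompt x."""
    k = rewards.shape[0]
    # Leave-one-out baseline: the mean reward of the other k - 1 samples.
    baseline = (rewards.sum() - rewards) / (k - 1)
    advantages = (rewards - baseline).detach()
    # Minimizing this loss takes a step along the unbiased policy-gradient estimate above.
    return -(logprobs * advantages).mean()
```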
The second, REBEL (Gao et al., 2024), can be seen as a generalization of the Natural Policy Gradient (Kakade, 2002). REBEL solves the following regression problem at each round:
$\argmin_{\theta}\mathbb{E}_{\mathcal{D}}\left(\frac{1}{\eta}\left(\ln \frac{\pi_{\theta}(y|x)}{\pi_{t-1}(y|x)} - \ln \frac{\pi_{\theta}(y'|x)}{\pi_{t-1}(y'|x)}\right) - \left(r(x, y) - r(x, y')\right) \right)^2$
As argued by Gao et al. (2024), this can be seen as a principled, regression-based approximation of the idealized online mirror descent (OMD) update that NPG also attempts to approximate. In particular, if one solves this regression problem exactly, one exactly implements the OMD update. If one instead solves it with a single step of the Gauss-Newton method, one recovers the natural policy gradient update. Gao et al. (2024) also analyze how regression error translates to policy performance.
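A minimal sketch of this regression loss for a single prompt $x$ and a pair of responses $(y, y')$, assuming sequence-level log-probabilities under $\pi_{\theta}$ and $\pi_{t-1}$ are available as tensors (the names, batching, and choice of $\eta$ are illustrative):

```python
import torch

def rebel_loss(logp_theta_y, logp_prev_y, logp_theta_yp, logp_prev_yp,
               reward_y, reward_yp, eta: float = 1.0):
    # Difference of log-ratios between the two responses under pi_theta vs. pi_{t-1}.
    log_ratio_diff = (logp_theta_y - logp_prev_y) - (logp_theta_yp - logp_prev_yp)
    # Regress (1 / eta) times that difference onto the reward difference.
    return ((log_ratio_diff / eta - (reward_y - reward_yp)) ** 2).mean()
```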
We now introduce our two heuristic policy gradient algorithms. The first, PPO (Schulman et al., 2017), uses clipped importance weights to enable a greater degree of off-policyness than vanilla REINFORCE, at the cost of potentially not being equal to the true policy gradient after multiple gradient steps on the same batch of data:
$\nabla_{\theta}\, \mathbb{E}_{y \sim \pi_{t-1}(\cdot|x)} \left[\min\left( \frac{\pi_{\theta}(y|x)}{\pi_{t-1}(y|x)} A^{t-1}(x, y),\ \text{clip}\left(\frac{\pi_{\theta}(y|x)}{\pi_{t-1}(y|x)},1-\epsilon, 1+\epsilon\right)A^{t-1}(x, y)\right)\right]$
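In code, the clipped surrogate whose gradient is the expression above can be sketched as follows, assuming the advantage estimates $A^{t-1}(x, y)$ come from elsewhere (e.g. a learned critic) and that log-probabilities are given as sequence-level tensors; the names and clip range are illustrative:

```python
import torch

def ppo_clip_loss(logp_theta, logp_old, advantages, clip_eps: float = 0.2):
    ratio = torch.exp(logp_theta - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated so that minimizing the loss maximizes the clipped surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```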
The second, GRPO (Shao et al., 2024), simplifies the PPO update by replacing the learned critic / advantage estimator with a group-based advantage estimate similar to RLoo's:
$\hat{A}_{i, t} = \frac{r(x, y_i) - \text{mean}(r(x, y_1), \dots, r(x, y_k))}{\text{std}(r(x, y_1), \dots, r(x, y_k))}$
GRPO still uses clipping, which means there are situations in which it will not match the true policy gradient in expectation. Standard PPO and GRPO implementations also differ in how they estimate the KL divergence to the reference policy; please see the above papers for more details.
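Putting the pieces together, here is a minimal sketch of GRPO's group-normalized advantage plugged into the same clipped surrogate, assuming one scalar reward per sampled response and sequence-level log-probabilities (the small epsilon and clip range are illustrative, and the KL term is omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (k,), one scalar reward per response sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_theta, logp_old, rewards, clip_eps: float = 0.2):
    adv = grpo_advantages(rewards).detach()
    ratio = torch.exp(logp_theta - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```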