3.6 From Value to Policy
What This Section Solves
Core content
- Limits of value-based RL: $\arg\max_a Q(s, a)$ is not directly solvable in continuous or extremely large action spaces.
- Parameterized policies: model behavior directly with $\pi_\theta(a \mid s)$ instead of an explicit table.
- Policy objective and gradients: define $J(\theta)$ as the policy's expected return; the policy gradient theorem gives an optimization direction.
- REINFORCE and variance: update from trajectory returns, but variance is high and baselines are needed for stability.
The previous section focused on value-based reinforcement learning: we store an estimate $Q(s, a)$ for each state-action pair and derive a policy by
$$\pi(s) = \arg\max_a Q(s, a).$$
This works well when the action space is small. In CartPole there are only two actions (push left / push right), so comparing values is easy.
But the key operation above hides an assumption:
actions must be enumerable.
Once the action space becomes continuous or too large, this assumption breaks. A robot arm might output torques for 6 joints:
$$a = (a_1, a_2, \dots, a_6) \in \mathbb{R}^6.$$
You can train a network $Q(s, a)$ that scores actions, but the real difficulty comes next: how do you solve
$$\max_a Q(s, a)$$
over infinitely many $a$?
In autonomous driving, throttle and steering are continuous. In language modeling, the action space is a vocabulary with tens of thousands of tokens. In these settings, “score every action then take the maximum” becomes computationally and algorithmically awkward.
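To make the scaling problem concrete, here is a small sketch of what happens if one tries to rescue the argmax by discretizing each joint torque onto a grid. The function name and the bin counts are illustrative assumptions, not values from this section.

```python
def num_joint_actions(bins_per_joint: int, num_joints: int = 6) -> int:
    """Number of actions to score if every joint torque is discretized."""
    return bins_per_joint ** num_joints


# Even a coarse grid explodes once six joints are combined.
for bins in (3, 11, 101):
    print(bins, num_joint_actions(bins))
```

The count grows exponentially in the number of joints, which is why scoring every action stops being practical long before the grid becomes fine enough for smooth control.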
So if what we ultimately need is an agent that acts, why insist on learning a score table first?
The core idea is:
model the policy directly.
Instead of deriving behavior indirectly from a value function, we let a policy network output an action distribution (or a continuous action) and optimize its parameters using the return signal from the environment. This is the policy-based route.
Core concept
There are two common routes in RL.
Value-based methods learn action values $Q(s, a)$ and choose actions via $\arg\max_a Q(s, a)$ (Q-learning, DQN). They are a natural fit when actions are few and comparable, and they often reuse old data efficiently via replay buffers.
Policy-based methods learn a policy $\pi_\theta(a \mid s)$ directly: the probability of choosing action $a$ in state $s$ (REINFORCE and policy gradients). They optimize the policy's expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right],$$
where $R(\tau)$ is the return of a trajectory $\tau$ sampled by following $\pi_\theta$.
Intuitively, actions that occur in high-return trajectories increase in probability, and actions in low-return trajectories decrease. Because the policy outputs a distribution (or continuous actions) directly, this route is well-suited to robotics, autonomous driving, and LLM generation. Pure policy gradients are typically on-policy because they rely on fresh samples from the current policy. [1][^5][^6]
From Action Scores to a Behavior Distribution
In value-based RL, we estimate $Q(s, a)$ for each action and choose the action with the larger value.
Now consider a different representation. Instead of asking “which action has the highest score?”, we model how the agent behaves: what distribution over actions does it follow in state $s$?
For example, a policy might specify: in state $s$, push left with probability 70% and push right with probability 30%. In environments with more actions, the policy is a full distribution across all choices.
Early in training, this distribution is usually wrong. The agent interacts with the environment using the current distribution, receives returns, and adjusts the distribution:
- actions that frequently appear in high-return trajectories should become more probable in the corresponding states;
- actions that frequently appear in low-return trajectories should become less probable.
So the policy route shifts the question: from “which action scores highest” to “how should I act”.
Parameterized Policies
To learn from data, we introduce parameters into the policy:
$$\pi_\theta(a_t \mid s_t).$$
Here $\theta$ are the learnable parameters (typically neural network weights), $s_t$ is the state, and $a_t$ is the chosen action at time $t$.
For discrete action spaces, the network often outputs a preference score (logit) $z_\theta(s, a)$ for each action. We convert logits into a valid probability distribution using softmax:
$$\pi_\theta(a \mid s) = \frac{\exp(z_\theta(s, a))}{\sum_{a'} \exp(z_\theta(s, a'))}.$$
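As a minimal sketch of the softmax step, the snippet below converts a list of logits into action probabilities using only the standard library. The logit values are made up for illustration; in practice they would come from the policy network.

```python
import math


def softmax(logits):
    """Convert a list of action logits into a probability distribution."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three actions in one state.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Larger logits receive larger probabilities, and the outputs always sum to one, so the result can be sampled from directly.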
For continuous control, we cannot enumerate actions, so a common choice is to represent the policy as a continuous distribution, such as a Gaussian:
$$\pi_\theta(a \mid s) = \mathcal{N}\left(a;\ \mu_\theta(s),\ \sigma_\theta(s)^2\right).$$
The network outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$: the mean encodes the most likely action and the standard deviation controls stochasticity (and thus exploration).
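The following sketch shows how a one-dimensional Gaussian policy might be sampled and scored. The mean and standard deviation here are hypothetical stand-ins for network outputs, and the helper names are assumptions made for this example.

```python
import math
import random


def sample_action(mu, sigma, rng):
    """Draw a continuous action from N(mu, sigma^2)."""
    return rng.gauss(mu, sigma)


def log_prob(a, mu, sigma):
    """Log-density of action a under N(mu, sigma^2)."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))


rng = random.Random(0)
mu, sigma = 0.3, 0.5  # hypothetical network outputs for one state
a = sample_action(mu, sigma, rng)
print(a, log_prob(a, mu, sigma))
```

The log-density is the quantity a policy-gradient method would differentiate with respect to the network parameters; a larger `sigma` spreads samples further from `mu`, which corresponds to more exploration.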
Note the difference in what $Q(s, a)$ and $\pi_\theta(a \mid s)$ represent:
$Q(s, a)$ evaluates the long-term value of taking action $a$ in state $s$. $\pi_\theta(a \mid s)$ defines the probability of choosing $a$ in $s$.
The Policy Objective
Once we have a parameterized policy, the next question is: are these parameters good?
In value-based RL we compare actions inside a single state. Now we zoom out: we want to evaluate the entire policy.
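If initial states are drawn from a distribution $\rho_0$, a standard objective is
$$J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}\left[ V^{\pi_\theta}(s_0) \right].$$
Equivalently, in trajectory form, let
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$$
and
$$R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t,$$
where $r_t$ is the reward received after taking $a_t$ in $s_t$ and $\gamma \in [0, 1]$ is the discount factor.
Then
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right].$$
Read it literally: run the policy $\pi_\theta$ many times, record the return of each trajectory, and average.
The learning problem becomes a direct optimization problem:
$$\theta^* = \arg\max_\theta J(\theta).$$
Compare this with value-based selection:
- $\arg\max_a Q(s, a)$ chooses an action within a state.
- $\arg\max_\theta J(\theta)$ chooses policy parameters across all behaviors.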
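The objective can be estimated by sampling: roll out the policy, compute each trajectory's discounted return, and average. The sketch below does this for a hypothetical one-state task (action A pays reward 1, action B pays 0, and a single parameter sets the probability of A through a sigmoid). The task, the function names, and the default settings are all assumptions chosen so that the exact answer is available for comparison.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rollout_return(theta, horizon, gamma, rng):
    """Sample one trajectory and return its discounted return."""
    p_a = sigmoid(theta)
    ret = 0.0
    for t in range(horizon):
        reward = 1.0 if rng.random() < p_a else 0.0  # A pays 1, B pays 0
        ret += (gamma ** t) * reward
    return ret


def estimate_objective(theta, num_episodes=20000, horizon=5, gamma=0.9, seed=0):
    """Monte Carlo estimate of the expected return under the policy."""
    rng = random.Random(seed)
    total = sum(rollout_return(theta, horizon, gamma, rng)
                for _ in range(num_episodes))
    return total / num_episodes


def exact_objective(theta, horizon=5, gamma=0.9):
    """Closed form for this toy task: P(A) times the discount sum."""
    return sigmoid(theta) * sum(gamma ** t for t in range(horizon))


for theta in (-2.0, 0.0, 2.0):
    print(theta, estimate_objective(theta), exact_objective(theta))
```

The estimate tracks the closed form, and increasing the parameter raises the objective because it makes the rewarding action more likely. In realistic environments no closed form exists, so the sampled average is all one has.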
Trajectory Probability (Why Policy Gradients Are Subtle)
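If we want to optimize $J(\theta)$, the natural next step is to take a gradient:
$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right].$$
But unlike supervised learning, policy optimization does not operate on a fixed dataset. When $\theta$ changes, action probabilities change, which changes visited states, which changes the trajectories we sample.
So it helps to write the probability of a trajectory under parameters $\theta$:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$$
Reading left to right:
- $\rho_0(s_0)$ is the probability of the initial state.
- $\pi_\theta(a_t \mid s_t)$ is the policy's action probability.
- $p(s_{t+1} \mid s_t, a_t)$ is the environment transition probability.
Only the policy terms depend on $\theta$; the initial-state distribution and environment dynamics are properties of the environment.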
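A small tabular sketch can make this factorization tangible. The function below splits the log-probability of a trajectory into a policy part and an environment part. The two-state MDP, its probabilities, and all names are hypothetical and exist only for this illustration.

```python
import math


def trajectory_log_prob(states, actions, init_prob, policy, dynamics):
    """Return (policy_part, env_part) of a trajectory's log-probability.

    states holds s_0, ..., s_T and actions holds a_0, ..., a_{T-1}.
    """
    env_part = math.log(init_prob[states[0]])
    policy_part = 0.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        policy_part += math.log(policy[s][a])           # depends on the policy
        env_part += math.log(dynamics[(s, a)][s_next])  # fixed by the environment
    return policy_part, env_part


# Hypothetical two-state MDP.
init_prob = {"s0": 1.0}
dynamics = {
    ("s0", "L"): {"s0": 0.8, "s1": 0.2},
    ("s0", "R"): {"s1": 1.0},
}
policy_1 = {"s0": {"L": 0.7, "R": 0.3}}
policy_2 = {"s0": {"L": 0.5, "R": 0.5}}

states, actions = ["s0", "s0", "s1"], ["L", "R"]
pol_1, env_1 = trajectory_log_prob(states, actions, init_prob, policy_1, dynamics)
pol_2, env_2 = trajectory_log_prob(states, actions, init_prob, policy_2, dynamics)
print(math.exp(pol_1 + env_1), math.exp(pol_2 + env_2))
```

Changing the policy changes only the policy part; the environment part is identical for both policies, which is why the dynamics never need to be known to differentiate with respect to the policy parameters.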
In the next chapter (policy gradients), we will use this structure to derive REINFORCE and the policy gradient theorem, and then discuss variance reduction (baselines and advantage functions).
The Log-Derivative Trick
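To optimize the policy objective, write it as a sum over trajectories:
$$J(\theta) = \sum_\tau p_\theta(\tau)\, R(\tau),$$
where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under $\pi_\theta$ and $R(\tau)$ is its return.
Now take a gradient:
$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta p_\theta(\tau)\, R(\tau).$$
Once a trajectory is fixed, its states, actions, and rewards are fixed. The return $R(\tau)$ is treated as a constant with respect to $\theta$. What changes with $\theta$ is the probability of sampling that trajectory.
The expression still contains $\nabla_\theta p_\theta(\tau)$, which is awkward for sampling. We want a form like $\mathbb{E}_{\tau \sim p_\theta}[\,\cdot\,]$, because then we can estimate the expectation by drawing trajectories from the current policy.
The key identity is:
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$
It follows from
$$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}.$$
Substituting gives
$$\nabla_\theta J(\theta) = \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau),$$
or equivalently,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right].$$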
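The identity can be checked numerically on a tiny problem. The sketch below assumes a hypothetical one-step task with two actions, a sigmoid policy for action A, and fixed returns for each action; none of these specifics come from the text. It compares a finite-difference gradient of the objective with the sampled score-function estimate.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def objective(theta, r_a, r_b):
    """Exact expected return: sum over the two possible one-step trajectories."""
    p = sigmoid(theta)
    return p * r_a + (1.0 - p) * r_b


def finite_difference_gradient(theta, r_a, r_b, eps=1e-6):
    return (objective(theta + eps, r_a, r_b)
            - objective(theta - eps, r_a, r_b)) / (2.0 * eps)


def score_function_estimate(theta, r_a, r_b, num_samples=50000, seed=0):
    """Average of grad-log-prob times return over sampled actions."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_samples):
        if rng.random() < p:
            total += (1.0 - p) * r_a  # d/dtheta log sigmoid(theta) = 1 - p
        else:
            total += (-p) * r_b       # d/dtheta log(1 - sigmoid(theta)) = -p
    return total / num_samples


theta, r_a, r_b = 0.3, 1.0, 3.0
print(finite_difference_gradient(theta, r_a, r_b),
      score_function_estimate(theta, r_a, r_b))
```

The two numbers agree up to sampling noise, even though the estimator never differentiates the trajectory probability itself; it only needs samples and the gradient of the log-probability.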
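Now use the trajectory probability from the previous section, written in log form:
$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right].$$
Only the policy terms depend on $\theta$. Therefore,
$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
This yields the basic policy gradient form:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right].$$
Usually, the action at time $t$ should be credited only for rewards that occur after it, so we replace the full trajectory return with the return-to-go:
$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k.$$
The more common form is therefore
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right].$$
The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction that makes the sampled action more likely. The scalar $G_t$ decides how strongly to push in that direction.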
The plain-language reading is:
If an action was followed by high return, increase its probability in similar states. If it was followed by low return, decrease it.
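The return-to-go used as the weight for each step is the discounted sum of rewards from that step onward. A minimal sketch of computing it with a single backward pass is shown below; the function name and the example rewards are assumptions for illustration.

```python
def returns_to_go(rewards, gamma):
    """Discounted return from each step onward, computed backward in one pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # prints [1.75, 1.5, 1.0]
```

Earlier steps accumulate more future reward, so they receive larger weights, while the last step is credited only with its own reward.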
A Two-Action Example
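Consider a one-state task with two actions, A and B. Let one parameter $\theta$ control the probability of A:
$$\pi_\theta(A) = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}, \qquad \pi_\theta(B) = 1 - \sigma(\theta).$$
For this parameterization,
$$\nabla_\theta \log \pi_\theta(A) = 1 - \sigma(\theta)$$
and
$$\nabla_\theta \log \pi_\theta(B) = -\sigma(\theta).$$
Suppose $\theta = 0$, so that $\pi_\theta(A) = \pi_\theta(B) = 0.5$.
If the agent samples A and later receives a positive return $G > 0$, the update direction is proportional to
$$G \, \nabla_\theta \log \pi_\theta(A) = 0.5\, G > 0.$$
The gradient is positive, so $\theta$ increases and A becomes more probable.
If the agent samples B and later receives a positive return $G > 0$, the update direction is proportional to
$$G \, \nabla_\theta \log \pi_\theta(B) = -0.5\, G < 0.$$
The gradient is negative, so $\theta$ decreases, which lowers the probability of A and raises the probability of B.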
So policy gradients do not blindly reinforce every action. They reinforce the action that was actually sampled, and the direction depends on the return that followed it.
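The example can be replayed in a few lines of code. This sketch assumes the probability of A is a sigmoid of a single parameter, and uses a hypothetical learning rate and return value; it applies one gradient-ascent step for each sampled action.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grad_log_prob(theta, action):
    """Gradient of the log-probability of the sampled action w.r.t. theta."""
    p_a = sigmoid(theta)
    return 1.0 - p_a if action == "A" else -p_a


def reinforce_step(theta, action, ret, lr=0.1):
    """One gradient-ascent update from a single sampled action and its return."""
    return theta + lr * ret * grad_log_prob(theta, action)


theta = 0.0
after_a = reinforce_step(theta, "A", ret=1.0)  # sampled A, positive return
after_b = reinforce_step(theta, "B", ret=1.0)  # sampled B, positive return
print(sigmoid(after_a), sigmoid(after_b))
```

After sampling A the probability of A rises above 0.5, and after sampling B it falls below 0.5. With a negative return the directions flip, so the sampled action becomes less likely instead.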
CartPole and REINFORCE
In CartPole, the state is a 4D vector and the action space has two choices: push left or push right.

REINFORCE applies the policy gradient idea directly:
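- run one full episode using the current policy $\pi_\theta$;
- record states, actions, log probabilities, and rewards;
- compute return-to-go $G_t$ for every step;
- update parameters with learning rate $\alpha$:
$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t.$$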
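A minimal PyTorch implementation has the following structure (the `PolicyNet` class shown is an assumed small network, since its architecture is not specified here):
```python
import gymnasium as gym  # Gymnasium API: reset() returns (obs, info), step() returns 5 values
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Assumed minimal softmax policy for CartPole (4 state dims, 2 actions)."""
    def __init__(self, state_dim=4, hidden_dim=128, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_actions)
        )

    def select_action(self, state):
        logits = self.net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action)

env = gym.make("CartPole-v1")
policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:  # roll out one full episode with the current policy
        action, log_prob = policy.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state

    # Discounted return-to-go G_t, computed backward from the last step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalizing returns acts like a simple baseline and stabilizes updates.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Negative sign: the optimizer minimizes, but we want gradient ascent.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
The negative sign appears because PyTorch optimizers minimize losses, while policy gradient is a gradient-ascent method.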
Sparse Rewards: Why REINFORCE Can Struggle
CartPole gives a reward at every step, so the policy receives frequent feedback. Many environments are not like this.
MountainCar is a useful contrast: the car starts in a valley and must build momentum to reach the hilltop.

The reward is usually $-1$ per step until success, and a random policy almost never reaches the goal. Then most sampled trajectories look equally bad. If every episode returns roughly $-200$, REINFORCE has little information about which early actions were useful.
This is the sparse-reward problem. TD methods and Q-learning can sometimes propagate value backward from rare successful states more efficiently, while pure Monte Carlo policy gradients must wait for whole-episode returns.
Compared With Q-Learning
The two routes solve different problems well:
| Aspect | Q-Learning | REINFORCE |
|---|---|---|
| Learns | action values | policy parameters |
| Update timing | every step | after an episode |
| Data style | often off-policy | on-policy |
| Action spaces | best for finite discrete actions | works for discrete or continuous policies |
| Strength | sample reuse and TD propagation | direct policy optimization |
| Weakness | hard in continuous spaces | high variance |
They are not enemies. Actor-Critic methods combine them: an Actor learns the policy, while a Critic estimates values to stabilize the policy update.
High Variance and Baselines
Policy gradients can be noisy because returns vary across episodes. In CartPole, one rollout might last 190 steps and another 40 steps under nearly the same policy. Multiplying log-probability gradients by raw returns can make updates unstable.
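A more precise question is not:
“Was the return after this action high?”
but:
“Was the return higher than what this state would normally yield?”
This leads back to value functions. The state value $V^\pi(s)$ can serve as a baseline: the expected return from state $s$ under the current policy. The advantage function is
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$
If $A^\pi(s, a) > 0$, action $a$ is better than average in state $s$ and should become more likely. If $A^\pi(s, a) < 0$, it is worse than average and should become less likely.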
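The effect of a baseline on variance can be seen in a toy setting. The sketch below assumes a hypothetical one-state task where action A returns 11 and action B returns 9, with a sigmoid policy at equal probabilities, so the state value is 10. It compares per-sample gradient estimates with and without subtracting that baseline.

```python
import math
import random
import statistics


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradient_samples(theta, baseline, num_samples=20000, seed=0):
    """Per-sample estimates: grad-log-prob times (return minus baseline)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    samples = []
    for _ in range(num_samples):
        if rng.random() < p:
            score, ret = 1.0 - p, 11.0  # action A, hypothetical return
        else:
            score, ret = -p, 9.0        # action B, hypothetical return
        samples.append(score * (ret - baseline))
    return samples


raw = gradient_samples(0.0, baseline=0.0)        # raw returns
centered = gradient_samples(0.0, baseline=10.0)  # state value as baseline
print(statistics.mean(raw), statistics.variance(raw))
print(statistics.mean(centered), statistics.variance(centered))
```

Both estimators have the same expected value, but the raw version swings between large positive and negative samples while the baselined version barely varies. Subtracting the baseline removes the part of the return that is common to both actions and keeps only the difference that matters.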
This is the basic motivation for Actor-Critic:
- the Actor chooses actions;
- the Critic, $V(s)$ or $A(s, a)$, provides a lower-variance learning signal.
REINFORCE uses sampled returns directly. Actor-Critic replaces raw returns with value or advantage estimates. PPO then adds a constraint that prevents each policy update from moving too far.
Relationship to Neighboring Sections
The previous section introduced value-based RL: learn $Q(s, a)$, then act using $\arg\max_a Q(s, a)$. This works naturally when the action space is small and enumerable.
This section introduced policy-based RL: learn $\pi_\theta(a \mid s)$ directly, then optimize expected return $J(\theta)$. This is natural for continuous control, large discrete spaces, and stochastic policies.
The next section asks where the data comes from. Both value updates and policy gradients require trajectories, transitions, or preference data. Whether those data are on-policy, off-policy, online, or offline changes the algorithmic regime.
Summary
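- Value-based methods learn $Q(s, a)$ and choose actions using $\arg\max_a Q(s, a)$.
- Policy-based methods learn $\pi_\theta(a \mid s)$ directly.
- The policy objective is $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$.
- The trajectory probability $p_\theta(\tau)$ separates policy terms from environment terms.
- The log-derivative trick turns $\nabla_\theta J(\theta)$ into an expectation that can be estimated from sampled trajectories.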
- REINFORCE is the most direct policy-gradient algorithm, but it has high variance.
- Baselines, advantages, Actor-Critic, and PPO are progressively more stable ways to use the same objective.
References
[1] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.