3.6 From Value to Policy

What This Section Solves

Core content

  • Limits of value-based RL: $\arg\max_a Q(s,a)$ is not directly solvable in continuous or extremely large action spaces.
  • Parameterized policies: model behavior directly with $\pi_\theta(a \mid s)$ instead of an explicit $Q$ table.
  • Policy objective and gradients: define $J(\theta)$ as the policy's expected return; the policy gradient theorem gives an optimization direction.
  • REINFORCE and variance: update from trajectory returns, but variance is high and baselines are needed for stability.

The previous section focused on value-based reinforcement learning: we store an estimate $Q(s,a)$ for each state-action pair, and derive a policy by

$$a = \arg\max_a Q(s,a).$$

This works well when the action space is small. In CartPole there are only two actions (push left / push right), so comparing $Q$ values is easy.

But the key operation above hides an assumption:

actions must be enumerable.

Once the action space becomes continuous or too large, this assumption breaks. A robot arm might output torques for 6 joints:

$$a=(\tau_1,\tau_2,\ldots,\tau_6)\in\mathbb{R}^6.$$

You can train a network $Q(s,a)$ that scores actions, but the real difficulty comes next: how do you solve

$$\arg\max_a Q(s,a)$$

over infinitely many $a$?

In autonomous driving, throttle and steering are continuous. In language modeling, the action space is a vocabulary with tens of thousands of tokens. In these settings, “score every action then take the maximum” becomes computationally and algorithmically awkward.
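To make the enumeration problem concrete, the sketch below counts how many candidate actions would have to be scored if each of the 6 joint torques were discretized on a uniform grid. The bin counts are illustrative assumptions, not values from any particular robot.

```python
def num_grid_actions(num_joints: int, bins_per_joint: int) -> int:
    """Number of joint-action combinations on a uniform grid."""
    return bins_per_joint ** num_joints


# Finer grids approximate continuous torques better but grow exponentially.
for bins in (3, 10, 100):
    print(f"{bins} bins per joint -> {num_grid_actions(6, bins)} candidate actions")
```

Even a coarse grid of 10 values per joint already yields a million actions to score at every step, which is why "score everything, then take the maximum" stops being practical.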

So if what we ultimately need is an agent that acts, why insist on learning a score table first?

The core idea is:

model the policy directly.

Instead of deriving behavior indirectly from a value function, we let a policy network output an action distribution (or a continuous action) and optimize its parameters using the return signal from the environment. This is the policy-based route.

Core concept

There are two common routes in RL.

Value-based methods learn action values $Q(s,a)$ and choose actions via $\arg\max_a Q(s,a)$ (Q-learning, DQN). They are a natural fit when actions are few and comparable, and they often reuse old data efficiently via replay buffers.

Policy-based methods learn a policy $\pi_\theta(a\mid s)$ directly: the probability of choosing action $a$ in state $s$ (REINFORCE and other policy gradient methods). They optimize the policy's expected return:

$$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G_0].$$

Intuitively, actions that occur in high-return trajectories increase in probability, and actions in low-return trajectories decrease. Because the policy outputs a distribution (or continuous actions) directly, this route is well-suited to robotics, autonomous driving, and LLM generation. Pure policy gradient methods are typically on-policy: each update relies on fresh samples from the current policy. [1][^5][^6]

From Action Scores to a Behavior Distribution

In value-based RL, we estimate $Q(s,a)$ and choose the action with the largest value.

Now consider a different representation. Instead of asking “which action has the highest score?”, we model how the agent behaves: what distribution over actions does it follow in state $s$?

For example, a policy might specify: in state $s$, push left with probability 70% and push right with probability 30%. In environments with more actions, the policy is a full distribution across all choices.

Early in training, this distribution is usually wrong. The agent interacts with the environment using the current distribution, receives returns, and adjusts the distribution:

  • actions that frequently appear in high-return trajectories should become more probable in the corresponding states;
  • actions that frequently appear in low-return trajectories should become less probable.

So the policy route shifts the question: from “which action scores highest” to “how should I act”.

Parameterized Policies

To learn from data, we introduce parameters into the policy:

$$\pi_\theta(a\mid s)=P_\theta(A_t=a\mid S_t=s).$$

Here $\theta$ denotes the learnable parameters (typically neural network weights), $S_t$ is the state, and $A_t$ is the chosen action at time $t$.

For discrete action spaces, the network often outputs a preference score (logit) $z_\theta(s,a)$ for each action. We convert logits into a valid probability distribution using softmax:

$$\pi_\theta(a\mid s) = \frac{\exp(z_\theta(s,a))}{\sum_{a'}\exp(z_\theta(s,a'))}.$$
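As a minimal sketch of the softmax step, the snippet below turns a list of logits into probabilities and samples an action. The logit values and the helper names `softmax` and `sample_action` are illustrative assumptions; in practice the logits would come from the policy network.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]      # hypothetical z_theta(s, a) for three actions
probs = softmax(logits)       # valid distribution: non-negative, sums to 1
rng = random.Random(0)        # seeded for reproducibility
action = sample_action(probs, rng)
print(probs, action)
```

Subtracting the maximum logit does not change the result, because softmax is invariant to adding a constant to every logit; it only avoids overflow in `exp`.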

For continuous control, we cannot enumerate actions, so a common choice is to represent the policy as a continuous distribution, such as a Gaussian:

$$a\sim\mathcal{N}\left(\mu_\theta(s),\sigma_\theta(s)^2\right).$$

The network outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$: the mean encodes the most likely action and the standard deviation controls stochasticity (and thus exploration).
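A Gaussian policy for a single scalar action might be sketched as follows. The values of `mu` and `sigma` stand in for network outputs and are assumptions for illustration; the log-probability is the quantity a policy gradient update would later differentiate.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at the scalar action a."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))


def sample_gaussian_action(mu, sigma, rng):
    """Sample one action and return it together with its log-probability."""
    a = rng.gauss(mu, sigma)
    return a, gaussian_log_prob(a, mu, sigma)


rng = random.Random(0)
mu, sigma = 0.5, 0.2   # hypothetical mu_theta(s) and sigma_theta(s)
action, log_prob = sample_gaussian_action(mu, sigma, rng)
print(action, log_prob)
```

A larger `sigma` spreads samples further from the mean, which corresponds to more exploration; a multi-dimensional action such as the 6 joint torques would typically use one mean and one standard deviation per dimension.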

Note the difference in what $Q$ and $\pi_\theta$ represent:

$$Q(s,a)\quad\text{vs.}\quad \pi_\theta(a\mid s).$$

$Q(s,a)$ evaluates the long-term value of taking action $a$ in state $s$. $\pi_\theta(a\mid s)$ defines the probability of choosing $a$ in $s$.

The Policy Objective $J(\theta)$

Once we have a parameterized policy, the next question is: are these parameters good?

In value-based RL we compare actions inside a single state. Now we zoom out: we want to evaluate the entire policy.

If initial states are drawn from a distribution $\rho_0$, a standard objective is

$$J(\theta) = \mathbb{E}_{s_0\sim\rho_0}\left[V^{\pi_\theta}(s_0)\right].$$

Equivalently, in trajectory form, let

$$\tau=(s_0,a_0,r_0,s_1,a_1,r_1,\ldots)$$

and

$$G_0=\sum_{t=0}^{\infty}\gamma^t r_t.$$

Then

$$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[G_0] = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right].$$

Read it literally:

$$J(\theta)=\text{the expected long-term return achieved by the current policy parameters } \theta.$$
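Because $J(\theta)$ is an expectation over trajectories, it can be estimated by averaging $G_0$ over sampled episodes. The sketch below assumes the reward sequences have already been collected by running the current policy; the sequences shown are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_t gamma^t * r_t for one episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J(theta): the average G_0 over sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


# Hypothetical reward sequences from three rollouts of the same policy.
episodes = [[1.0, 1.0, 1.0], [1.0, 1.0], [1.0]]
j_hat = estimate_objective(episodes, gamma=0.5)
print(j_hat)
```

The estimate is unbiased but noisy: a different batch of rollouts from the same policy would give a different value.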

The learning problem becomes a direct optimization problem:

$$\theta^*=\arg\max_\theta J(\theta).$$

Compare this with value-based selection:

  • $\arg\max_a Q(s,a)$ chooses an action within a single state.
  • $\arg\max_\theta J(\theta)$ chooses policy parameters, and therefore behavior across all states at once.

Trajectory Probability (Why Policy Gradients Are Subtle)

If we want to optimize $J(\theta)$, the natural next step is to take a gradient:

$$\nabla_\theta J(\theta).$$

But unlike supervised learning, policy optimization does not operate on a fixed dataset. When $\theta$ changes, action probabilities change, which changes visited states, which changes the trajectories we sample.

So it helps to write the probability of a trajectory under parameters $\theta$:

$$P_\theta(\tau) = \rho_0(s_0) \prod_t \pi_\theta(a_t\mid s_t)\, P(s_{t+1}\mid s_t,a_t).$$

Reading left to right:

  • $\rho_0(s_0)$ is the probability of the initial state.
  • $\pi_\theta(a_t\mid s_t)$ is the policy's action probability.
  • $P(s_{t+1}\mid s_t,a_t)$ is the environment transition probability.

Only the policy terms $\pi_\theta(a_t\mid s_t)$ depend on $\theta$; the initial-state distribution and environment dynamics are properties of the environment.
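The factorization can be checked on a toy example. The sketch below evaluates $\log P_\theta(\tau)$ for one fixed trajectory under two different policies and keeps the policy part separate from the environment part. The two-state MDP, its probabilities, and the function name are all hypothetical.

```python
import math


def trajectory_log_prob(traj, rho0, policy, dynamics):
    """Split log P_theta(tau) into (policy part, environment part).

    traj is a list of (s, a, s_next) transitions; rho0[s], policy[s][a] and
    dynamics[(s, a)][s_next] are probabilities.
    """
    env_part = math.log(rho0[traj[0][0]])
    policy_part = 0.0
    for s, a, s_next in traj:
        policy_part += math.log(policy[s][a])
        env_part += math.log(dynamics[(s, a)][s_next])
    return policy_part, env_part


# Hypothetical two-state, two-action MDP that always starts in state 0.
rho0 = {0: 1.0}
dynamics = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.5, 1: 0.5},
}
traj = [(0, 1, 1), (1, 0, 0)]

policy_a = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
policy_b = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.7, 1: 0.3}}

pol_a, env_a = trajectory_log_prob(traj, rho0, policy_a, dynamics)
pol_b, env_b = trajectory_log_prob(traj, rho0, policy_b, dynamics)
print(pol_a, pol_b, env_a == env_b)
```

Changing the policy changes only the first component; the environment part is identical for both policies, so it contributes nothing to a gradient with respect to $\theta$.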

The rest of this section uses this structure to derive the basic policy gradient and REINFORCE and to motivate variance reduction (baselines and advantage functions); the next chapter (policy gradients) returns to these topics.

The Log-Derivative Trick

To optimize the policy objective, write it as a sum over trajectories, where $G(\tau)$ denotes the return $G_0$ of trajectory $\tau$:

$$J(\theta) = \sum_\tau P_\theta(\tau)\,G(\tau).$$

Now take a gradient:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P_\theta(\tau)\,G(\tau).$$

Once a trajectory $\tau$ is fixed, its states, actions, and rewards are fixed. The return $G(\tau)$ is treated as a constant with respect to $\theta$. What changes with $\theta$ is the probability of sampling that trajectory.

The expression still contains $\nabla_\theta P_\theta(\tau)$, which is awkward for sampling. We want a form like $P_\theta(\tau)(\cdots)$, because then we can estimate the expectation by drawing trajectories from the current policy.

The key identity is:

$$\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\,\nabla_\theta\log P_\theta(\tau).$$

It follows from

$$\nabla_\theta\log f(\theta) = \frac{\nabla_\theta f(\theta)}{f(\theta)}.$$

Substituting gives

$$\nabla_\theta J(\theta) = \sum_\tau P_\theta(\tau)\,\nabla_\theta\log P_\theta(\tau)\,G(\tau),$$

or equivalently,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\nabla_\theta\log P_\theta(\tau)\,G(\tau)\right].$$

Now use the trajectory probability from the previous section:

$$\log P_\theta(\tau) = \log \rho_0(s_0) + \sum_t\log\pi_\theta(a_t\mid s_t) + \sum_t\log P(s_{t+1}\mid s_t,a_t).$$

Only the policy terms depend on $\theta$. Therefore,

$$\nabla_\theta\log P_\theta(\tau) = \sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t).$$

This yields the basic policy gradient form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G(\tau)\right].$$

The action at time $t$ cannot influence rewards received before it, and those earlier rewards contribute zero to the gradient in expectation. So the action should be credited only for rewards that occur after it, and we replace the full trajectory return with the return-to-go:

$$G_t=\sum_{k=t}^{\infty}\gamma^{k-t}r_k.$$

The more common form is therefore

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_t \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\right].$$

(Strictly, dropping the earlier rewards from $G(\tau)$ leaves $\gamma^t G_t$; the common form above omits the extra $\gamma^t$ factor, as most implementations do.)

The term $\nabla_\theta\log\pi_\theta(a_t\mid s_t)$ points in the direction that makes the sampled action more likely. The scalar $G_t$ decides how strongly to push in that direction.

The plain-language reading is:

If an action was followed by high return, increase its probability in similar states. If it was followed by low return, decrease it.
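The return-to-go for every step of an episode can be computed in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. The sketch below is one way to write it; the function name and the reward values are illustrative.

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, G_1, ...] with G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    out = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each G_t reuses the already-computed G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Later steps have smaller returns-to-go because fewer rewards remain, so actions near the end of an episode receive less credit than early ones.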

A Two-Action Example

Consider a one-state task with two actions, A and B. Let one parameter $\theta$ control the probability of A through the logistic sigmoid $\sigma$:

$$p=\pi_\theta(A)=\sigma(\theta), \qquad \pi_\theta(B)=1-p.$$

For this parameterization,

$$\nabla_\theta\log\pi_\theta(A)=1-p,$$

and

$$\nabla_\theta\log\pi_\theta(B)=-p.$$

Suppose $p=0.7$.

If the agent samples A and later receives $G=10$, the update direction is proportional to

$$(1-p)G=0.3\times 10=3.$$

The gradient is positive, so $\theta$ increases and A becomes more probable.

If the agent samples B and later receives $G=10$, the update direction is proportional to

$$(-p)G=-0.7\times 10=-7.$$

The gradient is negative, so $\theta$ decreases, which lowers the probability of A and raises the probability of B.

So policy gradients do not blindly reinforce every action. They reinforce the action that was actually sampled, and the direction depends on the return that followed it.
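For this two-action task the policy gradient can be checked numerically. The sketch below assumes, for illustration only, that A always yields return 10 and B always yields return 2, so that $J(\theta)=p\,G_A+(1-p)\,G_B$. It compares the expected score-function gradient with a finite-difference derivative of $J$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def objective(theta, g_a, g_b):
    """J(theta) = p * G_A + (1 - p) * G_B with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * g_a + (1.0 - p) * g_b


def expected_score_gradient(theta, g_a, g_b):
    """E[ grad log pi(a) * G ]: weight each action's term by its probability."""
    p = sigmoid(theta)
    return p * (1.0 - p) * g_a + (1.0 - p) * (-p) * g_b


theta = math.log(0.7 / 0.3)   # chosen so that p = 0.7
g_a, g_b = 10.0, 2.0          # hypothetical returns for A and B
eps = 1e-6
finite_diff = (objective(theta + eps, g_a, g_b)
               - objective(theta - eps, g_a, g_b)) / (2.0 * eps)
score_form = expected_score_gradient(theta, g_a, g_b)
print(finite_diff, score_form)
```

Both quantities equal $p(1-p)(G_A-G_B)$, here $0.21\times 8=1.68$. Individual samples push in different directions, but on average the update moves probability toward the action with the higher return.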

CartPole and REINFORCE

In CartPole, the state is a 4D vector and the action space has two choices: push left or push right.

CartPole-v1 example

REINFORCE applies the policy gradient idea directly:

  1. run one full episode using the current policy $\pi_\theta$;
  2. record states, actions, log probabilities, and rewards;
  3. compute return-to-go $G_t$ for every step;
  4. update parameters with

$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t.$$

A minimal PyTorch implementation has the following structure:

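```python
# Assumption: Gymnasium provides the 5-tuple step API used below;
# the original listing does not show its imports.
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNet(nn.Module):
    """Hypothetical policy network (not shown in the original listing):
    a small MLP mapping the 4-D CartPole state to logits over 2 actions."""

    def __init__(self, state_dim=4, hidden_dim=128, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def select_action(self, state):
        logits = self.net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action)


env = gym.make("CartPole-v1")
policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False

    # Steps 1-2: roll out one full episode and record log-probs and rewards.
    while not done:
        action, log_prob = policy.select_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)

    # Step 3: return-to-go, computed backward via G_t = r_t + gamma * G_{t+1}.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # Normalizing returns is a common variance-reduction heuristic;
    # it is not part of the plain update rule.
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Step 4: minimizing -sum(log_prob * G_t) performs gradient ascent on J.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```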

The negative sign appears because PyTorch optimizers minimize losses, while policy gradient is a gradient-ascent method.

Sparse Rewards: Why REINFORCE Can Struggle

CartPole gives a reward at every step, so the policy receives frequent feedback. Many environments are not like this.

MountainCar is a useful contrast: the car starts in a valley and must build momentum to reach the hilltop.

MountainCar-v0 example

The reward is usually $-1$ per step until success, and a random policy almost never reaches the goal, so most sampled trajectories look equally bad. If every episode returns roughly $-200$, REINFORCE has little information about which early actions were useful.

This is the sparse-reward problem. TD methods and Q-learning can sometimes propagate value backward from rare successful states more efficiently, while pure Monte Carlo policy gradients must wait for whole-episode returns.
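A small numeric illustration of why identical returns carry no signal: if each episode is compared with the batch average, episodes that all fail the same way end up with zero weight. The sketch below uses whole-episode returns and a mean baseline as a simplification, and the return values are made up.

```python
def centered_returns(episode_returns):
    """Subtract the batch mean so each episode is compared with the average one."""
    mean = sum(episode_returns) / len(episode_returns)
    return [g - mean for g in episode_returns]


# Hypothetical batches of undiscounted episode returns.
mountaincar_like = [-200.0, -200.0, -200.0, -200.0]   # every rollout fails alike
cartpole_like = [190.0, 40.0, 120.0, 75.0]            # rollouts differ a lot

print(centered_returns(mountaincar_like))  # [0.0, 0.0, 0.0, 0.0]
print(centered_returns(cartpole_like))     # [83.75, -66.25, 13.75, -31.25]
```

With dense rewards, some rollouts are clearly better than others and the update has a direction. With uniformly failing rollouts, every action is weighted equally, so nothing distinguishes a useful early action from a useless one.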

Compared With Q-Learning

The two routes solve different problems well:

| Aspect | Q-Learning | REINFORCE |
| --- | --- | --- |
| Learns | action values $Q(s,a)$ | policy parameters $\theta$ |
| Update timing | every step | after an episode |
| Data style | often off-policy | on-policy |
| Action spaces | best for finite discrete actions | works for discrete or continuous policies |
| Strength | sample reuse and TD propagation | direct policy optimization |
| Weakness | hard $\arg\max$ in continuous spaces | high variance |

They are not enemies. Actor-Critic methods combine them: an Actor learns the policy, while a Critic estimates values to stabilize the policy update.

High Variance and Baselines

Policy gradients can be noisy because returns vary across episodes. In CartPole, one rollout might last 190 steps and another 40 steps under nearly the same policy. Multiplying log-probability gradients by raw returns can make updates unstable.

A more precise question is not:

Was the return positive?

but:

Was this action better than the usual action in this state?

This leads back to value functions. The state value $V^\pi(s)$ can serve as a baseline: the expected return from state $s$ under the current policy. The advantage function is

$$A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s).$$

If $A^\pi(s,a)>0$, action $a$ is better than average in state $s$ and should become more likely. If $A^\pi(s,a)<0$, it is worse than average and should become less likely.
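The effect of a baseline can be computed exactly in a one-state, two-action setting with $\pi(A)=p=\sigma(\theta)$. The sketch below assumes deterministic returns per action (10 for A, 8 for B; both values are hypothetical) and compares the gradient estimator's mean and variance with and without the state value as baseline.

```python
def gradient_stats(p, g_a, g_b, baseline):
    """Exact mean and variance of grad log pi(a) * (G - baseline)
    for a two-action policy with pi(A) = p = sigmoid(theta)."""
    outcomes = [
        (p, (1.0 - p) * (g_a - baseline)),   # action A: grad log pi(A) = 1 - p
        (1.0 - p, -p * (g_b - baseline)),    # action B: grad log pi(B) = -p
    ]
    mean = sum(prob * g for prob, g in outcomes)
    var = sum(prob * (g - mean) ** 2 for prob, g in outcomes)
    return mean, var


p, g_a, g_b = 0.7, 10.0, 8.0        # hypothetical policy and returns
v = p * g_a + (1.0 - p) * g_b       # state value V = 9.4, used as the baseline
mean_raw, var_raw = gradient_stats(p, g_a, g_b, baseline=0.0)
mean_base, var_base = gradient_stats(p, g_a, g_b, baseline=v)
print(mean_raw, var_raw)
print(mean_base, var_base)
```

Both estimators have the same mean (0.42), so the baseline does not bias the gradient, while the variance drops from about 15.5 to about 0.13. With raw returns both actions look good because both returns are positive; subtracting $V$ exposes that A is slightly better than average and B slightly worse.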

This is the basic motivation for Actor-Critic:

  • the Actor $\pi_\theta$ chooses actions;
  • the Critic $V_\phi(s)$ or $Q_\phi(s,a)$ provides a lower-variance learning signal.

REINFORCE uses sampled returns directly. Actor-Critic replaces raw returns with value or advantage estimates. PPO then adds a constraint that prevents each policy update from moving too far.

Relationship to Neighboring Sections

The previous section introduced value-based RL: learn $Q(s,a)$, then act using $\arg\max_a Q(s,a)$. This works naturally when the action space is small and enumerable.

This section introduced policy-based RL: learn $\pi_\theta(a\mid s)$ directly, then optimize expected return $J(\theta)$. This is natural for continuous control, large discrete spaces, and stochastic policies.

The next section asks where the data comes from. Both value updates and policy gradients require trajectories, transitions, or preference data. Whether those data are on-policy, off-policy, online, or offline changes the algorithmic regime.

Summary

  1. Value-based methods learn $Q(s,a)$ and choose actions using $\arg\max_a Q(s,a)$.
  2. Policy-based methods learn $\pi_\theta(a\mid s)$ directly.
  3. The policy objective is $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G_0]$.
  4. The trajectory probability separates policy terms from environment terms.
  5. The log-derivative trick turns $\nabla_\theta J(\theta)$ into an expectation that can be estimated from sampled trajectories.
  6. REINFORCE is the most direct policy-gradient algorithm, but it has high variance.
  7. Baselines, advantages, Actor-Critic, and PPO are progressively more stable ways to use the same objective.

References


  1. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

Modern Reinforcement Learning: A Hands-On Course