3.9 Chapter Summary: MDP, Value, and Policy
Overview
This chapter develops the core language of reinforcement learning around sequential decision-making: how we model an environment, how we define return, how we evaluate states and actions via value functions, how Bellman equations provide the recursive structure, how we estimate values from data (DP/MC/TD), how tabular Q-Learning works, how we optimize parameterized policies, and why reward design matters.
As the closing summary of Chapter 3, this section collects the key formulas from Sections 3.1 to 3.8 and explains where each one sits in the chapter's logical structure.
The main takeaways of this chapter can be summarized in eight points:
- A reinforcement learning problem can be formalized as an MDP five-tuple.
- The agent optimizes discounted cumulative return, not a one-step immediate reward.
- The state-value function and action-value function evaluate long-term return at the level of states and actions.
- Bellman equations reveal the recursive structure of value functions.
- DP, MC, and TD are three fundamental classes of value estimation methods.
- A parameterized policy can be optimized directly through a policy objective.
- Algorithms can be categorized by data source into on-policy/off-policy and online/offline.
- The reward function defines the learning problem itself; reward design shapes the final behavior learned by the agent.
These ideas form the shared theoretical foundation for later topics such as Deep Q-Networks, policy gradients, Actor-Critic methods, PPO, and reinforcement learning methods for large language models.
Index Of Core Formulas
Below we list the core formulas from Sections 3.1 to 3.8 in one place. Each formula is annotated with its name, what it is used for, and where it was explained.
3.1 Two Slot Machines
3.2 MDP
3.3 V(s) And The Bellman Equation
3.4 DP, MC, TD
3.5 Q(s, a)
3.5 Q-Learning
3.6 Policy Objective
3.8 Reward Design
Scalar And Matrix Forms
All formulas in this chapter are presented in a per-state (scalar) form. If we stack all states into vectors and write transitions as matrices, the scalar equations can be compressed into a single line of matrix form.
Notation
To avoid overly long dimension expressions, we write for the number of states and for the number of actions.
| Symbol | Shape | Meaning |
|---|---|---|
| values of all states | ||
| expected immediate reward for each state | ||
| policy-induced transition matrix, | ||
| Q values for all pairs | ||
| transition matrix, | ||
| policy matrix, |
Master Comparison Table
Bellman expectation equation
Per-state form:
Matrix form:
Bellman optimality equation
Per-state form:
Matrix form:
Closed-form solution
Matrix form:
V-Q relationship
Per-state form:
Matrix form:
Bellman expectation equation for Q
Per-state form:
Matrix form:
Bellman optimality equation for Q
Per-state form:
Matrix form:
DP policy evaluation
Per-state form:
Matrix form:
MC and TD update individual states from samples, so they do not have a direct matrix-form counterpart here.
From Q To V
Substitute into , and then left-multiply both sides by :
In the matrix view, the Q-form retains the action dimension (the policy averaging is handled separately by ), while the V-form has already absorbed policy averaging into and . This is exactly the matrix-language expression of the statement that "Q carries finer-grained information than V."
Dependency Structure Of The Formulas
The formulas in this chapter are not isolated; they form a layered sequence of definitions and consequences.
| Layer | Core Question | Key Objects |
|---|---|---|
| Problem modeling | What are the environment, actions, feedback, and discount? | |
| Optimization goal | How do we measure long-term return from a time point? | |
| Behavior rule | How does the agent choose actions in a state? | , , |
| State evaluation | What is a state's long-term value? | |
| Recursive form | How does long-term return decompose into reward and value? | Bellman expectation equation; Bellman optimality equation |
| Learning from data | How do we estimate value when the model is unknown? | DP, MC, TD, |
| Action evaluation | After fixing the first action, what is the long-term value? | , |
| Tabular control | How do we learn Q from samples and derive a policy? | Q-Learning, TD target, -greedy |
| Policy optimization | How do we optimize a parameterized policy directly? | , |
| Objective design | What reward signal is the algorithm maximizing? | , , |
This hierarchy reflects the chapter's logic: we define the environment before defining return, and define return before defining value. Value recursion underlies DP/MC/TD; state and action values support policy improvement; and the reward signal ultimately determines what every optimization objective means.
The Main Thread Of The Chapter
From Return To Bellman Recursion
The most important mathematical structure in Chapter 3 is recursion. Discounted return can be written as an infinite sum:
The same quantity can be equivalently written in a one-step recursive form:
This recursion decomposes long-term return into the current immediate reward and the next-step return. Bellman equations lift this trajectory-level recursion to the expected value .
From State Values To Sample-Based Learning
If the environment model and is known, we can update values directly using the Bellman expectation equation (DP). If the model is unknown, we must estimate values from sampled trajectories:
- MC uses the complete return as its target. It is unbiased, but has high variance.
- TD uses the bootstrapped target . It has lower variance and can be updated online.
- The TD error measures the gap between the current estimate and the one-step Bellman target.
This idea becomes the foundation for later techniques such as critics, DQN targets, and GAE.
From State Values To Action Values
evaluates states, but it does not directly tell us how good each action is at state . To capture long-term return at the action level, Section 3.5 introduces the action-value function:
This definition fixes the first action and evaluates the long-term return obtained by following policy thereafter. As a result, contains more direct information for action selection than . When the optimal action-value function is known, an optimal policy can be induced by .
From Action Values To Q-Learning
Section 3.5 applies the TD idea to a table of action values. Each experience tuple yields a one-step TD target:
and uses it to correct the current estimate . This allows the agent to learn a decision-ready Q-table without a full environment model and without waiting for an episode to end. Tabular Q-Learning fits small, discrete state spaces; once the state space is large or continuous, we need function approximation methods, which we discuss in Chapter 4.
From Policy Representation To Policy Optimization
Section 3.6 provides another perspective on learning: instead of explicitly learning values for each action first, we can represent the policy as a parameterized distribution and maximize
The policy gradient expression shows that the update direction has two components. The term describes how to increase the probability of the chosen action, while serves as the return-weight that tells us how strongly to push in that direction. Chapter 5 will derive and refine this result.
The Reward Function Defines The Objective
Every value function, policy objective, and update rule ultimately depends on the accumulation of rewards. Rewards that are too sparse lead to weak learning signals; poorly designed rewards can cause the agent to optimize something misaligned with the task intention. The reward shaping and intrinsic reward ideas in Section 3.8 aim to strengthen learning signals while keeping the original task objective as unchanged as possible.
Review Questions
After completing this chapter, you should be able to answer the following questions.
- Given a task, how do you write its MDP five-tuple?
Reference Answer
Represent the task as . Here is the set of possible states; is the set of available actions; describes the transition dynamics after taking an action; (or ) defines the immediate reward; and is the discount factor for future rewards. When writing an MDP, you should explain what each component means in the concrete task, rather than only listing symbols.
- Why does RL optimize discounted cumulative return rather than only immediate reward?
Reference Answer
Reinforcement learning studies sequential decision-making. An action not only affects the current reward, but also changes future states, which in turn affects future rewards. Therefore, optimizing immediate rewards alone can lead to short-sighted policies. The discounted cumulative return
unifies present and future rewards into a single long-term objective, and uses to control how important the future is. For continuing tasks, also ensures the return remains finite.
- What do , , , and evaluate, respectively?
Reference Answer
is the discounted cumulative return starting from time along a particular trajectory. is the expectation of when starting from state and following policy , and is used to evaluate state value. is the expectation of when first taking action in state and then following policy , and is used to evaluate action value. is the overall expected return of the parameterized policy , and is used to measure and optimize the policy itself.
- What is the difference between the Bellman expectation equation and the Bellman optimality equation?
Reference Answer
The Bellman expectation equation evaluates a given policy , where action selection is averaged using :
The Bellman optimality equation defines the optimal value. It does not fix any particular policy, but instead takes a maximum over actions:
The former answers "what is the value if we act according to this policy?", while the latter answers "what is the highest value achievable under optimal actions?"
- What are the key differences between DP, MC, and TD?
Reference Answer
DP assumes the environment model is known, and updates values by taking expectations over actions and next states. Its errors mainly come from incomplete convergence of the iterative procedure or function approximation errors. MC does not require an environment model, but it must wait until an episode ends to update using the full return ; it targets the true return, so it is unbiased, but it can have high variance. TD does not require an environment model and does not need to wait for an episode to finish; it can update step-by-step using . Its variance is lower, but because it bootstraps from an estimate , it introduces bias.
- Why does TD error become the shared learning signal in later critics, deep Q-networks, and GAE?
Reference Answer
The TD error
measures the gap between the current value estimate and the one-step Bellman target. Critics can use it to update state-value functions; deep Q-networks use the same bootstrapping idea to construct training targets for Q-functions; and GAE forms advantage estimates by taking weighted sums of TD errors across multiple time steps. Therefore, TD error is the basic mechanism that turns Bellman recursion into a learnable, sample-based training signal.
- Why can directly induce action selection?
Reference Answer
represents the long-term expected return after choosing action in state . If the action values are known, we can directly compare across actions at the same state. When the optimal action values are known, the optimal policy can be written as
So the Q-function not only evaluates actions, but also yields an action selection rule by choosing the action with maximal value.
- Why do parameterized policies need an objective function ?
Reference Answer
For a parameterized policy , the object we learn is the parameter vector . To optimize , we need an objective function with as the variable:
measures the policy's expected long-term return. Policy gradient methods estimate and adjust parameters to increase the probability of actions in high-return trajectories, thereby improving the policy.
- Why can reward shaping accelerate learning, and why can it also cause objective drift?
Reference Answer
Reward shaping accelerates learning by adding intermediate rewards, so the agent receives learning signals even before reaching the final goal, mitigating the difficulty of sparse rewards. For example, giving additional reward when the agent gets closer to the goal can guide exploration. However, if the shaping reward is poorly designed, the agent may optimize the shaping signal rather than the original task objective, causing objective drift. Potential-based shaping
can theoretically preserve the optimal policy, and is therefore a relatively safer form of shaping.
These questions emphasize the conceptual roles behind the formulas. Mastery of this chapter is not just memorizing symbolic forms, but understanding what each object does in a reinforcement learning problem.
How Later Chapters Use These Ideas
| Later Chapter | Objects From This Chapter | How They Are Used |
|---|---|---|
| Chapter 4: Deep Q-Networks | , , , TD target | Approximate action values with neural networks; update Q via bootstrapped targets |
| Chapter 5: Policy Gradient | , , , | Directly optimize a parameterized policy; raise probability of high-return actions |
| Chapter 6: Actor-Critic | , TD error, | Use a value function as a critic to provide low-variance signals for policy updates |
| Chapter 7: PPO | , advantage function, TD error, policy objective | Use a critic to estimate advantages and constrain the update size |
| Chapters 8+ (LLM RL) | policy, reward, return, objective | Treat token generation as sequential decisions; convert preference or verification signals into optimization objectives |
So, the formulas in Chapter 3 are not only for this chapter's exercises. They reappear repeatedly in later algorithms. As representations shift from tables to function approximation, these objects re-emerge as neural networks, loss functions, training targets, advantage estimates, and KL constraints.
Summary
Chapter 3 establishes the basic structure of reinforcement learning theory:
- Define a sequential decision problem with the MDP five-tuple.
- Define the long-term objective using discounted cumulative return .
- Evaluate states and actions via and .
- Reveal the recursive structure of value via Bellman equations.
- Explain how value can be computed or estimated using DP, MC, and TD.
- Use to cast parameterized policy learning as an optimization problem.
- Distinguish algorithm families by how data is collected (on/off-policy, online/offline).
- Use reward design to explain where the objective comes from and why the objective definition itself shapes learning outcomes.
The next chapter starts from and introduces the first complete algorithm family: Chapter 4: Deep Q-Networks.