Chapter 3: MDP, Value Functions, and Policy Optimization
Chapter Overview
Core Content
- Master the unified language of the MDP tuple, discounted cumulative return, value functions, and the Bellman equation.
- Understand how DP, MC, and TD estimate value functions under different assumptions.
- Distinguish the route from the route, and understand how the reward function determines the optimization objective.
Core Formulas
Role of This Chapter's Formulas
Chapter 3 establishes a unified formulation of reinforcement learning through a set of foundational formulas. The MDP tuple is used to characterize the sequential decision-making environment the agent operates in; the discounted cumulative return defines the long-term optimization objective; the state value function and action value function evaluate the long-term return of states and actions; the policy objective formulates the optimization problem for parameterized policies. Subsequent topics — DQN, policy gradient, Actor-Critic, and PPO — all build on these basic objects.
In Chapter 1, we trained an agent to balance a pole on the CartPole inverted pendulum — a typical classical control task where states and actions are low-dimensional vectors and rewards are directly given by physics rules. In Chapter 2, we turned to language model preference alignment, using DPO to teach a large language model to distinguish good responses from bad ones, without needing to manually write a reward function. These two chapters approached from very different scenarios, demonstrating the basic usage of reinforcement learning.
The experience from the first two chapters let us run the code, but left some unresolved questions: What exactly is the reward in CartPole optimizing? Why does the preference learning behind DPO work? To answer these questions, we need to go deeper from "how to use" to "why" — that is the goal of this chapter.
Back to the most fundamental question: What does reinforcement learning study? The answer is sequential decision-making — the agent chooses an action at each step, the environment provides feedback and transitions to the next state, and so on. The key is that the agent pursues not the immediate reward at any single step, but the cumulative return over the entire process. Greedily taking the largest immediate reward is often not the optimal strategy.
To formally describe this process requires a unified formal framework — the Markov Decision Process (MDP). The MDP packages states, actions, transition probabilities, reward functions, and discount factors into a tuple . With it, value functions, the Bellman equation, Q-Learning, policy gradient, PPO — these seemingly diverse algorithms — all share a common language.
After defining the problem, the next step is to define a measure of "goodness." The discounted cumulative return describes the total reward obtainable from a trajectory starting at a given time, while the value function distributes the expectation of this reward to specific states or actions. Building on this, the Bellman equation reveals a key recursive structure: the value of a state equals "the immediate reward for the current step" plus "the discounted value of the next state." This seemingly simple equation is the common starting point for dynamic programming, Monte Carlo methods, and temporal difference methods.
Following the value function further naturally splits into two algorithmic routes:
- Value-based route: Learn , score each action, then choose the action with the highest score — leading to Q-Learning and Deep Q-Networks.
- Policy-based route: Directly define a policy objective and optimize the policy parameters through gradient methods — leading to policy gradient, Actor-Critic, and PPO.
This chapter serves as the theoretical foundation for the entire book. Chapter 4's Deep Q-Networks depend on , Chapter 5's policy gradient depends on , and Chapter 6's Actor-Critic and PPO use both value estimation and policy optimization ideas. After understanding this chapter, many formulas in subsequent algorithms will no longer appear as isolated techniques, but as natural consequences derived from the same decision modeling framework.
Section Outline
| Section | Core Content |
|---|---|
| Two-Armed Bandit | Understand exploration, exploitation, and expected return from the smallest decision problem |
| Markov Decision Process | Define the MDP tuple, discounted return, and policy |
| Value Functions and the Bellman Equation | Introduce the state value function and derive its recursive structure |
| DP, MC, and TD | Compare three value estimation methods by assumptions, data requirements, and update rules |
| From Q to Q-Learning | Use GridWorld to illustrate action value, TD targets, exploration, and tabular boundaries |
| From Value to Policy | Define the objective function from the perspective of directly optimizing the policy |
| Where Does Data Come From | Discuss On-policy vs. Off-policy, Online vs. Offline |
| Reward Function Design | Discuss how rewards express task objectives and the problems that incorrect rewards can cause |
| Chapter Summary | Summarize core formulas, algorithmic routes, and connections to subsequent chapters |
Learning Objectives
After reading this chapter, you should be able to:
- Formally describe a reinforcement learning problem using the MDP tuple ;
- Understand how the Bellman equation writes long-term return as a recursive structure, and grasp the core differences among dynamic programming, Monte Carlo, and temporal difference methods for value estimation;
- Articulate the distinction between two algorithmic routes — the route scores actions and selects the best, while the route directly optimizes policy parameters — leading respectively to Deep Q-Networks and PPO.
It is recommended to complete the multi-armed bandit experiment before entering the formal definition of MDP. This way, you can first observe the exploration-exploitation tradeoff in a sufficiently simple environment, and then elevate the intuition into general mathematical language. The next section starts from the smallest reinforcement learning problem: Hands-on: Exploration and Exploitation — Multi-Armed Bandit.