3.9 Chapter Summary: MDP, Value, and Policy

Overview

This chapter develops the core language of reinforcement learning around sequential decision-making: how we model an environment, how we define return, how we evaluate states and actions via value functions, how Bellman equations provide the recursive structure, how we estimate values from data (DP/MC/TD), how tabular Q-Learning works, how we optimize parameterized policies, and why reward design matters.

As the closing summary of Chapter 3, this section collects the key formulas from Sections 3.1 to 3.8 and explains where each one sits in the chapter's logical structure.

The main takeaways of this chapter can be summarized in eight points:

A reinforcement learning problem can be formalized as an MDP five-tuple.
The agent optimizes discounted cumulative return, not a one-step immediate reward.
The state-value function and action-value function evaluate long-term return at the level of states and actions.
Bellman equations reveal the recursive structure of value functions.
DP, MC, and TD are three fundamental classes of value estimation methods.
A parameterized policy can be optimized directly through a policy objective.
Algorithms can be categorized by data source into on-policy/off-policy and online/offline.
The reward function defines the learning problem itself; reward design shapes the final behavior learned by the agent.

These ideas form the shared theoretical foundation for later topics such as Deep Q-Networks, policy gradients, Actor-Critic methods, PPO, and reinforcement learning methods for large language models.

Index Of Core Formulas

Below we list the core formulas from Sections 3.1 to 3.8 in one place. Each formula is annotated with its name, what it is used for, and where it was explained.

3.1 Two Slot Machines

$E [R_{a}] = p_{a} \cdot (+ 1) + (1 - p_{a}) \cdot (- 1) = 2 p_{a} - 1$ (expected reward of a single arm; role: compare the average payoff of one action; see 3.1)

$E [R_{T}] = E [R_{a_{1}}] + E [R_{a_{2}}] + \dots + E [R_{a_{T}}] = \sum_{t = 1}^{T} E [R_{a_{t}}]$ (expected total return over $T$ rounds; role: measure the cumulative performance of a whole strategy; see 3.1)

$\mathrm{Regret} (T) = T μ^{*} - \sum_{t = 1}^{T} μ_{a_{t}}, \quad μ^{*} = \max_{a} μ_{a}$ (regret; role: quantify how much is lost due to exploration compared with the best arm; see 3.1)
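The three bandit formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-armed machine with win probabilities 0.4 and 0.7 and an arbitrary sequence of pulls; the function names are illustrative and not taken from Section 3.1.

```python
def expected_reward(p):
    """Expected payoff of an arm that pays +1 with probability p and -1 otherwise."""
    return 2 * p - 1


def regret(arm_probs, chosen_arms):
    """Regret of a pull sequence relative to always pulling the best arm."""
    means = [expected_reward(p) for p in arm_probs]
    best_mean = max(means)  # mu* in the regret formula
    return len(chosen_arms) * best_mean - sum(means[a] for a in chosen_arms)


# Assumed example: arm 0 wins 40% of the time, arm 1 wins 70% of the time.
arm_probs = [0.4, 0.7]
pulls = [0, 1, 1, 0, 1]  # an arbitrary strategy over T = 5 rounds

print(expected_reward(arm_probs[1]))
print(regret(arm_probs, pulls))
```

Each pull of the weaker arm adds the gap between the two expected rewards to the regret, so a strategy that always pulls arm 1 has zero regret in this setup.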

3.2 MDP

$M = ⟨ S, A, P, R, γ ⟩$ (MDP five-tuple; role: specify the complete rules of a sequential decision problem; see 3.2)

$P (s^{'} ∣ s, a), \quad R (s, a), \quad γ \in [0, 1]$ (transition, reward, and discount; role: define dynamics, immediate feedback, and the weight on the future; see 3.2)

$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k} = r_{t} + γ G_{t + 1}$ (discounted cumulative return; role: define the long-term objective from time $t$; see 3.2)

$a = π (s), \quad π (a ∣ s) = P (a ∣ s)$ (deterministic and stochastic policies; role: describe how the agent chooses actions; see 3.2)

3.3 V(s) And The Bellman Equation

$V^{π} (s) = E_{π} [\sum_{k = 0}^{\infty} γ^{k} r_{t + k} ∣ s_{t} = s]$ (state-value function; role: evaluate the long-term expected return of a state; see 3.3)

$V^{π} (s) = \sum_{a \in A} π (a ∣ s) [R (s, a) + γ \sum_{s^{'} \in S} P (s^{'} ∣ s, a) V^{π} (s^{'})]$ (Bellman expectation equation; role: recursively compute value under a fixed policy; see 3.3)

$V^{*} (s) = \max_{a} [R (s, a) + γ \sum_{s^{'} \in S} P (s^{'} ∣ s, a) V^{*} (s^{'})]$ (Bellman optimality equation; role: define the optimal state value; see 3.3)

$\text{Target} = r + γ V (s^{'}), \quad δ = \text{Target} - V (s)$ (Bellman target and the prototype of TD error; role: turn Bellman recursion into a sample-based learning signal; see 3.3)

3.4 DP, MC, TD

$V (s) \leftarrow \sum_{a} π (a ∣ s) [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V (s^{'})]$ (DP policy evaluation update; role: iterate values when the model is known; see 3.4)

$π^{'} (s) = \arg\max_{a} [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V^{π} (s^{'})]$ (policy improvement; role: build a better greedy policy from the current value; see 3.4)

$V (s) \leftarrow V (s) + α [G_{t} - V (s)]$ (MC value update; role: correct value estimates using complete returns; see 3.4)

$V (s) \leftarrow V (s) + α [r + γ V (s^{'}) - V (s)]$ (TD(0) value update; role: bootstrap from one-step targets for online updates; see 3.4)

$δ = r + γ V (s^{'}) - V (s)$ (TD error; role: measure how much the current estimate violates the one-step Bellman relation; see 3.4)

3.5 Q(s, a)

$Q^{π} (s, a) = E_{π} [G_{t} ∣ s_{t} = s, a_{t} = a]$ (action-value function; role: evaluate long-term return starting with action $a$ at state $s$; see 3.5)

$V^{π} (s) = \sum_{a} π (a ∣ s) Q^{π} (s, a)$ (V-Q relationship; role: obtain state value as the policy-weighted average of action values; see 3.5)

$Q^{π} (s, a) = R (s, a) + γ \sum_{s^{'} \in S} P (s^{'} ∣ s, a) \sum_{a^{'} \in A} π (a^{'} ∣ s^{'}) Q^{π} (s^{'}, a^{'})$ (Bellman expectation equation for Q; role: recursively compute action values under a fixed policy; see 3.5)

$Q^{*} (s, a) = R (s, a) + γ \sum_{s^{'} \in S} P (s^{'} ∣ s, a) \max_{a^{'}} Q^{*} (s^{'}, a^{'})$ (Bellman optimality equation for Q; role: recursively define the optimal action value; see 3.5)

$π^{*} (s) = \arg\max_{a} Q^{*} (s, a)$ (greedy optimal policy; role: induce an optimal policy from the optimal action-value function; see 3.5)

3.5 Q-Learning

$\text{TD Target} = r + γ \max_{a^{'}} Q (s^{'}, a^{'})$ (Q-Learning TD target; role: construct a one-step learning target for Q from experience; see 3.5)

$δ = r + γ \max_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a)$ (Q-Learning TD error; role: measure the gap between current Q estimate and the TD target; see 3.5)

$Q (s, a) \leftarrow Q (s, a) + α [r + γ \max_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a)]$ (Q-Learning update; role: incrementally correct the state-action value table; see 3.5)

$a_{t} = \begin{cases} \text{random action} & \text{with probability } ε \\ \arg\max_{a} Q (s_{t}, a) & \text{with probability } 1 - ε \end{cases}$ ($ε$-greedy; role: trade off exploration and exploitation; see 3.5)

3.6 Policy Objective

$π_{θ} (a ∣ s) = P_{θ} (a ∣ s)$ (parameterized stochastic policy; role: represent an action distribution with parameters $θ$; see 3.6)

$J (θ) = E_{π_{θ}} [G_{t}] = E_{π_{θ}} [\sum_{t = 0}^{\infty} γ^{t} r_{t}]$ (policy objective; role: measure the expected long-term return of a parameterized policy; see 3.6)

$θ^{*} = \arg\max_{θ} J (θ)$ (optimal policy parameters; role: pose policy learning as a maximization problem; see 3.6)

$\nabla_{θ} J (θ) \propto E_{π_{θ}} [\nabla_{θ} \log π_{θ} (a ∣ s) \cdot G_{t}]$ (policy gradient estimator; role: increase the probability of actions that lead to high return; see 3.6)

3.8 Reward Design

$R (s, a) = \begin{cases} + 1 & \text{reach the goal} \\ - 1 & \text{failure} \\ 0 & \text{otherwise} \end{cases}$ (sparse reward; role: provide learning signal only at success/failure; see 3.8)

$R_{shaping} (s, a, s^{'}) = - (\mathrm{dist} (s^{'}, \mathrm{goal}) - \mathrm{dist} (s, \mathrm{goal}))$ (distance-based shaping; role: provide intermediate rewards from progress toward the goal; see 3.8)

$F (s, a, s^{'}) = γ Φ (s^{'}) - Φ (s)$ (potential-based shaping; role: strengthen intermediate signals without changing the optimal policy; see 3.8)

$r_{t}^{intrinsic} = ∥ f (s_{t}, a_{t}) - s_{t + 1} ∥^{2}$ (prediction-error intrinsic reward; role: encourage exploration where the model predicts poorly; see 3.8)

$r_{t}^{RND} = ∥ \hat{ϕ} (s_{t}) - ϕ (s_{t}) ∥^{2}$ (RND intrinsic reward; role: measure novelty via random network distillation; see 3.8)

$r_{t}^{total} = r_{t}^{extrinsic} + β r_{t}^{intrinsic}$ (total reward combination; role: combine task reward and exploration reward; see 3.8)

Scalar And Matrix Forms

All formulas in this chapter are presented in a per-state (scalar) form. If we stack all states into vectors and write transitions as matrices, the $n$ scalar equations can be compressed into a single line of matrix form.

Notation

To avoid overly long dimension expressions, we write $n = ∣ S ∣$ for the number of states and $n_{A} = ∣ A ∣$ for the number of actions.

Symbol	Shape	Meaning
$v_{π}$	$n \times 1$	values of all states under policy $π$
$r_{π}$	$n \times 1$	expected immediate reward for each state under $π$, $r_{π} [i] = \sum_{a} π (a ∣ s_{i}) R (s_{i}, a)$
$P_{π}$	$n \times n$	policy-induced transition matrix, $P_{π} [i, j] = \sum_{a} π (a ∣ s_{i}) P (s_{j} ∣ s_{i}, a)$
$q_{π}$	$n n_{A} \times 1$	Q values for all $(s, a)$ pairs
$r$	$n n_{A} \times 1$	immediate reward for all $(s, a)$ pairs, $r [(s, a)] = R (s, a)$
$P$	$n n_{A} \times n$	transition matrix, $P [(s, a), s^{'}] = P (s^{'} ∣ s, a)$
$Π_{π}$	$n \times n n_{A}$	policy matrix, $Π_{π} [s, (s, a)] = π (a ∣ s)$ and zero elsewhere

Master Comparison Table

Bellman expectation equation

Per-state form:

$V^{π} (s) = \sum_{a} π (a ∣ s) [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V^{π} (s^{'})]$

Matrix form:

$v_{π} = r_{π} + γ P_{π} v_{π}$

Bellman optimality equation

Per-state form:

$V^{*} (s) = \max_{a} [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V^{*} (s^{'})]$

Matrix form:

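$v_{*} = r_{*} + γ P_{*} v_{*}$ (row-wise max: each row of $r_{*}$ and $P_{*}$ is built from the action that maximizes the right-hand side for that state)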

Closed-form solution

Matrix form:

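$v_{π} = (I - γ P_{π})^{- 1} r_{π}$ (obtained by solving the matrix-form Bellman expectation equation for $v_{π}$; the inverse exists when $γ < 1$)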

V-Q relationship

Per-state form:

$V^{π} (s) = \sum_{a} π (a ∣ s) Q^{π} (s, a)$

Matrix form:

$v_{π} = Π_{π} q_{π}$

Bellman expectation equation for Q

Per-state form:

$Q^{π} (s, a) = R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) \sum_{a^{'}} π (a^{'} ∣ s^{'}) Q^{π} (s^{'}, a^{'})$

Matrix form:

$q_{π} = r + γ P Π_{π} q_{π}$

Bellman optimality equation for Q

Per-state form:

$Q^{*} (s, a) = R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) \max_{a^{'}} Q^{*} (s^{'}, a^{'})$

Matrix form:

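$q_{*} = r + γ P \cdot \mathrm{rowmax} (q_{*})$, where $\mathrm{rowmax} (q_{*})$ is the $n \times 1$ vector whose entry for state $s$ is $\max_{a} q_{*} [(s, a)]$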

DP policy evaluation

Per-state form:

$V (s) \leftarrow \sum_{a} π (a ∣ s) [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V (s^{'})]$

Matrix form:

$v_{k + 1} = r_{π} + γ P_{π} v_{k}$

MC and TD update individual states from samples, so they do not have a direct matrix-form counterpart here.
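To see that the matrix-form iteration and the closed-form solution agree, here is a small sketch in plain Python. The two-state transition matrix, the rewards, and the discount are made-up numbers chosen only for illustration, and the closed form is written out for the two-state case with Cramer's rule rather than a general matrix inverse.

```python
def policy_evaluation(P_pi, r_pi, gamma, iterations=200):
    """Iterate v_{k+1} = r_pi + gamma * P_pi v_k, starting from v_0 = 0."""
    n = len(r_pi)
    v = [0.0] * n
    for _ in range(iterations):
        v = [r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


def closed_form_two_states(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for two states using Cramer's rule."""
    a, b = 1 - gamma * P_pi[0][0], -gamma * P_pi[0][1]
    c, d = -gamma * P_pi[1][0], 1 - gamma * P_pi[1][1]
    det = a * d - b * c
    return [(d * r_pi[0] - b * r_pi[1]) / det,
            (a * r_pi[1] - c * r_pi[0]) / det]


# Assumed policy-induced model (illustrative values only).
P_pi = [[0.9, 0.1],
        [0.2, 0.8]]
r_pi = [1.0, 0.0]
gamma = 0.5

v_iter = policy_evaluation(P_pi, r_pi, gamma)
v_closed = closed_form_two_states(P_pi, r_pi, gamma)
print(v_iter)
print(v_closed)
```

The iteration is the practical choice when $n$ is large, since inverting an $n \times n$ matrix costs far more than a few sweeps; the closed form is mainly useful as a check on small problems.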

From Q To V

Start from $q_{π} = r + γ P Π_{π} q_{π}$ and substitute $v_{π} = Π_{π} q_{π}$ to get $q_{π} = r + γ P v_{π}$ . Then left-multiply both sides by $Π_{π}$ :

$Π_{π} q_{π} = Π_{π} r + γ Π_{π} P v_{π} ⟹ v_{π} = \underbrace{Π_{π} r}_{r_{π}} + γ \underbrace{Π_{π} P}_{P_{π}} v_{π}$

In the matrix view, the Q-form retains the action dimension (the policy averaging is handled separately by $Π_{π}$ ), while the V-form has already absorbed policy averaging into $r_{π}$ and $P_{π}$ . This is exactly the matrix-language expression of the statement that "Q carries finer-grained information than V."
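The absorption of policy averaging into $r_{π}$ and $P_{π}$ can be made concrete with a tiny example. This sketch assumes a hypothetical model with two states and two actions, orders the $(s, a)$ pairs as $(s_{0}, a_{0}), (s_{0}, a_{1}), (s_{1}, a_{0}), (s_{1}, a_{1})$, and uses a hand-written `matmul` helper so that it needs no external library.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


n, n_A = 2, 2
pi = [[0.5, 0.5],   # pi[s][a]: state 0 picks either action with equal probability
      [1.0, 0.0]]   # state 1 always picks action 0

# Assumed reward vector r (n*n_A x 1) and transition matrix P (n*n_A x n).
r = [[1.0], [0.0], [2.0], [0.0]]
P = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5],
     [0.0, 1.0]]

# Policy matrix Pi (n x n*n_A): row s holds pi(a | s) in the columns of state s.
Pi = [[pi[s][col % n_A] if col // n_A == s else 0.0
       for col in range(n * n_A)]
      for s in range(n)]

r_pi = matmul(Pi, r)   # n x 1
P_pi = matmul(Pi, P)   # n x n
print(r_pi)
print(P_pi)
```

Each row of `P_pi` is the average of that state's action rows in `P`, weighted by the policy, which is the same sum that defines $P_{π}$ entry by entry.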

Dependency Structure Of The Formulas

The formulas in this chapter are not isolated; they form a layered sequence of definitions and consequences.

Layer	Core Question	Key Objects
Problem modeling	What are the environment, actions, feedback, and discount?	$M = ⟨ S, A, P, R, γ ⟩$
Optimization goal	How do we measure long-term return from a time point?	$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$
Behavior rule	How does the agent choose actions in a state?	$π (s)$ , $π (a ∣ s)$ , $π_{θ} (a ∣ s)$
State evaluation	What is a state's long-term value?	$V^{π} (s) = E_{π} [G_{t} ∣ s_{t} = s]$
Recursive form	How does long-term return decompose into reward and value?	Bellman expectation equation; Bellman optimality equation
Computing and learning values	How do we compute value with a known model, or estimate it from samples when the model is unknown?	DP, MC, TD, $δ$
Action evaluation	After fixing the first action, what is the long-term value?	$Q^{π} (s, a)$ , $Q^{*} (s, a)$
Tabular control	How do we learn Q from samples and derive a policy?	Q-Learning, TD target, $ε$ -greedy
Policy optimization	How do we optimize a parameterized policy directly?	$J (θ)$ , $\nabla_{θ} J (θ)$
Objective design	What reward signal is the algorithm maximizing?	$R (s, a)$ , $F (s, a, s^{'})$ , $r_{t}^{total}$

This hierarchy reflects the chapter's logic: we define the environment before defining return, and define return before defining value. Value recursion underlies DP/MC/TD; state and action values support policy improvement; and the reward signal ultimately determines what every optimization objective means.

The Main Thread Of The Chapter

From Return To Bellman Recursion

The most important mathematical structure in Chapter 3 is recursion. Discounted return can be written as an infinite sum:

$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$

The same quantity can be equivalently written in a one-step recursive form:

$G_{t} = r_{t} + γ G_{t + 1}$

This recursion decomposes long-term return into the current immediate reward and the next-step return. Bellman equations lift this trajectory-level recursion to the expected value $V^{π} (s)$ .
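The equivalence of the two forms is easy to verify on a finite episode. The sketch below assumes a made-up reward sequence of three steps and treats the return after the final step as zero; the function names are illustrative.

```python
def discounted_return_sum(rewards, gamma):
    """G_0 computed directly as the sum over k of gamma^k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_returns_recursive(rewards, gamma):
    """All G_t computed backwards using G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # the return after the last step of the episode is zero
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


rewards = [1.0, 0.0, 2.0]  # assumed reward sequence
gamma = 0.5

returns = discounted_returns_recursive(rewards, gamma)
print(returns)                                 # [1.5, 1.0, 2.0]
print(discounted_return_sum(rewards, gamma))   # 1.5
```

The backward pass computes every $G_{t}$ in one sweep, which is why implementations typically prefer the recursive form over re-summing from each time step.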

From State Values To Sample-Based Learning

If the environment model ($P$ and $R$) is known, we can update values directly using the Bellman expectation equation (DP). If the model is unknown, we must estimate values from sampled trajectories:

MC uses the complete return $G_{t}$ as its target. It is unbiased, but has high variance.
TD uses the bootstrapped target $r + γ V (s^{'})$ . It has lower variance and can be updated online.
The TD error $δ = r + γ V (s^{'}) - V (s)$ measures the gap between the current estimate and the one-step Bellman target.

This idea becomes the foundation for later techniques such as critics, DQN targets, and GAE.
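The contrast between the MC and TD targets shows up even on a single short episode. The following is a rough sketch, assuming a hypothetical two-step episode A → B → end with a reward of 1 on the last step, and arbitrary choices of $α = 0.5$ and $γ = 1$.

```python
def mc_update(V, state, G, alpha):
    """Monte Carlo: move V(s) toward the complete return G_t."""
    V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    delta = reward + gamma * V[next_state] - V[state]  # TD error
    V[state] += alpha * delta
    return delta


V_mc = {"A": 0.0, "B": 0.0, "end": 0.0}
V_td = dict(V_mc)
episode = [("A", 0.0, "B"), ("B", 1.0, "end")]  # (state, reward, next_state)
alpha, gamma = 0.5, 1.0

# MC waits for the episode to finish, then updates with full returns.
G = 0.0
for state, reward, _ in reversed(episode):
    G = reward + gamma * G
    mc_update(V_mc, state, G, alpha)

# TD updates immediately after every transition.
for state, reward, next_state in episode:
    td0_update(V_td, state, reward, next_state, alpha, gamma)

print(V_mc)  # {'A': 0.5, 'B': 0.5, 'end': 0.0}
print(V_td)  # {'A': 0.0, 'B': 0.5, 'end': 0.0}
```

After one episode MC has already credited state A with the final reward, while TD has only updated B; the information reaches A on a later visit, once $V(B)$ is no longer zero. This is the bootstrapping trade-off in miniature.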

From State Values To Action Values

$V^{π} (s)$ evaluates states, but it does not directly tell us how good each action is at state $s$ . To capture long-term return at the action level, Section 3.5 introduces the action-value function:

$Q^{π} (s, a) = E_{π} [G_{t} ∣ s_{t} = s, a_{t} = a]$

This definition fixes the first action and evaluates the long-term return obtained by following policy $π$ thereafter. As a result, $Q$ contains more direct information for action selection than $V$ . When the optimal action-value function $Q^{*} (s, a)$ is known, an optimal policy can be induced by $\arg\max_{a} Q^{*} (s, a)$ .

From Action Values To Q-Learning

Section 3.5 applies the TD idea to a table of action values. Each experience tuple $(s, a, r, s^{'})$ yields a one-step TD target:

$r + γ \max_{a^{'}} Q (s^{'}, a^{'})$

and uses it to correct the current estimate $Q (s, a)$ . This allows the agent to learn a decision-ready Q-table without a full environment model and without waiting for an episode to end. Tabular Q-Learning fits small, discrete state spaces; once the state space is large or continuous, we need function approximation methods, which we discuss in Chapter 4.
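A compact sketch of tabular Q-Learning with $ε$-greedy exploration follows. The environment is a hypothetical four-state chain invented for this example (move left or right, reward 1 on reaching the rightmost state), and the hyperparameters, step cap, and random tie-breaking are assumptions rather than settings from Section 3.5.

```python
import random

N_STATES, GOAL = 4, 3   # states 0..3; state 3 is terminal
ACTIONS = (0, 1)        # 0 = left, 1 = right


def step(state, action):
    """Deterministic chain: reward 1 only when the goal state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])  # break ties randomly


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so that every episode terminates
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # TD target: no bootstrapping from a terminal state.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

The update uses the maximum over next actions regardless of which action the $ε$-greedy behavior actually takes next, so the table is learned toward the greedy policy even while the agent keeps exploring.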

From Policy Representation To Policy Optimization

Section 3.6 provides another perspective on learning: instead of explicitly learning values for each action first, we can represent the policy as a parameterized distribution $π_{θ} (a ∣ s)$ and maximize

$J (θ) = E_{π_{θ}} [G_{t}]$

The policy gradient expression shows that the update direction has two components. The term $\nabla_{θ} \log π_{θ} (a ∣ s)$ describes how to increase the probability of the chosen action, while $G_{t}$ serves as the return-weight that tells us how strongly to push in that direction. Chapter 5 will derive and refine this result.
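To illustrate the two components, consider a stateless softmax policy over two actions, for which the gradient of the log-probability has the simple form "one-hot of the chosen action minus the action probabilities". This is a minimal sketch under that assumption; the learning rate, the return value, and the function names are illustrative, and the full derivation is left to Chapter 5.

```python
import math


def softmax(theta):
    """Action probabilities of a softmax policy with one logit per action."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(theta, action):
    """Gradient of log pi_theta(action): 1[i == action] - pi_theta(i)."""
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_step(theta, action, G, lr):
    """One sample-based ascent step: theta += lr * G * grad log pi(action)."""
    grad = grad_log_softmax(theta, action)
    return [t + lr * G * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0]  # both actions start equally likely
after_good = policy_gradient_step(theta, action=1, G=2.0, lr=0.1)
after_bad = policy_gradient_step(theta, action=1, G=-2.0, lr=0.1)

print(softmax(after_good))  # probability of action 1 rises above 0.5
print(softmax(after_bad))   # probability of action 1 falls below 0.5
```

The sign and size of $G_{t}$ decide whether the sampled action becomes more or less likely and by how much, while the log-probability gradient only supplies the direction.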

The Reward Function Defines The Objective

Every value function, policy objective, and update rule ultimately depends on the accumulation of rewards. Rewards that are too sparse lead to weak learning signals; poorly designed rewards can cause the agent to optimize something misaligned with the task intention. The reward shaping and intrinsic reward ideas in Section 3.8 aim to strengthen learning signals while keeping the original task objective as unchanged as possible.

Review Questions

After completing this chapter, you should be able to answer the following questions.

Given a task, how do you write its MDP five-tuple?

Reference Answer

Represent the task as $M = ⟨ S, A, P, R, γ ⟩$ . Here $S$ is the set of possible states; $A$ is the set of available actions; $P (s^{'} ∣ s, a)$ describes the transition dynamics after taking an action; $R (s, a)$ (or $R (s, a, s^{'})$ ) defines the immediate reward; and $γ$ is the discount factor for future rewards. When writing an MDP, you should explain what each component means in the concrete task, rather than only listing symbols.
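As one way to practice this, the five components can be written down as a small data structure. The sketch below uses a hypothetical "open the door" task invented for illustration; the class layout, field names, and numbers are assumptions, not a format used elsewhere in the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple
    actions: tuple
    transitions: dict   # P: (s, a) -> {s_next: probability}
    rewards: dict       # R: (s, a) -> expected immediate reward
    gamma: float        # discount factor in [0, 1]

    def validate(self):
        """Check that gamma is in range and every P(. | s, a) sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.actions:
                total = sum(self.transitions[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P(. | {s}, {a}) sums to {total}")
        return True


# Hypothetical task: pushing the door succeeds 80% of the time.
door_task = MDP(
    states=("outside", "inside"),
    actions=("wait", "push"),
    transitions={
        ("outside", "wait"): {"outside": 1.0},
        ("outside", "push"): {"inside": 0.8, "outside": 0.2},
        ("inside", "wait"): {"inside": 1.0},
        ("inside", "push"): {"inside": 1.0},
    },
    rewards={
        ("outside", "wait"): 0.0,
        ("outside", "push"): 0.8,  # expected value of +1 on success
        ("inside", "wait"): 0.0,
        ("inside", "push"): 0.0,
    },
    gamma=0.9,
)

print(door_task.validate())  # True
```

Writing the task this way forces each symbol to be tied to something concrete: what the states are, which actions exist, and what happens when an action is taken.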

Why does RL optimize discounted cumulative return rather than only immediate reward?

Reference Answer

Reinforcement learning studies sequential decision-making. An action not only affects the current reward, but also changes future states, which in turn affects future rewards. Therefore, optimizing immediate rewards alone can lead to short-sighted policies. The discounted cumulative return

$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$

unifies present and future rewards into a single long-term objective, and uses $γ$ to control how important the future is. For continuing tasks, $γ < 1$ also ensures the return remains finite.

What do $G_{t}$ , $V^{π} (s)$ , $Q^{π} (s, a)$ , and $J (θ)$ evaluate, respectively?

Reference Answer

$G_{t}$ is the discounted cumulative return starting from time $t$ along a particular trajectory. $V^{π} (s)$ is the expectation of $G_{t}$ when starting from state $s$ and following policy $π$ , and is used to evaluate state value. $Q^{π} (s, a)$ is the expectation of $G_{t}$ when first taking action $a$ in state $s$ and then following policy $π$ , and is used to evaluate action value. $J (θ)$ is the overall expected return of the parameterized policy $π_{θ}$ , and is used to measure and optimize the policy itself.

What is the difference between the Bellman expectation equation and the Bellman optimality equation?

Reference Answer

The Bellman expectation equation evaluates a given policy $π$ , where action selection is averaged using $π (a ∣ s)$ :

$V^{π} (s) = \sum_{a} π (a ∣ s) [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V^{π} (s^{'})] .$

The Bellman optimality equation defines the optimal value. It does not fix any particular policy, but instead takes a maximum over actions:

$V^{*} (s) = \max_{a} [R (s, a) + γ \sum_{s^{'}} P (s^{'} ∣ s, a) V^{*} (s^{'})] .$

The former answers "what is the value if we act according to this policy?", while the latter answers "what is the highest value achievable under optimal actions?"
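The difference between averaging and maximizing can be seen by running both backups on the same model. This sketch assumes a hypothetical two-state deterministic MDP in which staying in `s1` pays 1 per step, and compares a uniformly random policy with the optimal values; all names and numbers are illustrative.

```python
GAMMA = 0.9
STATES = ("s0", "s1")
ACTIONS = ("stay", "move")

# Assumed model: P[(s, a)] maps next states to probabilities, R[(s, a)] is the reward.
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}


def backup(V, s, a):
    """One-step lookahead: R(s, a) + gamma * sum over s' of P(s' | s, a) V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def evaluate_policy(pi, sweeps=500):
    """Bellman expectation backup: average the lookahead with pi(a | s)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: sum(pi[s][a] * backup(V, s, a) for a in ACTIONS) for s in STATES}
    return V


def value_iteration(sweeps=500):
    """Bellman optimality backup: take the maximum lookahead over actions."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}
    return V


uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
V_pi = evaluate_policy(uniform)
V_star = value_iteration()
print(V_pi)
print(V_star)
```

In this example the random policy's values come out near 2.25 and 2.75, while the optimal values are near 9 and 10: the only change between the two functions is replacing the policy-weighted sum with a `max`.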

What are the key differences between DP, MC, and TD?

Reference Answer

DP assumes the environment model is known, and updates values by taking expectations over actions and next states. Its errors mainly come from incomplete convergence of the iterative procedure or function approximation errors. MC does not require an environment model, but it must wait until an episode ends to update using the full return $G_{t}$ ; it targets the true return, so it is unbiased, but it can have high variance. TD does not require an environment model and does not need to wait for an episode to finish; it can update step-by-step using $r + γ V (s^{'})$ . Its variance is lower, but because it bootstraps from an estimate $V (s^{'})$ , it introduces bias.

Why does TD error become the shared learning signal in later critics, deep Q-networks, and GAE?

Reference Answer

The TD error

$δ = r + γ V (s^{'}) - V (s)$

measures the gap between the current value estimate and the one-step Bellman target. Critics can use it to update state-value functions; deep Q-networks use the same bootstrapping idea to construct training targets for Q-functions; and GAE forms advantage estimates by taking weighted sums of TD errors across multiple time steps. Therefore, TD error is the basic mechanism that turns Bellman recursion into a learnable, sample-based training signal.
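As a preview of how TD errors are combined, the sketch below computes advantage estimates as exponentially weighted sums of TD errors, which is the commonly used form of GAE. The weighting parameter `lam` (usually written $λ$) and the function names are assumptions introduced here; later chapters define the method properly.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has one more entry than `rewards`: the value of the final state.
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


def gae(rewards, values, gamma, lam):
    """A_t = sum over l of (gamma * lam)^l * delta_{t+l}, computed backwards."""
    deltas = td_errors(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]        # assumed two-step trajectory
values = [0.0, 0.0, 0.0]    # assumed value estimates, all zero

print(gae(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 1.0]: one-step TD errors only
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [2.0, 1.0]: TD errors summed to the end
```

With `lam = 0` each advantage is a single TD error, and with `lam = 1` the TD errors telescope into a full-return estimate, so the parameter interpolates between the TD and MC ends of the bias-variance trade-off.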

Why can $Q (s, a)$ directly induce action selection?

Reference Answer

$Q (s, a)$ represents the long-term expected return after choosing action $a$ in state $s$ . If the action values are known, we can directly compare $Q$ across actions at the same state. When the optimal action values $Q^{*} (s, a)$ are known, the optimal policy can be written as

$π^{*} (s) = \arg\max_{a} Q^{*} (s, a) .$

So the Q-function not only evaluates actions, but also yields an action selection rule by choosing the action with maximal value.

Why do parameterized policies need an objective function $J (θ)$ ?

Reference Answer

For a parameterized policy $π_{θ} (a ∣ s)$ , the object we learn is the parameter vector $θ$ . To optimize $θ$ , we need an objective function with $θ$ as the variable:

$J (θ) = E_{π_{θ}} [G_{t}] .$

$J (θ)$ measures the policy's expected long-term return. Policy gradient methods estimate $\nabla_{θ} J (θ)$ and adjust parameters to increase the probability of actions in high-return trajectories, thereby improving the policy.

Why can reward shaping accelerate learning, and why can it also cause objective drift?

Reference Answer

Reward shaping accelerates learning by adding intermediate rewards, so the agent receives learning signals even before reaching the final goal, mitigating the difficulty of sparse rewards. For example, giving additional reward when the agent gets closer to the goal can guide exploration. However, if the shaping reward is poorly designed, the agent may optimize the shaping signal rather than the original task objective, causing objective drift. Potential-based shaping

$F (s, a, s^{'}) = γ Φ (s^{'}) - Φ (s)$

can theoretically preserve the optimal policy, and is therefore a comparatively safe form of shaping.
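One way to see why this form is safe is that the discounted shaping terms telescope along any trajectory: their sum is $γ^{T} Φ (s_{T}) - Φ (s_{0})$, which depends only on the endpoints and not on the path taken. The sketch below checks this numerically, assuming a hypothetical one-dimensional task with the goal at position 5 and a potential equal to the negative distance to the goal.

```python
GOAL = 5


def potential(state):
    """Assumed potential: negative distance to the goal (zero at the goal)."""
    return -abs(GOAL - state)


def shaping_bonus(s, s_next, gamma):
    """F(s, a, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(s_next) - potential(s)


def discounted_shaping_total(states, gamma):
    """Discounted sum of shaping bonuses along a trajectory of visited states."""
    return sum(gamma ** t * shaping_bonus(states[t], states[t + 1], gamma)
               for t in range(len(states) - 1))


gamma = 0.9
direct_path = [0, 1, 2, 3, 4, 5]
wandering_path = [0, 1, 0, 1, 2, 3, 4, 5]  # takes a detour before reaching the goal

print(discounted_shaping_total(direct_path, gamma))
print(discounted_shaping_total(wandering_path, gamma))
```

Both totals equal $- Φ (s_{0}) = 5$ up to floating-point error, because $Φ$ is zero at the goal. The agent cannot collect extra shaping reward by wandering, which is the failure mode that naive distance bonuses can create.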

These questions emphasize the conceptual roles behind the formulas. Mastery of this chapter is not just memorizing symbolic forms, but understanding what each object does in a reinforcement learning problem.

How Later Chapters Use These Ideas

Later Chapter	Objects From This Chapter	How They Are Used
Chapter 4: Deep Q-Networks	$Q (s, a)$ , $Q^{*} (s, a)$ , $\arg\max_{a} Q (s, a)$ , TD target	Approximate action values with neural networks; update Q via bootstrapped targets
Chapter 5: Policy Gradient	$π_{θ} (a ∣ s)$ , $J (θ)$ , $\nabla_{θ} J (θ)$ , $G_{t}$	Directly optimize a parameterized policy; raise probability of high-return actions
Chapter 6: Actor-Critic	$V (s)$ , TD error, $J (θ)$	Use a value function as a critic to provide low-variance signals for policy updates
Chapter 7: PPO	$V (s)$ , advantage function, TD error, policy objective	Use a critic to estimate advantages and constrain the update size
Chapters 8+ (LLM RL)	policy, reward, return, objective	Treat token generation as sequential decisions; convert preference or verification signals into optimization objectives

So, the formulas in Chapter 3 are not only for this chapter's exercises. They reappear repeatedly in later algorithms. As representations shift from tables to function approximation, these objects re-emerge as neural networks, loss functions, training targets, advantage estimates, and KL constraints.

Summary

Chapter 3 establishes the basic structure of reinforcement learning theory:

Define a sequential decision problem with the MDP five-tuple.
Define the long-term objective using discounted cumulative return $G_{t}$ .
Evaluate states and actions via $V^{π} (s)$ and $Q^{π} (s, a)$ .
Reveal the recursive structure of value via Bellman equations.
Explain how value can be computed or estimated using DP, MC, and TD.
Use $J (θ)$ to cast parameterized policy learning as an optimization problem.
Distinguish algorithm families by how data is collected (on/off-policy, online/offline).
Use reward design to explain where the objective comes from and why the objective definition itself shapes learning outcomes.

The next chapter, Chapter 4: Deep Q-Networks, starts from $Q (s, a)$ and introduces the first complete algorithm family.
