4.4 DQN Improvement Family
In the previous section, we trained DQN on LunarLander-v3 and saw a complete learning curve.
That experiment makes two points clear:
- replay buffers and target networks do make neural-network Q-Learning trainable
- even on a low-dimensional control task, curves can be noisy and failures still happen
This is not an accident. Classic DQN combines Q-Learning, neural networks, replay, and target networks, but it still has structural issues:
- the max operator in the TD target tends to select overestimated actions
- outputting $Q(s, a)$ directly for every action can make it harder to exploit information that depends only on the state
- uniform sampling wastes updates on uninformative transitions
This section introduces common improvements that followed DQN. These methods do not change the goal of learning an action-value function. Instead, each one modifies one of:
- the TD target
- the network structure
- replay sampling
- the exploration mechanism
Understanding them helps you diagnose an unstable DQN run: is the issue value overestimation, function structure, data utilization, or exploration?
Double DQN
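Recall the vanilla DQN TD target for a transition $(s, a, r, s')$, where $Q_{\theta^-}$ is the target network (for a terminal $s'$ the bootstrap term is dropped):
$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$
The $\max$ plays two roles at once: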
- select the best next action
- evaluate that action using the same noisy estimates
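If each action-value estimate has noise, the max operator is more likely to pick an action whose estimate has been pushed up by noise. Even if each estimate $\hat{Q}(s, a)$ is unbiased on average, the max of noisy estimates becomes positively biased:
$$\mathbb{E}\left[\max_a \hat{Q}(s, a)\right] \ge \max_a \mathbb{E}\left[\hat{Q}(s, a)\right] = \max_a Q(s, a)$$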
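The bias can be checked with a small simulation. The sketch below is illustrative only: it assumes every true action value is exactly 0 and that each estimate carries independent Gaussian noise, and the helper name `mean_of_max` is hypothetical.

```python
import random

def mean_of_max(num_actions, noise_std, trials, seed=0):
    """Average of max_a (Q_true(a) + noise), assuming Q_true(a) = 0 for all a."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is unbiased: its noise has zero mean.
        noisy_estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_estimates)
    return total / trials

# With one action there is nothing to maximize over, so no bias appears.
bias_1 = mean_of_max(num_actions=1, noise_std=1.0, trials=20000)
# With four actions the max systematically picks up positive noise.
bias_4 = mean_of_max(num_actions=4, noise_std=1.0, trials=20000)
print(f"1 action: {bias_1:+.3f}   4 actions: {bias_4:+.3f}")
```

The true maximum is 0 in both cases, so any positive average in the four-action case is pure overestimation. The gap grows with the number of actions and with the noise level.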
Double DQN separates "selection" from "evaluation".
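First, pick the best action using the current (online) network $Q_\theta$:
$$a^* = \arg\max_{a'} Q_\theta(s', a')$$
Then evaluate that action using the target network $Q_{\theta^-}$:
$$y = r + \gamma \, Q_{\theta^-}(s', a^*)$$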
In code, this change is small:
    with torch.no_grad():
        # Select the greedy next action with the online network
        best_actions = q_net(next_states).argmax(dim=1)
        # Evaluate the selected action with the target network
        next_q = target_net(next_states)
        next_q_selected = next_q.gather(1, best_actions[:, None]).squeeze(1)
        # Assumes dones is a float tensor: 1.0 for terminal transitions, else 0.0
        target = rewards + gamma * (1 - dones) * next_q_selected

Double DQN matters because it reduces a common bias without changing the basic algorithm structure. In practice, many implementations treat Double DQN as the default DQN variant.
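To see the difference on concrete numbers, here is a minimal single-transition sketch in plain Python. The Q-value lists are made up for illustration, and the function names `vanilla_target` and `double_target` are hypothetical, not taken from any library.

```python
def vanilla_target(reward, gamma, done, target_q_next):
    """Vanilla DQN: the target network both selects and evaluates."""
    return reward + gamma * (1.0 - done) * max(target_q_next)

def double_target(reward, gamma, done, online_q_next, target_q_next):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = max(range(len(online_q_next)), key=online_q_next.__getitem__)
    return reward + gamma * (1.0 - done) * target_q_next[best_action]

# Made-up next-state Q-values for three actions.
online_q_next = [1.0, 2.5, 2.0]   # online network prefers action 1
target_q_next = [1.2, 1.8, 3.0]   # target network's largest value is on action 2

y_vanilla = vanilla_target(1.0, 0.99, 0.0, target_q_next)                # about 3.97
y_double = double_target(1.0, 0.99, 0.0, online_q_next, target_q_next)   # about 2.782
print(y_vanilla, y_double)
```

Because the Double DQN target evaluates one particular action with the target network, it can never exceed the vanilla target, which takes the maximum over all of them.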
Dueling DQN
Vanilla DQN predicts action values directly. This is natural when the action set is small, but it does not explicitly separate two questions:
- is the state itself good or bad?
- how do actions differ relative to that state?
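Recall the relationship among state value, action value, and advantage:
$$Q(s, a) = V(s) + A(s, a)$$
This decomposition is not unique: adding a constant to $V(s)$ and subtracting it from all $A(s, a)$ leaves $Q(s, a)$ unchanged. Dueling DQN fixes this by forcing the mean advantage to be zero:
$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$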
Architecturally, the network shares a feature extractor, then splits into two heads:
- a value head producing the scalar $V(s)$
- an advantage head producing the vector $A(s, \cdot)$, one entry per action
Then it recombines them:
    features = backbone(states)
    values = value_head(features)          # [B, 1]
    advantages = advantage_head(features)  # [B, A]
    # Subtract the per-state mean advantage so V and A are identifiable
    q_values = values + advantages - advantages.mean(dim=1, keepdim=True)

This structure is especially useful when many states have actions with similar short-term outcomes. For instance, far from the ground in LunarLander, the state (height, speed, angle) already suggests whether the situation is safe or dangerous, even if multiple actions look similar. The dueling architecture can learn that state value earlier, because the value head is updated by every sample regardless of which action was taken.
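The effect of mean-centering can be checked on a single state. In this sketch (plain Python, made-up numbers, hypothetical helper name `dueling_q`), shifting every advantage by the same constant leaves the resulting Q-values unchanged, which is exactly the ambiguity the constraint removes.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value with per-action advantages (mean-centered)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

# Same state value, advantages differing only by a constant offset of 10.
q_a = dueling_q(2.0, [0.5, -0.5, 0.0])
q_b = dueling_q(2.0, [10.5, 9.5, 10.0])
print(q_a)
print(q_b)

# After centering, the average Q-value over actions equals the state value.
mean_q = sum(q_a) / len(q_a)
print(mean_q)
```

One consequence of this choice is that $V(s)$ is pinned to the mean of $Q(s, \cdot)$ over actions rather than to its maximum.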
Prioritized Experience Replay (PER)
Standard replay samples uniformly from the buffer. If the buffer has $N$ transitions, each has probability $1/N$.
Uniform sampling is simple and breaks correlation, but it ignores the fact that some transitions teach more than others.
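A natural signal is the TD error of transition $i$:
$$\delta_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s'_i, a') - Q_\theta(s_i, a_i)$$
If $|\delta_i|$ is large, the model is currently wrong on that transition. PER assigns priority:
$$p_i = |\delta_i| + \epsilon$$
where $\epsilon > 0$ is a small constant that prevents zero probability. Sampling probability becomes:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
$\alpha \ge 0$ controls how strongly we prioritize. $\alpha = 0$ reduces back to uniform sampling. $\alpha = 1$ makes sampling proportional to priority.
Non-uniform sampling changes the training distribution. PER therefore uses importance sampling weights:
$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$
where $N$ is the buffer size and $\beta \in [0, 1]$ sets how fully the sampling bias is corrected ($\beta = 1$ is full correction). The weights are often normalized by the maximum weight in the batch for numerical stability. Training uses a weighted loss over a batch of size $B$:
$$L(\theta) = \frac{1}{B} \sum_{i=1}^{B} w_i \, \delta_i^2$$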
PER does not change the Bellman target; it changes which experiences receive more update budget.
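The sampling and reweighting steps can be sketched without any tensor library. The code below is a simplified illustration, not an efficient buffer: real implementations typically use a sum-tree instead of recomputing all probabilities, and the values `alpha=0.6`, `beta=0.4`, and `eps=1e-3` are illustrative assumptions rather than recommendations from this section.

```python
import random

def per_probabilities(td_errors, alpha=0.6, eps=1e-3):
    """Sampling probability of each transition from its TD error."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, indices, beta=0.4):
    """Importance sampling weights for the sampled indices, max-normalized."""
    n = len(probs)
    raw = [(1.0 / (n * probs[i])) ** beta for i in indices]
    max_w = max(raw)
    return [w / max_w for w in raw]

td_errors = [0.1, 2.0, 0.5, 0.0]          # made-up TD errors for 4 transitions
probs = per_probabilities(td_errors)       # transition 1 gets the most mass
uniform = per_probabilities(td_errors, alpha=0.0)  # alpha = 0: uniform sampling

rng = random.Random(0)
batch = rng.choices(range(len(td_errors)), weights=probs, k=3)
weights = importance_weights(probs, batch)
print(probs, batch, weights)
```

Frequently sampled transitions receive smaller weights, which partially undoes the extra gradient they would otherwise contribute.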
N-step returns, Distributional RL, and Noisy Networks
The improvements above focus on value estimation and data usage. Three more commonly combined ideas address reward propagation, value representation, and exploration.
N-step returns
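Vanilla DQN uses a one-step TD target:
$$y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')$$
One-step targets have lower variance but slower reward propagation. With an $n$-step return, the target sums $n$ observed rewards before bootstrapping:
$$y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q_{\theta^-}(s_{t+n}, a')$$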
This can speed up learning in sparse/delayed reward tasks, at the cost of higher target variance.
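As a worked example, the following sketch computes an n-step target from a short list of rewards. The function name `n_step_target` and the numbers are hypothetical; `bootstrap_value` stands for the target network's maximum Q-value at the state reached after the last reward, and `done` marks an episode that ended within those steps.

```python
def n_step_target(rewards, gamma, bootstrap_value, done):
    """Discounted sum of n rewards plus a discounted bootstrap value.

    rewards: [r_t, ..., r_{t+n-1}]
    bootstrap_value: max_a' Q_target(s_{t+n}, a'), ignored if done is True
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_value
    return target

# n = 1 recovers the ordinary one-step target: 1 + 0.9 * 10
y1 = n_step_target([1.0], 0.9, 10.0, done=False)
# n = 3: 1 + 0.9 * 0 + 0.81 * 2 + 0.729 * 10
y3 = n_step_target([1.0, 0.0, 2.0], 0.9, 10.0, done=False)
# If the episode ends inside the window, there is no bootstrap term.
y3_done = n_step_target([1.0, 0.0, 2.0], 0.9, 10.0, done=True)
print(y1, y3, y3_done)
```

With larger $n$, more of the target comes from sampled rewards and less from the bootstrap estimate, which is where the extra variance comes from.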
Distributional RL
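Standard DQN learns the expected return $Q(s, a) = \mathbb{E}[Z(s, a)]$, where $Z(s, a)$ is the random return. But returns can be stochastic: the same action can lead to different outcomes.
Distributional RL models the full distribution of $Z(s, a)$ and applies a distributional Bellman equation, where equality holds in distribution:
$$Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(S', A')$$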
In practice (e.g. C51), the distribution is represented on discrete support points and the network outputs probabilities.
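The categorical representation can be illustrated with a few lines of plain Python. This sketch uses only 5 support points on $[-10, 10]$ for readability (an assumption for the example; C51 takes its name from using 51), and it shows only how an expected value is read off a predicted distribution, not the projection step used during training.

```python
def make_support(v_min, v_max, num_atoms):
    """Evenly spaced return values z_0, ..., z_{K-1}."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]

def expected_q(probs, support):
    """Q(s, a) as the mean of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, support))

support = make_support(-10.0, 10.0, 5)      # [-10, -5, 0, 5, 10]

# Two made-up action distributions with the same mean but different risk.
dist_certain = [0.0, 0.0, 1.0, 0.0, 0.0]    # always returns 0
dist_risky = [0.5, 0.0, 0.0, 0.0, 0.5]      # returns -10 or +10

q_certain = expected_q(dist_certain, support)
q_risky = expected_q(dist_risky, support)
print(support, q_certain, q_risky)
```

A scalar Q-value cannot distinguish these two actions, while the distributions differ clearly. Action selection in this style of method still typically uses the expectation.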
Noisy Networks
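Epsilon-greedy exploration injects random actions, but the randomness is not state-structured. NoisyNet injects noise into network parameters, for example by replacing a linear layer $y = Wx + b$ with:
$$y = (\mu^w + \sigma^w \odot \varepsilon^w)\, x + \mu^b + \sigma^b \odot \varepsilon^b$$
where $\mu$ and $\sigma$ are learned parameters, $\varepsilon$ is sampled noise, and $\odot$ is element-wise multiplication.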
Because a sampled noise realization affects many states consistently, the exploration can be more coherent than per-step random actions.
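A noisy linear layer can be sketched in plain Python to make the "consistent noise" point concrete. This is a simplified illustration with independent Gaussian noise per parameter; the class name `NoisyLinear`, the initialization ranges, and `sigma_init=0.5` are assumptions for the example, and in a real network $\mu$ and $\sigma$ would be trained by gradient descent rather than left fixed.

```python
import random

class NoisyLinear:
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""

    def __init__(self, in_dim, out_dim, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        self.mu_w = [[self.rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                     for _ in range(out_dim)]
        self.mu_b = [0.0] * out_dim
        self.sigma_w = [[sigma_init] * in_dim for _ in range(out_dim)]
        self.sigma_b = [sigma_init] * out_dim
        self.reset_noise()

    def reset_noise(self):
        """Draw a fresh noise sample; it stays fixed until the next reset."""
        self.eps_w = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.mu_w]
        self.eps_b = [self.rng.gauss(0.0, 1.0) for _ in self.mu_b]

    def forward(self, x):
        out = []
        for j in range(len(self.mu_b)):
            w = [m + s * e for m, s, e in
                 zip(self.mu_w[j], self.sigma_w[j], self.eps_w[j])]
            b = self.mu_b[j] + self.sigma_b[j] * self.eps_b[j]
            out.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return out

layer = NoisyLinear(in_dim=3, out_dim=2)
x = [1.0, 0.5, -0.2]
y_first = layer.forward(x)
y_same_noise = layer.forward(x)   # same noise sample: identical output
layer.reset_noise()
y_new_noise = layer.forward(x)    # new noise sample: different output
print(y_first, y_same_noise, y_new_noise)
```

Between resets the layer behaves like a deterministic but perturbed network, so the induced policy changes consistently across states instead of flipping at random on each step.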
Rainbow
Rainbow combines multiple DQN improvements into a single algorithm, typically including:
- Double DQN
- Dueling network
- Prioritized replay
- N-step returns
- Distributional RL
- Noisy networks
The value of the combination is that different components address different error sources: overestimation, function structure, sample efficiency, reward propagation, uncertainty representation, and exploration. In visual Atari tasks, these issues often co-occur, so combinations can outperform isolated improvements.
But combinations also bring more hyperparameters and more implementation complexity. For learning and debugging, it is still better to understand each component first, then decide whether the full combination is justified.
Exploration and Intrinsic Rewards
All improvements above assume the agent visits useful states. But if the agent never reaches critical states, no value estimator can learn from data it never sees. That is the exploration problem.
Epsilon-greedy is the simplest approach:
- with probability $\epsilon$, take a random action
- otherwise, take the greedy action
It is simple, but blind: it does not know which states are novel or informative.
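The rule described above fits in a few lines. This is a minimal sketch with a hypothetical function name `epsilon_greedy`; it takes the Q-values of the current state as a plain list and an explicit random generator so that runs are reproducible.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q_values = [0.2, 1.5, -0.3]   # made-up Q-values for three actions

greedy_actions = [epsilon_greedy(q_values, 0.0, rng) for _ in range(20)]
random_actions = [epsilon_greedy(q_values, 1.0, rng) for _ in range(200)]
print(set(greedy_actions), set(random_actions))
```

Note that the random branch ignores both the state and the Q-values, which is the sense in which this exploration is blind.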
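One idea is to add an intrinsic reward to the extrinsic environment reward:
$$r_t = r_t^{\text{ext}} + \eta \, r_t^{\text{int}}$$
where $\eta \ge 0$ scales the intrinsic term.
In the Intrinsic Curiosity Module (ICM), the intrinsic reward is based on prediction error. Let $\phi(s)$ be a learned feature representation. A forward model $f$ predicts the next features from $\phi(s_t)$ and action $a_t$, and the intrinsic reward can be:
$$r_t^{\text{int}} = \left\| f(\phi(s_t), a_t) - \phi(s_{t+1}) \right\|_2^2$$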
Hard-to-predict transitions yield higher intrinsic reward and attract exploration.
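Random Network Distillation (RND) is simpler: keep a fixed, randomly initialized target network $f$ and train a predictor $\hat{f}$ to match its outputs. Intrinsic reward is the predictor's error on the next state:
$$r_t^{\text{int}} = \left\| \hat{f}(s_{t+1}) - f(s_{t+1}) \right\|_2^2$$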
Frequently visited states become predictable (low error), while new states remain surprising (high error).
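The RND mechanism can be imitated at toy scale. The sketch below replaces both networks with small linear maps and uses one-hot states, which is a strong simplification chosen so the effect is easy to see; all names (`make_matrix`, `rnd_bonus`, `train_predictor`) and the learning rate are hypothetical.

```python
import random

def make_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def apply(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def rnd_bonus(predictor, target, state):
    """Intrinsic reward: squared error between predictor and fixed target."""
    pred, tgt = apply(predictor, state), apply(target, state)
    return sum((p - t) ** 2 for p, t in zip(pred, tgt))

def train_predictor(predictor, target, state, lr=0.1):
    """One gradient step on the squared error; the target is never updated."""
    pred, tgt = apply(predictor, state), apply(target, state)
    for j, row in enumerate(predictor):
        err = pred[j] - tgt[j]
        for i in range(len(row)):
            row[i] -= lr * 2.0 * err * state[i]

rng = random.Random(0)
target = make_matrix(4, 3, rng)      # fixed random target
predictor = make_matrix(4, 3, rng)   # trainable predictor

visited = [1.0, 0.0, 0.0]            # state the agent keeps seeing
novel = [0.0, 1.0, 0.0]              # state the agent has never seen

bonus_before = rnd_bonus(predictor, target, visited)
novel_before = rnd_bonus(predictor, target, novel)
for _ in range(50):
    train_predictor(predictor, target, visited)
bonus_after = rnd_bonus(predictor, target, visited)
novel_after = rnd_bonus(predictor, target, novel)
print(bonus_before, bonus_after, novel_after)
```

With one-hot inputs the novel state's bonus is untouched by training on the visited state. With real networks and overlapping features the separation is less clean, but the same trend is what the method relies on.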
Section Summary
- DQN's max operator tends to cause overestimation bias; Double DQN reduces this by separating selection and evaluation.
- Dueling DQN decomposes $Q(s, a)$ into $V(s)$ and $A(s, a)$, helping the network learn state quality and action differences more cleanly.
- PER focuses updates on high-TD-error transitions while using importance sampling to reduce bias.
- N-step returns speed up reward propagation; distributional RL models return distributions; NoisyNet provides structured exploration.
- Rainbow combines many of these, often improving Atari performance but increasing complexity.
- Intrinsic reward methods make exploration explicit and can help in sparse-reward settings.
Next, we move from vector states to pixel observations, where representation learning and training conditions become the dominant new challenges: Hands-on: Visual game projects.