Exploration Strategies in Deep Reinforcement Learning
Abstract
Effective exploration remains a fundamental challenge in reinforcement learning. We review classical and modern exploration strategies, analyzing their theoretical properties and empirical performance across different domains.
The Exploration-Exploitation Dilemma
Problem Statement
Agent must balance:
- Exploitation: Choose actions known to yield high rewards
- Exploration: Try new actions to discover potentially better strategies
Mathematical formulation:
max_π E[Σ_t r_t], while gathering enough information about rarely tried actions to avoid settling on a suboptimal policy
Why It's Hard
- Sparse rewards (e.g., only at goal state)
- Large state/action spaces
- Stochastic environments
- Delayed consequences
Classical Exploration Methods
1. ε-Greedy
Algorithm:
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon take a uniformly random action, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(len(Q[state]))  # Explore
    return int(np.argmax(Q[state]))             # Exploit
Pros: Simple to implement and computationally cheap.
Cons: Sample-inefficient; it explores uniformly at random, ignoring what the agent has already learned.
2. Softmax (Boltzmann)
Policy:
π(a|s) = exp(Q(s,a)/τ) / Σ_{a'} exp(Q(s,a')/τ)
Temperature τ controls exploration:
- High τ: More uniform (explore)
- Low τ: More greedy (exploit)
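A minimal sketch of Boltzmann action selection over the Q-values of a single state; the function name and the use of NumPy are assumptions for illustration:

import numpy as np

def boltzmann_action(q_values, tau=1.0):
    # Subtract the max Q-value for numerical stability before exponentiating
    logits = (np.asarray(q_values) - np.max(q_values)) / tau
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))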
3. Upper Confidence Bound (UCB)
Action selection:
a* = argmax_a [Q(s,a) + c√(ln t / N(s,a))]
Intuition: Bonus for rarely-visited actions
Guarantee: Logarithmic regret bound (in the multi-armed bandit setting)
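A rough sketch of UCB action selection, assuming q_values and counts are NumPy arrays over the actions available in the current state and t is the total number of steps so far:

import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    # Unvisited actions get an infinite bonus so each is tried at least once
    bonus = c * np.sqrt(np.log(max(t, 1)) / np.maximum(counts, 1))
    bonus = np.where(counts == 0, np.inf, bonus)
    return int(np.argmax(np.asarray(q_values) + bonus))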
Intrinsic Motivation Methods
Curiosity-Driven Exploration
Idea: Reward agent for discovering novel states
Implementations:
Count-Based
Bonus = β/√N(s)
Where N(s) is visit count.
Challenge: Doesn't scale to large state spaces
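For small discrete problems the bonus can be computed with a plain visit counter; this sketch assumes states are hashable, and the class name and default β are illustrative:

from collections import defaultdict
import math

class CountBonus:
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def __call__(self, state):
        # Increment the visit count and return the decaying bonus beta / sqrt(N(s))
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])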
Prediction Error
Bonus = ||f(s_{t+1}) - f̂(s_{t+1})||²
Reward for states that are hard to predict.
Example: Intrinsic Curiosity Module (ICM)
Random Network Distillation (RND)
Method:
- Fixed random network f
- Learned network f̂
- Bonus = ||f(s) - f̂(s)||²
Advantage: The predictor's error shrinks on frequently visited states, so the bonus decays naturally as states become familiar
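A compact PyTorch sketch of the RND bonus; the class name, network sizes, and feature dimension are illustrative assumptions. The same per-state error is also minimized as the predictor's training loss:

import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))
        self.target = mlp()      # fixed, randomly initialized network f
        self.predictor = mlp()   # learned network f-hat
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        # Per-state squared prediction error = exploration bonus
        return (self.predictor(obs) - target_feat).pow(2).mean(dim=-1)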
Information Theory Approaches
Empowerment
Maximize mutual information between actions and states:
max I(A_t; S_{t+k} | S_t)
Interpretation: Choose actions that maximize control over future
Maximum Entropy RL
Add entropy term to objective:
J = E[Σ r_t + α H(π(·|s_t))]
Benefit: Prevents premature convergence to a deterministic, suboptimal policy
Example: SAC (Soft Actor-Critic)
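For a discrete policy parameterized by action logits, the per-state entropy term can be computed as in this sketch (PyTorch assumed; the coefficient α is illustrative):

import torch
import torch.nn.functional as F

def entropy_bonus(logits, alpha=0.2):
    # H(pi(.|s)) for a discrete policy, scaled by the temperature alpha
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return alpha * entropy  # added to the objective at each step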
Deep RL Specific Methods
NoisyNet
Approach: Add learnable noise to network parameters
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), 0.017))  # learnable noise scale

    def forward(self, x):
        noise = torch.randn_like(self.weight_sigma)  # fresh noise each forward pass
        weight = self.weight_mu + self.weight_sigma * noise
        return F.linear(x, weight)
Benefit: State-dependent exploration
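Illustrative usage of the sketch above (layer sizes and action count are arbitrary):

layer = NoisyLinear(128, 6)              # e.g. a DQN output head with 6 actions
q_values = layer(torch.randn(1, 128))    # noise is resampled on every forward pass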
Bootstrapped DQN
Idea: Maintain ensemble of Q-networks
Exploration: Sample random network from ensemble
Intuition: Uncertainty → exploration
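One way this could look in code, assuming each head is a callable mapping a state tensor to Q-values (class and method names are illustrative):

import random

class BootstrappedPolicy:
    def __init__(self, q_heads):
        self.q_heads = q_heads   # ensemble of Q-networks
        self.active = None

    def begin_episode(self):
        # Commit to one head for the whole episode -> deep, consistent exploration
        self.active = random.choice(self.q_heads)

    def act(self, state):
        return int(self.active(state).argmax())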
Parameter Space Noise
Add noise to policy parameters instead of actions:
θ̃ = θ + ε, where ε ~ N(0, σ²I)
Advantage: The same perturbation is kept for a whole episode, giving temporally correlated exploration
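A hedged sketch of episode-level parameter perturbation for a PyTorch policy (the function name and σ are illustrative):

import copy
import torch

def perturbed_copy(policy, sigma=0.1):
    # Act with this noisy copy for an entire episode, then resample
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma)
    return noisy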
Goal-Conditioned Methods
Hindsight Experience Replay (HER)
Key insight: Learn from failures by relabeling goals
Algorithm:
- Attempt to reach goal g, end at g'
- Replay with goal g' (which we actually achieved)
- Learn Q(s,a|g')
Result: Sparse rewards become dense
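A minimal relabeling sketch using the "final" strategy; the Transition layout, the 0/−1 sparse-reward convention, and discrete comparable states are assumptions for illustration, not the paper's exact interface:

from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state goal")

def her_relabel(episode):
    # Pretend the last achieved state was the goal all along
    achieved = episode[-1].next_state
    return [t._replace(goal=achieved,
                       reward=0.0 if t.next_state == achieved else -1.0)
            for t in episode]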
Automatic Curriculum Learning
Idea: Gradually increase task difficulty
Approaches:
- Teacher-student (teacher proposes goals)
- Self-play (agent generates own tasks)
- Asymmetric self-play
Hierarchical Exploration
Options Framework
Concept: Learn temporally extended actions (options)
Hierarchy:
High-level policy → Option selection
Low-level policy → Primitive actions
Benefit: Explore in abstract action space
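As a data-structure sketch, an option can be represented by its three standard components (field names are illustrative):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation: Callable   # state -> bool: where the option may be invoked
    policy: Callable       # state -> primitive action while the option runs
    termination: Callable  # state -> probability of ending the option there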
Feudal Networks
Structure:
- Manager: Sets sub-goals
- Worker: Achieves sub-goals
Exploration: Manager explores goal space, worker explores action space
Benchmarks and Evaluation
Atari Games
- DQN baseline
- Improvements from NoisyNet, RND
MuJoCo Continuous Control
- Sparse reward variants
- Comparison: ε-greedy vs. entropy bonus
Montezuma's Revenge
- Classic hard exploration game
- Requires curiosity-driven methods
Procedurally Generated Environments
- Test generalization
- Avoid overfitting to specific layouts
Theoretical Analysis
Regret Bounds
Definition:
Regret(T) = Σ_{t=1}^{T} (V* − V^{π_t})
Results (multi-armed bandit setting):
- ε-greedy (fixed ε): linear regret, Ω(T)
- UCB: O(log T)
- Thompson Sampling: O(log T)
Sample Complexity
Question: How many samples needed to learn near-optimal policy?
Depends on:
- MDP structure (diameter, mixing time)
- Exploration strategy
- Function approximation
Practical Recommendations
General Guidelines
- Start simple: ε-greedy often works
- Dense rewards: Less exploration needed
- Sparse rewards: Try curiosity methods
- Continuous control: Add entropy bonus
- Hard exploration: Consider HER or RND
Hyperparameter Tips
ε decay schedule:
epsilon = max(epsilon_min, epsilon_start * decay_rate^step)
Typical: Start at 1.0, end at 0.01-0.1
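As a small helper implementing this schedule (the default constants are illustrative):

def epsilon_schedule(step, epsilon_start=1.0, epsilon_min=0.05, decay_rate=0.999):
    # Multiplicative decay, clipped at the floor epsilon_min
    return max(epsilon_min, epsilon_start * decay_rate ** step)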
Curiosity bonus coefficient: Start small (0.01-0.1), tune based on reward scale
Open Problems
- Exploration in very large action spaces
- Efficient exploration with continuous actions
- Transfer of exploration strategies
- Safe exploration in real-world systems
- Theoretical understanding of deep RL exploration
Implementation Example
import torch

class CuriosityDrivenAgent:
    def __init__(self, env, beta=0.1):
        self.policy = PolicyNetwork()        # assumed to be defined elsewhere
        self.forward_model = ForwardModel()  # predicts next-state features from (s, a)
        self.inverse_model = InverseModel()  # predicts the action from (s, s') features
        self.beta = beta                     # intrinsic reward coefficient

    def intrinsic_reward(self, state, action, next_state):
        # Forward-model prediction error serves as the curiosity signal
        pred_next = self.forward_model(state, action)
        return torch.sum((pred_next - next_state) ** 2).item()

    def update(self, state, action, reward, next_state):
        # Augment the extrinsic reward with the intrinsic curiosity bonus
        curiosity = self.intrinsic_reward(state, action, next_state)
        total_reward = reward + self.beta * curiosity
        # Standard RL update with the augmented reward
        self.policy.update(state, action, total_reward)
Conclusion
Exploration remains an active research area with significant practical impact. The choice of exploration strategy should be guided by the problem structure, reward sparsity, and computational constraints. Modern methods combining intrinsic motivation with deep RL show promise but require careful tuning.
References
- Sutton & Barto, "Reinforcement Learning: An Introduction" (2018)
- Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction" (2017)
- Burda et al., "Exploration by Random Network Distillation" (2019)
- Andrychowicz et al., "Hindsight Experience Replay" (2017)
- Fortunato et al., "Noisy Networks for Exploration" (2018)