
Exploration Strategies in Deep Reinforcement Learning

Chinmaya Sahu · 5 min read · 902 words · Updated 2018-10-20
Reinforcement Learning · AI · Exploration

Abstract

Effective exploration remains a fundamental challenge in reinforcement learning. We review classical and modern exploration strategies, analyzing their theoretical properties and empirical performance across different domains.

The Exploration-Exploitation Dilemma

Problem Statement

Agent must balance:

  • Exploitation: Choose actions known to yield high rewards
  • Exploration: Try new actions to discover potentially better strategies

Mathematical formulation:

max_π E[Σ_t r_t] subject to information gathering

Why It's Hard

  • Sparse rewards (e.g., only at goal state)
  • Large state/action spaces
  • Stochastic environments
  • Delayed consequences

Classical Exploration Methods

1. ε-Greedy

Algorithm:

import random
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    if random.random() < epsilon:
        return random.randrange(n_actions)  # Explore: uniform random action
    return int(np.argmax(Q[state]))         # Exploit: greedy action

Pros: Simple, computationally cheap
Cons: Sample-inefficient; explores uniformly at random, ignoring everything the agent has already learned

2. Softmax (Boltzmann)

Policy:

π(a|s) = exp(Q(s,a)/τ) / Σ exp(Q(s,a')/τ)

Temperature τ controls exploration:

  • High τ: More uniform (explore)
  • Low τ: More greedy (exploit)
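
A minimal sketch of sampling from this policy, assuming tabular Q-values stored in a NumPy array (the function name is illustrative, not from a library):

import numpy as np

def boltzmann_action(q_values, tau):
    prefs = (q_values - q_values.max()) / tau  # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return np.random.choice(len(q_values), p=probs)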

3. Upper Confidence Bound (UCB)

Action selection:

a* = argmax[Q(s,a) + c√(ln(t)/N(s,a))]

Intuition: Bonus for rarely-visited actions

Guarantee: Logarithmic regret bound (in the multi-armed bandit setting)
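
A sketch in the bandit setting, where counts tracks N(a) and t is the current timestep (names are illustrative):

import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():
        return int(np.argmax(counts == 0))  # try each action at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))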

Intrinsic Motivation Methods

Curiosity-Driven Exploration

Idea: Reward agent for discovering novel states

Implementations:

Count-Based

Bonus = β/√N(s)

Where N(s) is visit count.

Challenge: Doesn't scale to large state spaces
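
A minimal tabular sketch, assuming states can be reduced to hashable keys (the class name is illustrative):

from collections import defaultdict
import math

class CountBonus:
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), keyed by hashable state

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])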

Prediction Error

Bonus = ||f(s_{t+1}) - f̂(s_{t+1})||²

Reward for states that are hard to predict.

Example: Intrinsic Curiosity Module (ICM); a simplified implementation sketch appears at the end of this post

Random Network Distillation (RND)

Method:

  1. Fixed random network f
  2. Learned network f̂
  3. Bonus = ||f(s) - f̂(s)||²

Advantage: The predictor's error shrinks on familiar states as it catches up to the fixed target, so the bonus naturally decays on visited states while remaining high on novel ones
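
A minimal PyTorch sketch; the MLP encoders and 64-dimensional feature size are illustrative assumptions:

import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))
        self.target = mlp()       # f: fixed random network
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = mlp()    # f̂: trained to match f on visited states

    def forward(self, obs):
        # Squared error serves as both intrinsic reward and predictor loss
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)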

Information Theory Approaches

Empowerment

Maximize mutual information between actions and states:

max I(A_t; S_{t+k} | S_t)

Interpretation: Choose actions that maximize control over future

Maximum Entropy RL

Add entropy term to objective:

J = E[Σ r_t + α H(π(·|s_t))]

Benefit: Prevents premature convergence

Example: SAC (Soft Actor-Critic)
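
One concrete place the entropy term shows up is the soft Bellman target used by SAC-style algorithms; a sketch with hypothetical tensor arguments:

def soft_value_target(reward, q_next, log_pi_next, alpha, gamma, done):
    # V(s') = Q(s', a') - α · log π(a'|s'), with a' sampled from π(·|s')
    soft_v = q_next - alpha * log_pi_next
    return reward + gamma * (1.0 - done) * soft_v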

Deep RL Specific Methods

NoisyNet

Approach: Add learnable noise to network parameters

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        bound = in_features ** -0.5
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), 0.5 * bound))

    def forward(self, x):
        # Simplified (no bias term); fresh noise is sampled on every forward pass
        noise = torch.randn_like(self.weight_sigma)
        weight = self.weight_mu + self.weight_sigma * noise
        return F.linear(x, weight)

Benefit: State-dependent exploration

Bootstrapped DQN

Idea: Maintain ensemble of Q-networks

Exploration: Sample random network from ensemble

Intuition: Uncertainty → exploration
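
A sketch of the episode-level head selection; the q_heads interface is a hypothetical stand-in for an ensemble of Q-networks:

import random

class BootstrappedSelector:
    def __init__(self, q_heads):
        self.q_heads = q_heads  # ensemble of Q-networks
        self.active = None

    def begin_episode(self):
        # Commit to one head for the whole episode ("deep exploration")
        self.active = random.choice(self.q_heads)

    def act(self, state):
        return int(self.active(state).argmax())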

Parameter Space Noise

Add noise to policy parameters instead of actions:

θ̃ = θ + N(0, σ²I)

Advantage: The same perturbed policy acts for an entire rollout, so exploration is temporally consistent rather than independent noise at each step
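
A minimal sketch of the perturbation step, assuming a PyTorch policy (the helper name is illustrative):

import copy
import torch

def perturb_parameters(policy, sigma):
    noisy = copy.deepcopy(policy)  # keep the unperturbed policy for learning
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma)
    return noisy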

Goal-Conditioned Methods

Hindsight Experience Replay (HER)

Key insight: Learn from failures by relabeling goals

Algorithm:

  1. Attempt to reach goal g, end at g'
  2. Replay with goal g' (which we actually achieved)
  3. Learn Q(s,a|g')

Result: Sparse rewards become dense
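
A sketch of the "final" relabeling strategy; Transition and reward_fn are hypothetical, and real implementations typically also relabel with intermediate achieved states:

from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state goal")

def her_relabel(episode, reward_fn):
    achieved = episode[-1].next_state  # the goal g' we actually reached
    return [t._replace(goal=achieved, reward=reward_fn(t.next_state, achieved))
            for t in episode]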

Automatic Curriculum Learning

Idea: Gradually increase task difficulty

Approaches:

  • Teacher-student (teacher proposes goals)
  • Self-play (agent generates own tasks)
  • Asymmetric self-play

Hierarchical Exploration

Options Framework

Concept: Learn temporally extended actions (options)

Hierarchy:

High-level policy → Option selection
Low-level policy → Primitive actions

Benefit: Explore in abstract action space
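
A sketch of how execution alternates between levels; the option.policy / option.terminate interface is one hypothetical rendering of the options framework:

def run_with_options(env, high_level, options, max_steps=1000):
    state, steps = env.reset(), 0
    while steps < max_steps:
        option = options[high_level(state)]  # high level picks an option
        while steps < max_steps:
            state, reward, done, _ = env.step(option.policy(state))  # low level acts
            steps += 1
            if done:
                return
            if option.terminate(state):  # option's termination condition fires
                break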

Feudal Networks

Structure:

  • Manager: Sets sub-goals
  • Worker: Achieves sub-goals

Exploration: Manager explores goal space, worker explores action space

Benchmarks and Evaluation

Atari Games

  • DQN baseline
  • Improvements from NoisyNet, RND

MuJoCo Continuous Control

  • Sparse reward variants
  • Comparison: ε-greedy vs. entropy bonus

Montezuma's Revenge

  • Classic hard exploration game
  • Near-zero score under naive ε-greedy exploration; progress has come from curiosity-driven methods such as RND

Procedurally Generated Environments

  • Test generalization
  • Avoid overfitting to specific layouts

Theoretical Analysis

Regret Bounds

Definition:

Regret(T) = Σ_{t=1}^{T} (V* - V^{π_t})

Results:

  • ε-greedy (fixed ε): linear regret, Ω(T)
  • UCB: O(log T)
  • Thompson Sampling: O(log T)

Sample Complexity

Question: How many samples needed to learn near-optimal policy?

Depends on:

  • MDP structure (diameter, mixing time)
  • Exploration strategy
  • Function approximation

Practical Recommendations

General Guidelines

  1. Start simple: ε-greedy often works
  2. Dense rewards: Less exploration needed
  3. Sparse rewards: Try curiosity methods
  4. Continuous control: Add entropy bonus
  5. Hard exploration: Consider HER or RND

Hyperparameter Tips

ε decay schedule:

epsilon = max(epsilon_min, epsilon_start * decay_rate ** step)

Typical: Start at 1.0, end at 0.01-0.1

Curiosity bonus coefficient: Start small (0.01-0.1), tune based on reward scale

Open Problems

  1. Exploration in very large action spaces
  2. Efficient exploration with continuous actions
  3. Transfer of exploration strategies
  4. Safe exploration in real-world systems
  5. Theoretical understanding of deep RL exploration

Implementation Example

class CuriosityDrivenAgent:
    def __init__(self, env, beta=0.01):
        self.policy = PolicyNetwork()        # placeholder: any policy learner
        self.forward_model = ForwardModel()  # placeholder: predicts next state
        self.inverse_model = InverseModel()  # placeholder: predicts the action taken
        self.beta = beta                     # weight on the intrinsic reward

    def intrinsic_reward(self, state, action, next_state):
        # Predict next state
        pred_next = self.forward_model(state, action)
        # Prediction error = curiosity
        return ((pred_next - next_state) ** 2).sum()

    def update(self, state, action, reward, next_state):
        # Augment extrinsic reward with intrinsic
        curiosity = self.intrinsic_reward(state, action, next_state)
        total_reward = reward + self.beta * curiosity
        # Standard RL update with augmented reward
        self.policy.update(state, action, total_reward)

Conclusion

Exploration remains an active research area with significant practical impact. The choice of exploration strategy should be guided by the problem structure, reward sparsity, and computational constraints. Modern methods combining intrinsic motivation with deep RL show promise but require careful tuning.

References

  1. Sutton & Barto, "Reinforcement Learning: An Introduction" (2018)
  2. Pathak et al., "Curiosity-driven Exploration by Self-supervised Prediction" (2017)
  3. Burda et al., "Exploration by Random Network Distillation" (2019)
  4. Andrychowicz et al., "Hindsight Experience Replay" (2017)
  5. Fortunato et al., "Noisy Networks for Exploration" (2018)