Sample-based learning methods
Monte Carlo methods for prediction and control
Introduction to Monte Carlo methods
The dynamic programming approach assumed that we know the environment's transition probabilities (and rewards), which is unlikely in practice
Monte Carlo methods can be used to estimate the value function without this knowledge
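A minimal sketch of every-visit Monte Carlo prediction, assuming a hypothetical `generate_episode(policy)` helper that runs one terminated episode and returns its (state, action, reward) triples:

```python
# Sketch: every-visit Monte Carlo prediction from sampled episodes.
# `generate_episode(policy)` is a hypothetical helper returning a list of
# (state, action, reward) triples for one terminated episode.
from collections import defaultdict

def mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    values = defaultdict(float)   # V(s) estimates
    counts = defaultdict(int)     # number of returns averaged into each V(s)
    for _ in range(num_episodes):
        episode = generate_episode(policy)
        G = 0.0
        # Walk the episode backwards so G is the discounted return from each state.
        for state, _action, reward in reversed(episode):
            G = gamma * G + reward
            counts[state] += 1
            values[state] += (G - values[state]) / counts[state]  # incremental mean
    return values
```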
Monte Carlo for control
We can estimate action values in the same way, though this requires sufficient exploration
Exploring starts pick the first state and action randomly, then follow the policy
Exploration methods for Monte Carlo
Exploring starts assume that we can sample the initial state and action randomly, but this may not be possible
Epsilon-soft policies take each action with probability at least ε/|A(s)|
Generalization of epsilon-greedy policies (epsilon-greedy is the epsilon-soft policy that puts all remaining probability on the greedy action)
We will end up finding the optimal epsilon-soft policy rather than the optimal policy, though the difference may be small in practice
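A small sketch of epsilon-greedy action selection, the most common epsilon-soft policy; `q_values` is assumed to be an array of action values for the current state:

```python
# Sketch: epsilon-greedy action selection over one state's action values.
# Every action has probability at least epsilon / len(q_values), so the policy is epsilon-soft.
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action
```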
Off-policy learning for prediction
Epsilon-soft policies are a compromise: the forced exploration means they are optimal for neither acting nor learning about the optimal policy
Off-policy methods learn about a target policy while following a different policy (the behaviour policy)
The target policy is usually the optimal policy
The behaviour policy generally explores more (though there are other applications such as learning from demonstration)
On-policy learning can be thought of as a special case where the target and behaviour policies are identical
Importance sampling can be used to reweight the returns from the behaviour policy
May have high variance, particularly for longer trajectories, since the importance-sampling ratio is a product of one probability ratio per step
Other approaches such as parametric estimates may have lower variance (at the cost of greater bias)
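A sketch of ordinary importance sampling for one episode; the `target_policy(a, s)` and `behaviour_policy(a, s)` interfaces (returning action probabilities) are illustrative assumptions, not from the notes:

```python
# Sketch: ordinary importance sampling of one episode's return.
# The policies are assumed to be callables returning pi(a | s) and b(a | s).
def importance_sampled_return(episode, target_policy, behaviour_policy, gamma=1.0):
    rho = 1.0   # importance-sampling ratio: product of per-step probability ratios
    G = 0.0
    for t, (state, action, reward) in enumerate(episode):
        rho *= target_policy(action, state) / behaviour_policy(action, state)
        G += (gamma ** t) * reward
    return rho * G   # reweighted return; averaging these over episodes estimates the target value
```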
Temporal difference learning methods for prediction
Introduction to temporal difference learning
Incremental updates
TD(0) algorithm
Set the estimated value of the terminal state to zero and initialize the others arbitrarily
Update estimates at each step according to V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) − V(S_t)]
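A TD(0) prediction sketch under the assumption of a Gym-style `env.reset()`/`env.step()` interface (not specified in the notes); unseen states default to a value of 0, which also covers the terminal state:

```python
# Sketch: TD(0) prediction with a Gym-style environment (assumed interface).
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    V = defaultdict(float)   # unseen states (including the terminal state) default to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)    # assumed interface
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])         # TD(0) update
            state = next_state
    return V
```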
Advantages of TD
Don't require a model of the environment
Incremental online updates
Usually converge faster than Monte Carlo methods
Temporal difference learning methods for control
TD for control
SARSA algorithm
Used to estimate action values
Can be used in a generalized policy iteration framework
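A SARSA sketch under the same assumed Gym-style interface; the epsilon-greedy behaviour policy and the hyperparameter values are illustrative choices:

```python
# Sketch: SARSA (on-policy TD control) with an epsilon-greedy behaviour policy.
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes, num_actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))

    def choose(state):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state = env.reset()
        action = choose(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)     # assumed interface
            next_action = choose(next_state)                 # action actually taken next
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```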
Off-policy TD control: Q-learning
Q-learning is very similar to SARSA, but learns off-policy
SARSA : Bellman equation :: Q-learning : Bellman optimality equation
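A sketch of the Q-learning update on its own; it can replace the SARSA target in the loop above, since only the bootstrap term changes:

```python
# Sketch: one Q-learning update. Compared with SARSA, only the bootstrap term
# changes: it uses the greedy (max) next action, not the action actually taken.
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, done, alpha=0.5, gamma=1.0):
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state][action] += alpha * (target - Q[state][action])
```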
Expected SARSA
Instead of using the actual next action (SARSA) or the max next action (Q-learning), Expected SARSA bootstraps from the expected action value under the target policy, i.e. a weighted average over all possible next actions
This reduces variance vs. SARSA (allowing a larger step size), but increases the computational cost of each update
Reduces to Q-learning if using greedy policy as the target
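A sketch of the Expected SARSA update with an epsilon-greedy target policy (the choice of target policy here is an assumption for illustration):

```python
# Sketch: one Expected SARSA update with an epsilon-greedy target policy.
# The bootstrap term averages Q(S', .) under the policy's action probabilities.
import numpy as np

def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          num_actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    probs = np.full(num_actions, epsilon / num_actions)
    probs[int(np.argmax(Q[next_state]))] += 1.0 - epsilon   # extra mass on the greedy action
    expected_q = float(np.dot(probs, Q[next_state]))
    target = reward + (0.0 if done else gamma * expected_q)
    Q[state][action] += alpha * (target - Q[state][action])
```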
Planning, learning, and acting
What is a model?
Models store information about the dynamics (transitions and rewards)
They can be used for planning (improving the policy)
e.g., by updating the value function based on simulated experiences
Sample models generate one sampled outcome at a time: given (S, A), return a single (S′, R)
Distribution models give the full distribution p(s′, r | s, a)
Sample models require less memory (consider a distribution model for the outcome of 100 dice rolls), but can't compute expected values directly
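A minimal tabular sample model for a deterministic environment (the same assumption Dyna-Q makes below); it stores observed transitions and replays them when queried:

```python
# Sketch: a tabular sample model for a deterministic environment.
import random

class SampleModel:
    def __init__(self):
        self.transitions = {}   # (state, action) -> (next_state, reward)

    def update(self, state, action, next_state, reward):
        self.transitions[(state, action)] = (next_state, reward)

    def sample_visited(self):
        # Search control: a random previously observed (state, action) pair.
        return random.choice(list(self.transitions))

    def sample(self, state, action):
        # Return the stored outcome as the "sampled" next state and reward.
        return self.transitions[(state, action)]
```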
Planning
Random-sample one-step tabular Q-planning is similar to Q-learning but with simulated experience
Looks one step ahead
Randomly samples state-action pairs using a sample model
Uses a table for Q-values
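One planning update as a sketch, reusing the `SampleModel` above; `Q` is assumed to be a table mapping states to arrays of action values:

```python
# Sketch: one random-sample one-step tabular Q-planning update from simulated experience.
import numpy as np

def q_planning_step(Q, model, alpha=0.5, gamma=0.95):
    state, action = model.sample_visited()             # random previously observed pair
    next_state, reward = model.sample(state, action)   # simulated one-step outcome
    target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])   # Q-learning-style update
```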
Dyna as a formalism for planning
Dyna
Learns a model and makes direct RL updates based on real experience
Plans from simulated experience, taking a fixed number of planning steps per real step
Search control determines which states and actions to focus on during planning
Tabular Dyna-Q
Epsilon-greedy action
Q-learning update
Model update: Model(S, A) ← R, S′ (assumes deterministic environment)
Planning loop
Search control
S ← random previously observed state
A ← random action previously taken in S
R, S′ ← Model(S, A); Q-learning update from this simulated experience
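A tabular Dyna-Q sketch combining the steps above, again assuming a Gym-style environment and deterministic dynamics; the hyperparameters are illustrative:

```python
# Sketch: tabular Dyna-Q (direct RL + model learning + planning each real step).
import numpy as np
from collections import defaultdict

def dyna_q(env, num_episodes, num_actions, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1):
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(num_actions))
    model = {}   # (state, action) -> (next_state, reward); deterministic environment assumed

    def choose(state):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose(state)                                   # epsilon-greedy action
            next_state, reward, done = env.step(action)              # real experience
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])  # direct RL (Q-learning)
            model[(state, action)] = (next_state, reward)            # model update
            for _ in range(planning_steps):                          # planning loop
                s, a = list(model)[rng.integers(len(model))]         # search control
                s2, r = model[(s, a)]
                Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
            state = next_state
    return Q
```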
Dealing with inaccurate models
Incomplete models limit planning to transitions that have actually been observed, slowing learning
Planning may make things worse if the environment is non-stationary
Might be worth exploring transitions that haven't been seen recently
One approach is to add an exploration bonus to the reward used in planning: R + κ√τ
where τ is the number of time steps since the transition was last tried
Adding this to the planning updates in Dyna-Q gives the Dyna-Q+ algorithm
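A sketch of one Dyna-Q+ planning update; `last_tried`, `current_step`, and `kappa` are assumed bookkeeping and hyperparameter names, not from the notes:

```python
# Sketch: one Dyna-Q+ planning update with an exploration bonus kappa * sqrt(tau),
# where tau is the number of steps since (s, a) was last tried in the real environment.
# `last_tried`, `current_step`, and `kappa` are assumed bookkeeping/hyperparameter names.
import math
import numpy as np

def dyna_q_plus_planning_update(Q, model, last_tried, current_step, s, a,
                                alpha=0.5, gamma=0.95, kappa=1e-3):
    s2, r = model[(s, a)]
    tau = current_step - last_tried[(s, a)]
    Q[s][a] += alpha * (r + kappa * math.sqrt(tau) + gamma * np.max(Q[s2]) - Q[s][a])
```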