Sample-based learning methods

Monte Carlo methods for prediction and control

Introduction to Monte Carlo methods

  • The dynamic programming approach assumed that we know the transition probabilities, which is rarely the case in practice

  • Monte Carlo methods can be used to estimate the value function without this knowledge
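
A minimal sketch of first-visit Monte Carlo prediction under these assumptions: `generate_episode()` is a hypothetical helper that runs the policy to termination and returns a list of `(state, reward)` pairs, with the reward being the one received after leaving that state.

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of the state-value function V.

    generate_episode() is assumed to run the policy to termination and
    return a list of (state, reward) pairs, where reward is R_{t+1}.
    No transition probabilities are needed -- only sampled episodes.
    """
    V = defaultdict(float)        # value estimates, initialised to 0
    n_visits = defaultdict(int)   # first-visit counts per state

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk backwards through the episode, accumulating the return G_t.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only update on the earliest occurrence.
            if all(s != state for s, _ in episode[:t]):
                n_visits[state] += 1
                # Incremental average of the observed returns.
                V[state] += (G - V[state]) / n_visits[state]
    return V
```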

Monte Carlo for control

  • We can estimate action values in the same way, though this requires sufficient exploration

  • Exploring starts pick the first state and action randomly, then follow the policy
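
A sketch of how one episode with an exploring start might be generated, assuming hypothetical `states`, `actions`, and a `step(s, a)` sampler returning `(reward, next_state, done)`: only the first state-action pair is random, everything after it follows the current policy.

```python
import random

def exploring_starts_episode(states, actions, step, policy, max_steps=1000):
    """Generate one episode for Monte Carlo control with exploring starts.

    The first state-action pair is drawn uniformly at random (the
    "exploring start"); every subsequent action follows `policy`.
    step(s, a) is an assumed sampler returning (reward, next_state, done).
    """
    s = random.choice(states)     # random initial state
    a = random.choice(actions)    # random initial action
    trajectory = []
    for _ in range(max_steps):
        r, s_next, done = step(s, a)
        trajectory.append((s, a, r))
        if done:
            break
        s = s_next
        a = policy(s)             # after the start, follow the current policy
    return trajectory
```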

Exploration methods for Monte Carlo

  • Exploring starts assume that we can sample the initial state and action randomly, but this may not be possible

  • Epsilon-soft policies take each action with probability at least $\frac{\varepsilon}{|A|}$

    • A generalization of epsilon-greedy policies (every epsilon-greedy policy is epsilon-soft)

    • We will end up finding the optimal epsilon-soft policy rather than the optimal policy, though the difference may be small in practice
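
For illustration, a small sketch of the epsilon-greedy member of the epsilon-soft family: every action receives at least ε/|A| probability, with the remaining 1 − ε placed on the greedy action. The `q_values` list stands in for the action-value estimates of a single state.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy in one state.

    Every action gets at least epsilon / |A|; the greedy action gets the
    remaining 1 - epsilon on top of that, so the policy is epsilon-soft.
    """
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon):
    """Draw an action according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]
```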

Off-policy learning for prediction

  • Epsilon-soft policies are optimal for neither acting nor learning

  • Off-policy methods learn about a target policy $\pi(a \mid s)$ from following a different policy (the behaviour policy $b(a \mid s)$)

    • The target is usually the optimal policy

    • The behaviour policy generally explores more (though there are other applications such as learning from demonstration)

  • On-policy learning can be thought of as a special case where the target and behaviour policies are identical

  • Importance sampling can be used to reweight the returns from the behaviour policy (see the sketch after this list)

    • $\mathbb{E}_{\pi}[X] = \sum_{x \in X} x\,p(x) = \sum_{x \in X} x\,\frac{p(x)}{b(x)}\,b(x) = \sum_{x \in X} x\,\rho(x)\,b(x) = \mathbb{E}_b[X\rho(X)]$, where $p$ is the distribution of $X$ under the target policy and $\rho(x) \doteq \frac{p(x)}{b(x)}$

    • $\mathbb{E}_{\pi}[X] \approx \frac{1}{n}\sum_{i=1}^n x_i\,\rho(x_i)$, with samples $x_i \sim b$

    • $\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$

    • May have high variance, particularly for longer trajectories

    • Other approaches such as parametric estimates may have lower variance (at the cost of greater bias)
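
A sketch of ordinary importance sampling over whole episodes, assuming hypothetical `pi(a, s)` and `b(a, s)` functions that return action probabilities, and episodes recorded as `(state, action, reward)` triples while following b.

```python
def importance_sampling_estimate(episodes, pi, b, gamma=1.0):
    """Ordinary importance-sampling estimate of the start-state value under pi.

    episodes: trajectories of (state, action, reward) triples collected
    while following the behaviour policy b.  pi(a, s) and b(a, s) are
    assumed to return action probabilities, with b(a, s) > 0 wherever
    pi(a, s) > 0 (coverage).
    """
    weighted_returns = []
    for episode in episodes:
        G = 0.0     # discounted return of the episode
        rho = 1.0   # importance ratio rho_{0:T-1}
        for t, (s, a, r) in enumerate(episode):
            G += (gamma ** t) * r
            rho *= pi(a, s) / b(a, s)
        weighted_returns.append(rho * G)
    # Plain average of rho * G; variance grows with trajectory length.
    return sum(weighted_returns) / len(weighted_returns)
```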

Temporal difference learning methods for prediction

Introduction to temporal difference learning

  • Incremental updates: $V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]$

  • TD(0) algorithm

    • Set the estimated value of the terminal state to zero and initialize the others arbitrarily

    • Update estimates at each step according to $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
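
A minimal tabular TD(0) sketch, assuming a hypothetical environment interface `env_reset()` and `env_step(s, a) -> (reward, next_state, done)`.

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, num_episodes,
                   alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    env_reset() returns a start state; env_step(s, a) is assumed to
    return (reward, next_state, done).  Terminal states are never
    updated, so their value stays at the required 0.
    """
    V = defaultdict(float)   # arbitrary init (0 here)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # One-step bootstrapped target: R_{t+1} + gamma * V(S_{t+1})
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```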

Advantages of TD

  • Don't require a model of the environment

  • Incremental online updates

  • Usually converge faster than Monte Carlo methods

Temporal difference learning methods for control

TD for control

  • SARSA algorithm

  • The name comes from the quintuple used in each update: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$

  • Used to estimate action values

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$

  • Can be used in a generalized policy iteration framework
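
A sketch of tabular SARSA with an epsilon-greedy policy, using the same hypothetical `env_reset`/`env_step` interface as in the TD(0) sketch above.

```python
import random
from collections import defaultdict

def sarsa(env_reset, env_step, actions, num_episodes,
          alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular SARSA with an epsilon-greedy policy (on-policy TD control).

    env_step(s, a) is assumed to return (reward, next_state, done).
    """
    Q = defaultdict(float)   # Q[(state, action)], default 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = eps_greedy(s_next)   # the action that will actually be taken
            # Bootstrap from Q(S', A'): the policy's own next action.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```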

Off-policy TD control: Q-learning

  • Q-learning is very similar to SARSA, but learns off-policy

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)]$

  • SARSA relates to the Bellman equation the way Q-learning relates to the Bellman optimality equation

  • Q-learning learns the value of the optimal policy (assuming the greedy policy will actually be executed), while SARSA learns the value of the policy being followed, including its exploration (useful when learning online)
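
Since only the bootstrap target changes relative to SARSA, a single-update sketch is enough; `Q` is assumed to be a `defaultdict(float)` keyed by `(state, action)` pairs.

```python
def q_learning_update(Q, s, a, r, s_next, actions, done,
                      alpha=0.5, gamma=1.0):
    """One Q-learning update.

    Bootstraps from max_a' Q(s', a') regardless of which action the
    behaviour policy takes next, which is what makes it off-policy.
    Q is assumed to be a defaultdict(float) keyed by (state, action).
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```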

Expected SARSA

  • Instead of using the actual next action (SARSA) or the maximum over next-action values (Q-learning), Expected SARSA uses the expected next action value, weighted by the target policy's probabilities

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\,Q(S_{t+1},a) - Q(S_t,A_t)]$

  • This reduces update variance compared to SARSA (allowing a larger step size), but increases the computational cost of each update

  • Reduces to Q-learning when the target policy is greedy
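
A single-update sketch of Expected SARSA, assuming a hypothetical `policy_probs(s)` that returns the target-policy probabilities as a dictionary `{action: pi(action | s)}`.

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, policy_probs, done,
                          alpha=0.5, gamma=1.0):
    """One Expected SARSA update.

    Bootstraps from the expectation of Q(s', .) under the target policy
    rather than from a single sampled next action.  policy_probs(s) is
    assumed to return {action: pi(action | s)}.  With a greedy target
    policy the expectation collapses to the max, recovering Q-learning.
    """
    if done:
        expected_q = 0.0
    else:
        probs = policy_probs(s_next)
        expected_q = sum(probs[a2] * Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])
```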

Planning, learning, and acting

What is a model?

  • Models store information about the dynamics (transitions and rewards)

  • They can be used for planning (improving the policy)

    • e.g., by updating the value function based on simulated experiences

  • Sample models generate one sample at a time: $S_{t+1}, R_{t+1} = \text{Model}(S_t, A_t)$

  • Distribution models give the full distribution: $P(S_{t+1}, R_{t+1} \mid S_t, A_t)$

  • Sample models require less memory (consider a distribution model for the outcome of 100 dice rolls), but can't compute expected values directly
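
A toy illustration of the trade-off with a single die rather than 100 rolls: the distribution model supports exact expected values, while the sample model can only approximate them by averaging samples.

```python
import random

# Sample model: produces one outcome at a time.
def sample_die():
    return random.randint(1, 6)

# Distribution model: stores the probability of every outcome.
die_distribution = {face: 1 / 6 for face in range(1, 7)}

# The distribution model supports exact expected values...
exact_mean = sum(face * p for face, p in die_distribution.items())   # 3.5

# ...while the sample model can only approximate them by averaging samples.
approx_mean = sum(sample_die() for _ in range(10_000)) / 10_000
```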

Planning

  • Random-sample one-step tabular Q-planning is similar to Q-learning but with simulated experience

    • Looks one step ahead

    • Randomly samples state-action pairs using a sample model

    • Uses a table for Q-values

Dyna as a formalism for planning

  • Dyna

    • Learns a model and makes direct RL updates based on real experience

    • Plans from simulated experience, taking a fixed number of planning steps per real step

  • Search control determines which states and actions to focus on during planning

  • Tabular Dyna-Q

    • Epsilon-greedy action

    • Q-learning update: $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]$

    • $\text{Model}(S,A) \leftarrow R, S'$ (assumes a deterministic environment)

    • Planning loop

      • Search control

        • $S \leftarrow$ random previously observed state

        • $A \leftarrow$ random action previously taken in $S$

      • $R, S' \leftarrow \text{Model}(S,A)$

      • Q-learning update from simulated data
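
A compact sketch of tabular Dyna-Q combining the steps above (epsilon-greedy acting, direct Q-learning update, model learning, and the planning loop), again assuming a deterministic environment and the hypothetical `env_reset`/`env_step` interface used in the earlier sketches.

```python
import random
from collections import defaultdict

def dyna_q(env_reset, env_step, actions, num_steps, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning.

    Assumes a deterministic environment with env_step(s, a) -> (r, s', done).
    """
    Q = defaultdict(float)
    model = {}                    # (s, a) -> (r, s'): last observed outcome

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env_reset()
    for _ in range(num_steps):
        # (a) act and (b) direct RL update from the real transition
        a = eps_greedy(s)
        r, s_next, done = env_step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # (c) model learning: remember the observed outcome
        model[(s, a)] = (r, s_next)
        # (d) planning: n Q-learning updates from simulated experience
        for _ in range(planning_steps):
            # Search control: a random previously observed state-action pair
            s_p, a_p = random.choice(list(model))
            r_p, s_p_next = model[(s_p, a_p)]
            best_p = max(Q[(s_p_next, a2)] for a2 in actions)
            Q[(s_p, a_p)] += alpha * (r_p + gamma * best_p - Q[(s_p, a_p)])
        s = env_reset() if done else s_next
    return Q
```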

Dealing with inaccurate models

  • Incomplete models will slow learning

  • Planning may make things worse if the environment is non-stationary

  • Might be worth exploring transitions that haven't been seen recently

    • One approach is to give additional reward for exploration

      • $r + \kappa\sqrt{\tau}$, where $\tau$ is the number of steps since the transition was last tried

      • Adding this to the planning updates in Dyna-Q gives the Dyna-Q+ algorithm
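
A sketch of the planning-time bonus itself, with `kappa` as an assumed small constant.

```python
import math

def planning_reward_bonus(r, tau, kappa=0.001):
    """Dyna-Q+ exploration bonus applied during planning updates.

    tau is the number of real time steps since the transition was last
    tried; long-untried transitions get a larger bonus, so the planner
    is drawn back to them.  kappa is an assumed small constant.
    """
    return r + kappa * math.sqrt(tau)
```

In the Dyna-Q sketch above, the simulated reward `r_p` would become `planning_reward_bonus(r_p, tau)`, with `tau` tracked per state-action pair and reset whenever that pair is actually taken in the environment.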
