Sample-based learning methods

Monte Carlo methods for prediction and control

Introduction to Monte Carlo methods

  • The dynamic programming approach assumed that we know the transition probabilities, which is rarely the case in practice

  • Monte Carlo methods can be used to estimate the value function without this knowledge
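
A minimal sketch of first-visit Monte Carlo prediction under these assumptions: `generate_episode()` is a hypothetical helper that runs the policy to termination and returns a list of `(state, reward)` pairs, with the reward being the one received after leaving that state.

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of the state-value function V.

    generate_episode() is assumed to run the policy to termination and
    return a list of (state, reward) pairs, where reward is R_{t+1}.
    No transition probabilities are needed -- only sampled episodes.
    """
    V = defaultdict(float)        # value estimates, initialised to 0
    n_visits = defaultdict(int)   # first-visit counts per state

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk backwards through the episode, accumulating the return G_t.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only update on the earliest occurrence.
            if all(s != state for s, _ in episode[:t]):
                n_visits[state] += 1
                # Incremental average of the observed returns.
                V[state] += (G - V[state]) / n_visits[state]
    return V
```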

Monte Carlo for control

  • We can estimate action values in the same way, though this requires sufficient exploration

  • Exploring starts pick the first state and action randomly, then follow the policy
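
A sketch of how one episode with an exploring start might be generated, assuming hypothetical `states`, `actions`, and a `step(s, a)` sampler returning `(reward, next_state, done)`: only the first state-action pair is random, everything after it follows the current policy.

```python
import random

def exploring_starts_episode(states, actions, step, policy, max_steps=1000):
    """Generate one episode for Monte Carlo control with exploring starts.

    The first state-action pair is drawn uniformly at random (the
    "exploring start"); every subsequent action follows `policy`.
    step(s, a) is an assumed sampler returning (reward, next_state, done).
    """
    s = random.choice(states)     # random initial state
    a = random.choice(actions)    # random initial action
    trajectory = []
    for _ in range(max_steps):
        r, s_next, done = step(s, a)
        trajectory.append((s, a, r))
        if done:
            break
        s = s_next
        a = policy(s)             # after the start, follow the current policy
    return trajectory
```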

Exploration methods for Monte Carlo

  • Exploring starts assume that we can sample the initial state and action randomly, but this may not be possible

  • Epsilon-soft policies take each action with probability at least $\frac{\varepsilon}{|A|}$

    • A generalization of epsilon-greedy policies (every epsilon-greedy policy is epsilon-soft)

    • We will end up finding the optimal epsilon-soft policy rather than the optimal policy, though the difference may be small in practice
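
For illustration, a small sketch of the epsilon-greedy member of the epsilon-soft family: every action receives at least ε/|A| probability, with the remaining 1 − ε placed on the greedy action. The `q_values` list stands in for the action-value estimates of a single state.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy in one state.

    Every action gets at least epsilon / |A|; the greedy action gets the
    remaining 1 - epsilon on top of that, so the policy is epsilon-soft.
    """
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon):
    """Draw an action according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]
```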

Off-policy learning for prediction

  • Epsilon-soft policies are optimal for neither acting nor learning

  • Off-policy methods learn about a target policy $\pi(a \mid s)$ from following a different policy (the behaviour policy $b(a \mid s)$)

    • The target is usually the optimal policy

    • The behaviour policy generally explores more (though there are other applications such as learning from demonstration)

  • On-policy learning can be thought of as a special case where the target and behaviour policies are identical

  • Importance sampling can be used to reweight the returns from the behaviour policy (see the sketch after this list)

    • $\mathbb{E}_{\pi}[X] = \sum_{x \in X} x\,p(x) = \sum_{x \in X} x\,\frac{p(x)}{b(x)}\,b(x) = \sum_{x \in X} x\,\rho(x)\,b(x) = \mathbb{E}_b[X\rho(X)]$, where $p$ is the distribution of $X$ under the target policy and $\rho(x) \doteq \frac{p(x)}{b(x)}$

    • $\mathbb{E}_{\pi}[X] \approx \frac{1}{n}\sum_{i=1}^n x_i\,\rho(x_i)$, with samples $x_i \sim b$

    • $\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$

    • May have high variance, particularly for longer trajectories

    • Other approaches such as parametric estimates may have lower variance (at the cost of greater bias)
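
A sketch of ordinary importance sampling over whole episodes, assuming hypothetical `pi(a, s)` and `b(a, s)` functions that return action probabilities, and episodes recorded as `(state, action, reward)` triples while following b.

```python
def importance_sampling_estimate(episodes, pi, b, gamma=1.0):
    """Ordinary importance-sampling estimate of the start-state value under pi.

    episodes: trajectories of (state, action, reward) triples collected
    while following the behaviour policy b.  pi(a, s) and b(a, s) are
    assumed to return action probabilities, with b(a, s) > 0 wherever
    pi(a, s) > 0 (coverage).
    """
    weighted_returns = []
    for episode in episodes:
        G = 0.0     # discounted return of the episode
        rho = 1.0   # importance ratio rho_{0:T-1}
        for t, (s, a, r) in enumerate(episode):
            G += (gamma ** t) * r
            rho *= pi(a, s) / b(a, s)
        weighted_returns.append(rho * G)
    # Plain average of rho * G; variance grows with trajectory length.
    return sum(weighted_returns) / len(weighted_returns)
```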

Temporal difference learning methods for prediction

Introduction to temporal difference learning

  • Incremental updates: $V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]$

  • TD(0) algorithm

    • Set the estimated value of the terminal state to zero and initialize the others arbitrarily

    • Update estimates at each step according to $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
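
A minimal tabular TD(0) sketch, assuming a hypothetical environment interface `env_reset()` and `env_step(s, a) -> (reward, next_state, done)`.

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, num_episodes,
                   alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    env_reset() returns a start state; env_step(s, a) is assumed to
    return (reward, next_state, done).  Terminal states are never
    updated, so their value stays at the required 0.
    """
    V = defaultdict(float)   # arbitrary init (0 here)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # One-step bootstrapped target: R_{t+1} + gamma * V(S_{t+1})
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```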

Advantages of TD

  • Don't require a model of the environment

  • Incremental online updates

  • Usually converge faster than Monte Carlo methods

Temporal difference learning methods for control

TD for control

  • SARSA algorithm

  • The name comes from the quintuple used in each update: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$

  • Used to estimate action values

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$

  • Can be used in a generalized policy iteration framework
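
A sketch of tabular SARSA with an epsilon-greedy policy, using the same hypothetical `env_reset`/`env_step` interface as in the TD(0) sketch above.

```python
import random
from collections import defaultdict

def sarsa(env_reset, env_step, actions, num_episodes,
          alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular SARSA with an epsilon-greedy policy (on-policy TD control).

    env_step(s, a) is assumed to return (reward, next_state, done).
    """
    Q = defaultdict(float)   # Q[(state, action)], default 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = eps_greedy(s_next)   # the action that will actually be taken
            # Bootstrap from Q(S', A'): the policy's own next action.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```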

Off-policy TD control: Q-learning

  • Q-learning is very similar to SARSA, but learns off-policy

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)]$

  • SARSA relates to the Bellman equation the way Q-learning relates to the Bellman optimality equation

  • Q-learning learns the value of the optimal policy (assuming the greedy policy will actually be executed), while SARSA learns the value of the policy being followed, including its exploration (useful when learning online)
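
Since only the bootstrap target changes relative to SARSA, a single-update sketch is enough; `Q` is assumed to be a `defaultdict(float)` keyed by `(state, action)` pairs.

```python
def q_learning_update(Q, s, a, r, s_next, actions, done,
                      alpha=0.5, gamma=1.0):
    """One Q-learning update.

    Bootstraps from max_a' Q(s', a') regardless of which action the
    behaviour policy takes next, which is what makes it off-policy.
    Q is assumed to be a defaultdict(float) keyed by (state, action).
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```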

Expected SARSA

  • Instead of using the actual next action (SARSA) or the maximum over next-action values (Q-learning), Expected SARSA uses the expected next action value, weighted by the target policy's probabilities

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\,Q(S_{t+1},a) - Q(S_t,A_t)]$

  • This reduces update variance compared to SARSA (allowing a larger step size), but increases the computational cost of each update

  • Reduces to Q-learning when the target policy is greedy
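
A single-update sketch of Expected SARSA, assuming a hypothetical `policy_probs(s)` that returns the target-policy probabilities as a dictionary `{action: pi(action | s)}`.

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, policy_probs, done,
                          alpha=0.5, gamma=1.0):
    """One Expected SARSA update.

    Bootstraps from the expectation of Q(s', .) under the target policy
    rather than from a single sampled next action.  policy_probs(s) is
    assumed to return {action: pi(action | s)}.  With a greedy target
    policy the expectation collapses to the max, recovering Q-learning.
    """
    if done:
        expected_q = 0.0
    else:
        probs = policy_probs(s_next)
        expected_q = sum(probs[a2] * Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])
```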

Planning, learning, and acting

What is a model?

  • Models store information about the dynamics (transitions and rewards)

  • They can be used for planning (improving the policy)

    • e.g., by updating the value function based on simulated experiences

  • Sample models generate one sample at a time: $S_{t+1}, R_{t+1} = \text{Model}(S_t, A_t)$

  • Distribution models give the full distribution: $P(S_{t+1}, R_{t+1} \mid S_t, A_t)$

  • Sample models require less memory (consider a distribution model for the outcome of 100 dice rolls), but can't compute expected values directly
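
A toy illustration of the trade-off with a single die rather than 100 rolls: the distribution model supports exact expected values, while the sample model can only approximate them by averaging samples.

```python
import random

# Sample model: produces one outcome at a time.
def sample_die():
    return random.randint(1, 6)

# Distribution model: stores the probability of every outcome.
die_distribution = {face: 1 / 6 for face in range(1, 7)}

# The distribution model supports exact expected values...
exact_mean = sum(face * p for face, p in die_distribution.items())   # 3.5

# ...while the sample model can only approximate them by averaging samples.
approx_mean = sum(sample_die() for _ in range(10_000)) / 10_000
```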

Planning

  • Random-sample one-step tabular Q-planning is similar to Q-learning but with simulated experience

    • Looks one step ahead

    • Randomly samples state-action pairs using a sample model

    • Uses a table for Q-values

Dyna as a formalism for planning

  • Dyna

    • Learns a model and makes direct RL updates based on real experience

    • Plans from simulated experience, taking a fixed number of planning steps per real step

  • Search control determines which states and actions to focus on during planning

  • Tabular Dyna-Q

    • Epsilon-greedy action

    • Q-learning update: $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]$

    • $\text{Model}(S,A) \leftarrow R, S'$ (assumes a deterministic environment)

    • Planning loop

      • Search control

        • $S \leftarrow$ random previously observed state

        • $A \leftarrow$ random action previously taken in $S$

      • $R, S' \leftarrow \text{Model}(S,A)$

      • Q-learning update from simulated data
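
A compact sketch of tabular Dyna-Q combining the steps above (epsilon-greedy acting, direct Q-learning update, model learning, and the planning loop), again assuming a deterministic environment and the hypothetical `env_reset`/`env_step` interface used in the earlier sketches.

```python
import random
from collections import defaultdict

def dyna_q(env_reset, env_step, actions, num_steps, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning.

    Assumes a deterministic environment with env_step(s, a) -> (r, s', done).
    """
    Q = defaultdict(float)
    model = {}                    # (s, a) -> (r, s'): last observed outcome

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env_reset()
    for _ in range(num_steps):
        # (a) act and (b) direct RL update from the real transition
        a = eps_greedy(s)
        r, s_next, done = env_step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # (c) model learning: remember the observed outcome
        model[(s, a)] = (r, s_next)
        # (d) planning: n Q-learning updates from simulated experience
        for _ in range(planning_steps):
            # Search control: a random previously observed state-action pair
            s_p, a_p = random.choice(list(model))
            r_p, s_p_next = model[(s_p, a_p)]
            best_p = max(Q[(s_p_next, a2)] for a2 in actions)
            Q[(s_p, a_p)] += alpha * (r_p + gamma * best_p - Q[(s_p, a_p)])
        s = env_reset() if done else s_next
    return Q
```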

Dealing with inaccurate models

  • Incomplete models will slow learning

  • Planning may make things worse if the environment is non-stationary

  • Might be worth exploring transitions that haven't been seen recently

    • One approach is to give additional reward for exploration

      • $r + \kappa\sqrt{\tau}$, where $\tau$ is the number of steps since the transition was last tried

      • Adding this to the planning updates in Dyna-Q gives the Dyna-Q+ algorithm
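
A sketch of the planning-time bonus itself, with `kappa` as an assumed small constant.

```python
import math

def planning_reward_bonus(r, tau, kappa=0.001):
    """Dyna-Q+ exploration bonus applied during planning updates.

    tau is the number of real time steps since the transition was last
    tried; long-untried transitions get a larger bonus, so the planner
    is drawn back to them.  kappa is an assumed small constant.
    """
    return r + kappa * math.sqrt(tau)
```

In the Dyna-Q sketch above, the simulated reward `r_p` would become `planning_reward_bonus(r_p, tau)`, with `tau` tracked per state-action pair and reset whenever that pair is actually taken in the environment.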
