Sample-based learning methods

Monte Carlo methods for prediction and control

Introduction to Monte Carlo methods

  • The dynamic programming approach assumed that we know the transition probabilities, which is unlikely in practice

  • Monte Carlo methods can be used to estimate the value function without this knowledge
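
A minimal sketch of first-visit Monte Carlo prediction along these lines; `generate_episode` is a hypothetical function standing in for interaction with the environment, not part of the course material:

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=0.9):
    """First-visit Monte Carlo estimate of the state-value function.

    generate_episode() is assumed to return a list of (state, reward) pairs
    from following the policy being evaluated, where reward is the reward
    received on leaving that state.
    """
    V = defaultdict(float)            # value estimates, arbitrary (zero) init
    visit_counts = defaultdict(int)   # episodes in which each state was visited

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G is the return following each visit; the last
        # overwrite for a state corresponds to its first visit in the episode.
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G
        for state, G_first in first_visit_return.items():
            visit_counts[state] += 1
            # Incremental sample average of the observed returns
            V[state] += (G_first - V[state]) / visit_counts[state]
    return V
```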

Monte Carlo for control

  • We can estimate action values in the same way, though this requires sufficient exploration

  • Exploring starts pick the first state and action randomly, then follow the policy

Exploration methods for Monte Carlo

  • Exploring starts assume that we can sample the initial state and action randomly, but this may not be possible

  • Epsilon-soft policies take each action with probability at least $\frac{\varepsilon}{|A|}$

    • A generalization of epsilon-greedy policies: any stochastic policy that keeps at least this much probability on every action qualifies, with epsilon-greedy as a special case

    • We will end up finding the optimal epsilon-soft policy rather than the optimal policy, though the difference may be small in practice

Off-policy learning for prediction

  • Epsilon-soft policies are optimal for neither acting nor learning

  • Off-policy methods learn about a target policy $\pi(a \mid s)$ while following a different policy, the behaviour policy $b(a \mid s)$

    • The target is usually the optimal policy

    • The behaviour policy generally explores more (though there are other applications such as learning from demonstration)

  • On-policy learning can be thought of as a special case where the target and behaviour policies are identical

  • Importance sampling can be used to reweight the returns from the behaviour policy

    • $\mathbb{E}_{\pi}[X] = \sum_{x \in X} x\,p(x) = \sum_{x \in X} x\,\frac{p(x)}{b(x)}\,b(x) = \sum_{x \in X} x\,\rho(x)\,b(x) = \mathbb{E}_b[X\rho(X)]$, where $p$ and $b$ are the probabilities of $x$ under the target and behaviour policies and $\rho(x) \doteq \frac{p(x)}{b(x)}$ is the importance sampling ratio

    • Sampling from the behaviour policy, $\mathbb{E}_{\pi}[X] \approx \frac{1}{n}\sum_{i=1}^n x_i\,\rho(x_i)$ with $x_i \sim b$

    • For a trajectory, the per-step ratios multiply: $\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$

    • May have high variance, particularly for longer trajectories

    • Other approaches such as parametric estimates may have lower variance (at the cost of greater bias)
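
An illustrative numeric sketch of reweighting samples with importance sampling; the distributions `pi` and `b` are made up for the example:

```python
import random

def is_estimate(samples, target_prob, behaviour_prob):
    """Ordinary importance sampling: E_pi[X] ~= (1/n) sum_i x_i * pi(x_i) / b(x_i)."""
    return sum(x * target_prob[x] / behaviour_prob[x] for x in samples) / len(samples)

# Illustrative distributions: b is uniform, pi puts most of its weight on 3.
b = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
pi = {1: 0.1, 2: 0.2, 3: 0.7}

random.seed(0)
samples = random.choices(list(b), weights=list(b.values()), k=10_000)

print(is_estimate(samples, pi, b))        # estimate from behaviour samples
print(sum(x * p for x, p in pi.items()))  # true E_pi[X] = 2.6
```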

Temporal difference learning methods for prediction

Introduction to temporal difference learning

  • Incremental updates: $V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]$

  • TD(0) algorithm

    • Set the estimated value of the terminal state to zero and initialize the others arbitrarily

    • Update estimates at each step according to $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
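
A minimal sketch of TD(0), assuming a hypothetical environment object whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    """TD(0) evaluation of `policy`; terminal states keep a value of zero."""
    V = defaultdict(float)  # arbitrary (zero) initialisation
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # step towards the TD target
            state = next_state
    return V
```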

Advantages of TD

  • Advantages

    • Don't require a model of the environment

    • Incremental online updates

    • Usually converge faster than Monte Carlo methods

Temporal difference learning methods for control

TD for control

  • SARSA algorithm

  • Named for the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in each update

  • Used to estimate action values

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$

  • Can be used in a generalized policy iteration framework
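
A minimal sketch of tabular SARSA with an epsilon-greedy policy, using the same hypothetical `reset()`/`step()` environment interface as the TD(0) sketch:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Greedy w.r.t. Q except for a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def sarsa(env, n_actions, num_episodes, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy policy."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```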

Off-policy TD control: Q-learning

  • Q-learning is very similar to SARSA, but learns off-policy

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)]$

  • SARSA is to the Bellman equation as Q-learning is to the Bellman optimality equation

  • Q-learning learns the optimal policy given perfect execution, while SARSA learns the policy under the current behaviour, which is useful when learning online
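
For comparison with SARSA, a minimal sketch of the Q-learning update, assuming `Q[state]` is a list of action values as in the SARSA sketch:

```python
def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.5, gamma=0.9):
    """Off-policy TD update: bootstrap from max_a Q(S', a), regardless of the
    action the behaviour policy will actually take next."""
    best_next = 0.0 if done else max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```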

Expected SARSA

  • Instead of using the actual next action (SARSA) or max next action (Q-learning), expected SARSA uses the expected value weighted over all possible next actions

  • $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\,Q(S_{t+1},a) - Q(S_t,A_t)]$

  • This reduces variance vs. SARSA (allowing a large step size), but increases the computational cost of an update

  • Reduces to Q-learning if using greedy policy as the target
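
A minimal sketch of the Expected SARSA update under the same conventions; `policy_probs` is a hypothetical function giving the target policy's action probabilities:

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          policy_probs, alpha=0.5, gamma=0.9):
    """Bootstrap from the expected value of Q(S', .) under the target policy.

    policy_probs(state) is assumed to return a list of probabilities
    pi(a | state); a greedy target policy makes this match Q-learning.
    """
    if done:
        expected_q = 0.0
    else:
        expected_q = sum(p * q
                         for p, q in zip(policy_probs(next_state), Q[next_state]))
    Q[state][action] += alpha * (reward + gamma * expected_q - Q[state][action])
```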

Planning, learning, and acting

What is a model?

  • Models store information about the dynamics (transitions and rewards)

  • They can be used for planning (improving the policy)

    • e.g., by updating the value function based on simulated experiences

  • Sample models generate one sample at a time: $S'_{t+1}, R_{t+1} = \text{Model}(S_t, A_t)$

  • Distribution models give the full distribution: $P(S'_{t+1}, R_{t+1} \mid S_t, A_t)$

  • Sample models require less memory (consider a distribution model for the outcome of 100 dice rolls), but can't compute expected values directly
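
An illustrative sketch of that dice example: the sample model stays cheap as the number of dice grows, while the distribution model has to enumerate every outcome:

```python
import random
from itertools import product

def sample_model(n_dice):
    """Sample model: returns a single simulated total, one sample at a time."""
    return sum(random.randint(1, 6) for _ in range(n_dice))

def distribution_model(n_dice):
    """Distribution model: probability of every possible total.
    Enumerates 6 ** n_dice outcomes, so it is only practical for small n_dice."""
    dist = {}
    for rolls in product(range(1, 7), repeat=n_dice):
        dist[sum(rolls)] = dist.get(sum(rolls), 0.0) + (1 / 6) ** n_dice
    return dist

print(sample_model(100))      # cheap even for 100 dice, but no expectations directly
print(distribution_model(3))  # exact distribution, feasible only for a few dice
```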

Planning

  • Random-sample one-step tabular Q-planning is similar to Q-learning but with simulated experience

    • Looks one step ahead

    • Randomly samples state-action pairs using a sample model

    • Uses a table for Q-values
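
A minimal sketch of one such planning update; `sample_model`, `observed_states`, and the list-of-action-values `Q` table are assumed conventions carried over from the earlier sketches:

```python
import random

def q_planning_step(Q, sample_model, observed_states, n_actions,
                    alpha=0.5, gamma=0.95):
    """One random-sample one-step tabular Q-planning update."""
    s = random.choice(observed_states)   # randomly sampled state
    a = random.randrange(n_actions)      # randomly sampled action
    r, s_next = sample_model(s, a)       # one-step simulated transition
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])  # Q-learning-style update
```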

Dyna as a formalism for planning

  • Dyna

    • Learns a model and makes direct RL updates based on real experience

    • Plans from simulated experience, taking a fixed number of planning steps per real step

  • Search control determines which states and actions to focus on during planning

  • Tabular Dyna-Q

    • Epsilon-greedy action

    • Q-learning update: $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]$

    • $\text{Model}(S,A) \leftarrow R, S'$ (assumes a deterministic environment)

    • Planning loop

      • Search control

        • $S \leftarrow$ a random previously observed state

        • $A \leftarrow$ a random action previously taken in $S$

      • $R, S' \leftarrow \text{Model}(S,A)$

      • Q-learning update from simulated data
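
Putting the pieces together, a minimal sketch of tabular Dyna-Q under the same assumed environment interface, with the deterministic model noted above:

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, num_episodes, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning on every real step."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    model = {}  # (state, action) -> (reward, next_state); deterministic env assumed

    def act(state):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    def q_update(s, a, r, s_next):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = act(state)
            next_state, reward, done = env.step(action)
            q_update(state, action, reward, next_state)     # direct RL update
            model[(state, action)] = (reward, next_state)   # model learning
            for _ in range(planning_steps):                 # planning from the model
                s, a = random.choice(list(model))           # search control
                r, s_next = model[(s, a)]
                q_update(s, a, r, s_next)
            state = next_state
    return Q
```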

Dealing with inaccurate models

  • Incomplete models will slow learning

  • Planning may make things worse if the environment is non-stationary

  • Might be worth exploring transitions that haven't been seen recently

    • One approach is to give additional reward for exploration

      • $r + \kappa\sqrt{\tau}$, where $\tau$ is the number of time steps since the transition was last tried

      • Adding this to the planning updates in Dyna-Q gives the Dyna-Q+ algorithm
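
A minimal sketch of how the bonus could be added to the planning updates in the Dyna-Q sketch above; `t`, `last_tried`, and the value of `kappa` are assumed bookkeeping for illustration:

```python
import math

def bonus_reward(r, tau, kappa=1e-3):
    """Dyna-Q+ reward used during planning: r + kappa * sqrt(tau), where tau is
    the number of time steps since the transition was last tried for real."""
    return r + kappa * math.sqrt(tau)

# Inside the Dyna-Q planning loop, with t the current time step and
# last_tried[(s, a)] the time step of the most recent real visit:
#   r, s_next = model[(s, a)]
#   q_update(s, a, bonus_reward(r, t - last_tried[(s, a)]), s_next)
```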
