
Prediction and control with function approximation

On-policy prediction with approximation

Estimating value functions with supervised learning

  • In linear value function approximation, the state value is approximated as a linear combination of state features

    • $\hat{v}(s, w) \doteq \sum_i w_i x_i(s) = \langle w, x(s) \rangle$ (see the sketch after this list)

    • The tabular approach is a special case where $x(s)$ is an indicator (one-hot) vector

  • We want to be able to generalize from similar states, but still discriminate between states where relevant

    • Tabular methods have perfect discrimination but no generalization

    • Aggregating all states would result in no discrimination
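
A minimal sketch of the linear form (in Python/NumPy, with illustrative names): the value estimate is just a dot product between the weight vector and the feature vector, and the tabular case corresponds to one-hot features.

```python
import numpy as np

def linear_value(w, x):
    """Linear value estimate: v_hat(s, w) = <w, x(s)>."""
    return np.dot(w, x)

# Tabular special case: x(s) is a one-hot indicator of the state.
n_states = 5
w = np.zeros(n_states)
x_s2 = np.eye(n_states)[2]     # indicator feature vector for state 2
print(linear_value(w, x_s2))   # value estimate for state 2
```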

The objective for on-policy prediction

  • Mean squared value error

    • $\sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s, w)]^2$

    • $\mu(s)$ represents the importance of a given state, which will generally be the proportion of time spent there

  • We can optimize the objective function by combining stochastic gradient descent with Monte Carlo return estimates

    • $w_{t+1} \doteq w_t + \alpha\,[G_t - \hat{v}(S_t, w_t)]\,\nabla\hat{v}(S_t, w_t)$ (see the sketch below)

  • State aggregation partitions the states into sets whose values are assumed to be equal

    • You're basically approximating the table by a smaller table

    • Like tabular methods, this is a special case of linear value feature approximation
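
A sketch of gradient Monte Carlo prediction with state aggregation, assuming NumPy, episodes given as (state, return) pairs, and illustrative names throughout.

```python
import numpy as np

def aggregate_features(state, n_states, n_groups):
    """State aggregation: one-hot indicator of the group containing `state`."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0
    return x

def gradient_mc_sweep(w, episode, alpha, n_states, n_groups):
    """w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w).

    For a linear v_hat, the gradient with respect to w is just the
    feature vector x(S_t), so each update moves the group's value
    towards the observed return.
    """
    for state, g in episode:
        x = aggregate_features(state, n_states, n_groups)
        w += alpha * (g - np.dot(w, x)) * x
    return w
```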

The objective for TD

  • In semi-gradient TD(0), we use the TD(0) estimate of the return in place of $G_t$

  • $w_{t+1} \doteq w_t + \alpha\,[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)]\,\nabla\hat{v}(S_t, w_t)$ (see the sketch below)

  • This is a semi-gradient method as it treats the return estimate as independent of $w$

  • This estimate is biased, but it converges faster than the Monte Carlo approach
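
A sketch of one semi-gradient TD(0) update with linear features, assuming NumPy and an illustrative feature function `x(s)`; the bootstrap target is treated as a constant, so only $\nabla\hat{v}(S_t, w_t) = x(S_t)$ enters the update.

```python
import numpy as np

def semi_gradient_td0_update(x, w, s, r, s_next, alpha, gamma, done):
    """One semi-gradient TD(0) update for linear v_hat(s, w) = <w, x(s)>.

    The target R + gamma * v_hat(S', w) is treated as fixed (no gradient
    flows through it), which is what makes the method "semi-gradient".
    """
    v_s = np.dot(w, x(s))
    v_next = 0.0 if done else np.dot(w, x(s_next))
    td_error = r + gamma * v_next - v_s
    w += alpha * td_error * x(s)   # gradient of v_hat w.r.t. w is x(s)
    return w
```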

Linear TD

  • With linear approximation, this will converge to a point where the expected update is zero (the TD fixed point)

  • $\mathbb{E}[\Delta w_{TD}] = \alpha\,(b - A w_{TD}) = 0$ (see the note below for $A$ and $b$)

  • This will be close to the true minimum if $\gamma$ is small

  • The situation with non-linear function approximation can be messier (e.g., multiple fixed points)
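
Concretely, following the Sutton & Barto treatment this specialization is based on, the expected linear TD(0) update uses

$$A \doteq \mathbb{E}\big[x_t\,(x_t - \gamma x_{t+1})^\top\big], \qquad b \doteq \mathbb{E}\big[R_{t+1}\, x_t\big],$$

so the fixed point is $w_{TD} = A^{-1} b$, and its value error is bounded by $\frac{1}{1-\gamma}$ times the minimum achievable value error, which is why the approximation is good when $\gamma$ is small.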

Constructing features for prediction

Feature construction for linear methods

  • Coarse coding is a generalization of state aggregation where the receptive fields (regions) can overlap, allowing better discrimination

  • Tile coding is a type of coarse coding based on overlapping tilings

  • The tiles are usually square, in which case this approach is computationally efficient
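
A minimal 1-D tile-coding sketch, assuming inputs in [0, 1) and illustrative parameter names; each tiling is offset slightly, and the feature vector concatenates a one-hot indicator of the active tile in each tiling.

```python
import numpy as np

def tile_code(value, n_tilings=4, tiles_per_tiling=8):
    """1-D tile coding for a value in [0, 1): one active tile per tiling.

    Returns a binary feature vector of length n_tilings * tiles_per_tiling;
    the modulo simply wraps values pushed past the last tile by the offset.
    """
    features = np.zeros(n_tilings * tiles_per_tiling)
    for tiling in range(n_tilings):
        offset = tiling / (n_tilings * tiles_per_tiling)   # shift each tiling slightly
        tile = int((value + offset) * tiles_per_tiling) % tiles_per_tiling
        features[tiling * tiles_per_tiling + tile] = 1.0
    return features
```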

Control with approximation

  • SARSA

    • $w \leftarrow w + \alpha\,[R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w)]\,\nabla\hat{q}(S, A, w)$

  • Expected SARSA

    • $w \leftarrow w + \alpha\,[R + \gamma \sum_{a'} \pi(a' \mid S')\,\hat{q}(S', a', w) - \hat{q}(S, A, w)]\,\nabla\hat{q}(S, A, w)$

  • Q-learning

    • $w \leftarrow w + \alpha\,[R + \gamma \max_{a'} \hat{q}(S', a', w) - \hat{q}(S, A, w)]\,\nabla\hat{q}(S, A, w)$
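
A sketch of the semi-gradient SARSA update with linear action-value features, assuming NumPy and an illustrative feature function `x(s, a)`; the Expected SARSA and Q-learning variants only change how the bootstrap term over $a'$ is formed.

```python
import numpy as np

def semi_gradient_sarsa_update(x, w, s, a, r, s_next, a_next, alpha, gamma, done):
    """Semi-gradient SARSA for linear q_hat(s, a, w) = <w, x(s, a)>."""
    q_sa = np.dot(w, x(s, a))
    q_next = 0.0 if done else np.dot(w, x(s_next, a_next))
    td_error = r + gamma * q_next - q_sa
    w += alpha * td_error * x(s, a)   # gradient of q_hat w.r.t. w is x(s, a)
    return w
```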

Exploration under function approximation

  • Optimistic initial values

    • Generalization between states means that we may not explore all of them

    • It is not clear how to set optimistic initial values for non-linear function approximators

  • Epsilon-greedy

    • Continues to explore, but less systematically than optimistic initial values

Average reward

  • The average reward approach maximizes $r(\pi) = \lim_{h \to \infty} \frac{1}{h}\,\mathbb{E}\!\left[\sum_{t=1}^h R_t\right]$

  • Differential return: $G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \dots$
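
In practice, $r(\pi)$ is itself estimated incrementally as $\bar{R}$, giving the differential TD error used later in the actor-critic update:

$$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)$$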

Policy gradient

Learning parameterized policies

  • Rather than learning action values, we can learn the policy $\pi(a \mid s, \theta)$ directly

    • This must be a valid probability distribution

    • One approach is to use the softmax function $\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_{b \in \mathcal{A}} e^{h(s,b,\theta)}}$ to convert action preferences $h(s, a, \theta)$ into probabilities (see the sketch after this list)

  • Advantages

    • Direct action selection, no action search needed

    • Works well with continuous actions

    • Natural handling of randomized policies

  • Disadvantages

    • Can get stuck in local optima

    • Typically needs more training samples

    • Policy structure choice is critical

    • Hard to control exploration
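
A minimal sketch of the softmax policy mentioned above, over linear action preferences, assuming NumPy and an illustrative feature function `x(s, a)`.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Softmax over linear action preferences h(s, a, theta) = <theta, x(s, a)>."""
    prefs = np.array([np.dot(theta, x(s, a)) for a in actions])
    prefs -= prefs.max()                       # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Sampling an action from the parameterized policy (illustrative):
# probs = softmax_policy(theta, x, s, actions)
# a = np.random.choice(len(actions), p=probs)
```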

Policy gradients for continuing tasks

  • Policy-gradient method for average reward

    • $\nabla r(\pi) = \nabla \sum_{s} \mu(s) \sum_{a} \pi(a \mid s, \theta) \sum_{s', r} p(s', r \mid s, a)\, r$

  • $\mu(s)$ depends on $\pi$, but the policy gradient theorem gives us a way to calculate the derivative

    • $\nabla r(\pi) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s, \theta)\, q_\pi(s, a)$

    • Derived from the product rule
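
Sketching that step, following the Sutton & Barto proof outline the course uses: differentiating the state value with the product rule gives

$$\nabla v_\pi(s) = \sum_a \big[\nabla\pi(a \mid s, \theta)\, q_\pi(s, a) + \pi(a \mid s, \theta)\, \nabla q_\pi(s, a)\big],$$

and expanding $\nabla q_\pi(s, a)$ in terms of $\nabla r(\pi)$ and $\nabla v_\pi(s')$, then averaging over $\mu(s)$, leaves only the $\nabla\pi \cdot q_\pi$ term, so the gradient never requires $\nabla\mu(s)$.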

Actor-critic for continuing tasks

  • We can update $\theta$ using stochastic gradient ascent

    • $\theta_{t+1} = \theta_t + \alpha\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}\, q_\pi(S_t, A_t)$

    • $\theta_{t+1} = \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta_t)\, q_\pi(S_t, A_t)$, since $\nabla \ln \pi = \nabla\pi / \pi$

  • Actor-critic policy gradient update rule

    • $\theta_{t+1} = \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta_t)\,[R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)]$
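
A sketch of one continuing-task actor-critic step with a linear critic, assuming NumPy, an illustrative critic feature function `x(s)`, and an assumed helper `grad_log_pi(theta, s, a)` returning $\nabla \ln \pi(A_t \mid S_t, \theta_t)$ (e.g. for the softmax policy above).

```python
import numpy as np

def actor_critic_step(theta, w, r_bar, s, a, r, s_next,
                      x, grad_log_pi, alpha_theta, alpha_w, alpha_rbar):
    """One average-reward actor-critic update (no eligibility traces)."""
    # Differential TD error: R - R_bar + v_hat(S') - v_hat(S)
    delta = r - r_bar + np.dot(w, x(s_next)) - np.dot(w, x(s))
    r_bar += alpha_rbar * delta                               # average-reward estimate
    w += alpha_w * delta * x(s)                               # critic: semi-gradient TD(0)
    theta += alpha_theta * delta * grad_log_pi(theta, s, a)   # actor: policy gradient
    return theta, w, r_bar
```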
