Prediction and control with function approximation

On-policy prediction with approximation

Estimating value functions with supervised learning

  • In linear value-function approximation, the state value is approximated as a linear combination of state features (see the sketch after this list)

    • $\hat{v}(s, w) \doteq \sum_i w_i x_i(s) = \langle w, x(s) \rangle$

    • The tabular approach is a special case where $x(s)$ is an indicator (one-hot) vector for the state

  • We want to be able to generalize from similar states, but still discriminate between states where relevant

    • Tabular methods have perfect discrimination but no generalization

    • Aggregating all states would result in no discrimination
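
A minimal sketch of the linear form above in NumPy; the state count, feature vectors, and weights are illustrative placeholders, not from the source.

```python
import numpy as np

def linear_value(w, x):
    """Linear value estimate: v_hat(s, w) = <w, x(s)>."""
    return np.dot(w, x)

# Tabular case as a special case: x(s) is a one-hot indicator of the state.
n_states = 5                       # hypothetical number of states
w = np.zeros(n_states)             # one weight per state = one table entry
x_s2 = np.eye(n_states)[2]         # indicator feature vector for state 2
print(linear_value(w, x_s2))       # value of state 2 (0.0 before any learning)
```

Note that the gradient of this $\hat{v}$ with respect to $w$ is just $x(s)$, which is what the update rules below rely on.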

The objective for on-policy prediction

  • Mean squared value error

    • $\sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s, w)]^2$

    • μ(s)\mu(s) represents the importance of a given state, which will generally be the proportion of time spent there

  • We can optimize the objective function by combining stochastic gradient descent with Monte Carlo return estimates

    • $w_{t+1} \doteq w_t + \alpha[G_t - \hat{v}(S_t, w_t)]\nabla\hat{v}(S_t, w_t)$

  • State aggregation partitions the states into sets whose values are assumed to be equal

    • You're basically approximating the table by a smaller table

    • Like tabular methods, state aggregation is a special case of linear value-function approximation (a gradient Monte Carlo sketch using it follows below)
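
A minimal sketch of the gradient Monte Carlo update combined with state aggregation, assuming hypothetical (state, return) samples; the grouping, step size, and data are illustrative.

```python
import numpy as np

def aggregate_features(state, group_of, n_groups):
    """State aggregation: a one-hot feature for the group containing `state`."""
    x = np.zeros(n_groups)
    x[group_of[state]] = 1.0
    return x

def gradient_mc_update(w, x, G, alpha=0.1):
    """w <- w + alpha * [G - v_hat(s, w)] * grad v_hat(s, w); the gradient is x for linear v_hat."""
    return w + alpha * (G - np.dot(w, x)) * x

# Hypothetical setup: 10 states aggregated into 2 groups of 5.
group_of = {s: s // 5 for s in range(10)}
w = np.zeros(2)
samples = [(0, 1.0), (7, -1.0), (3, 0.5)]   # (state, Monte Carlo return) pairs
for s, G in samples:
    w = gradient_mc_update(w, aggregate_features(s, group_of, 2), G)
print(w)                                     # one learned value per group
```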

The objective for TD

  • In semi-gradient TD(0), we use the TD(0) target in place of the Monte Carlo return $G_t$

  • $w_{t+1} \doteq w_t + \alpha[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)]\nabla\hat{v}(S_t, w_t)$

  • This is a semi-gradient method because it treats the bootstrap target as independent of $w$

  • The resulting estimate is biased, but it typically converges faster than the Monte Carlo approach (a code sketch of this update follows below)
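
A sketch of the semi-gradient TD(0) update for a linear $\hat{v}$; the environment interaction is left abstract, and the names and step sizes are illustrative.

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, r, x_tp1, alpha=0.1, gamma=0.99, terminal=False):
    """Semi-gradient TD(0) for linear v_hat: the bootstrap target
    R + gamma * v_hat(S', w) is treated as a constant, so the
    gradient term is just x_t."""
    v_t = np.dot(w, x_t)
    v_tp1 = 0.0 if terminal else np.dot(w, x_tp1)
    td_error = r + gamma * v_tp1 - v_t
    return w + alpha * td_error * x_t
```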

Linear TD

  • With linear approximation, this will converge to a point where the expected update is zero (the TD fixed point)

  • $\mathbb{E}[\Delta w_{TD}] = \alpha(b - A w_{TD}) = 0$, where $A \doteq \mathbb{E}\big[x_t (x_t - \gamma x_{t+1})^\top\big]$ and $b \doteq \mathbb{E}[R_{t+1} x_t]$, so $w_{TD} = A^{-1} b$

  • This fixed point will be close to the minimum-error solution when $\gamma$ is small (a sketch of estimating it from samples follows below)

  • The situation with non-linear function approximation can be messier (e.g., multiple fixed points)
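
One way to make the fixed point concrete is to estimate $A$ and $b$ from sampled transitions and solve for $w_{TD}$ directly (a least-squares-TD-style sketch; the random placeholder data and ridge term are assumptions for illustration).

```python
import numpy as np

def td_fixed_point(transitions, gamma=0.99, ridge=1e-6):
    """Estimate A = E[x (x - gamma x')^T] and b = E[R x] from
    (x, r, x_next) samples, then solve A w = b for w_TD."""
    d = transitions[0][0].shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    A /= len(transitions)
    b /= len(transitions)
    return np.linalg.solve(A + ridge * np.eye(d), b)  # small ridge term for stability

# Placeholder data just to show the call shape (not a real MDP).
rng = np.random.default_rng(0)
transitions = [(rng.normal(size=3), rng.normal(), rng.normal(size=3)) for _ in range(100)]
print(td_fixed_point(transitions))
```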

Constructing features for prediction

Feature construction for linear methods

  • Coarse coding is a generalization of state aggregation where the receptive fields (regions) can overlap, allowing better discrimination

  • Tile coding is a type of coarse coding based on overlapping tilings

  • The tiles are usually square (hyper-rectangular in general), which makes the active features cheap to compute: exactly one binary feature per tiling is active (see the sketch below)
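
A simplified 1-D tile-coding sketch with uniformly offset tilings; real implementations typically hash tile indices, but this version only illustrates the overlapping receptive fields and the one-active-tile-per-tiling property.

```python
import numpy as np

def tile_code(value, low, high, n_tilings=4, tiles_per_tiling=8):
    """1-D tile coding: each tiling is offset by a fraction of the tile
    width, and exactly one tile per tiling is active, giving a sparse
    binary feature vector of length n_tilings * tiles_per_tiling."""
    tile_width = (high - low) / tiles_per_tiling
    features = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings           # shift each tiling slightly
        idx = int((value - low + offset) / tile_width)
        idx = min(max(idx, 0), tiles_per_tiling - 1)  # clamp at the boundaries
        features[t * tiles_per_tiling + idx] = 1.0
    return features

print(tile_code(0.3, low=0.0, high=1.0))   # 4 active features out of 32
```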

Control with approximation

  • SARSA

    • $w \leftarrow w + \alpha[R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$

  • Expected SARSA

    • $w \leftarrow w + \alpha[R + \gamma \sum_{a'} \pi(a' \mid S')\hat{q}(S', a', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$

  • Q-learning

    • $w \leftarrow w + \alpha[R + \gamma \max_{a'} \hat{q}(S', a', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$
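
A sketch of the three semi-gradient control updates above for a linear $\hat{q}$ with one weight vector per action; the features, next-action policy probabilities, and step sizes are placeholders, and terminal-state handling is omitted.

```python
import numpy as np

def q_hat(w, x, a):
    """Linear action-value estimate with one weight row per action."""
    return np.dot(w[a], x)

def control_update(w, x, a, r, x_next, a_next=None, pi_next=None,
                   alpha=0.1, gamma=0.99, method="sarsa"):
    """Semi-gradient control update; `method` selects the bootstrap target."""
    if method == "sarsa":                # uses the sampled next action A'
        bootstrap = q_hat(w, x_next, a_next)
    elif method == "expected_sarsa":     # expectation over pi(. | S')
        bootstrap = sum(p * q_hat(w, x_next, b) for b, p in enumerate(pi_next))
    else:                                # q-learning: max over next actions
        bootstrap = max(q_hat(w, x_next, b) for b in range(w.shape[0]))
    target = r + gamma * bootstrap
    w = w.copy()
    w[a] += alpha * (target - q_hat(w, x, a)) * x   # gradient of linear q_hat is x (row a only)
    return w
```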

Exploration under function approximation

  • Optimistic initial values

    • Generalization between states means that we may not explore all of them

    • It is not clear how to set optimistic initial values for non-linear function approximators

  • Epsilon-greedy

    • Continues to explore, but less systematically than optimistic initial values

Average reward

  • The average reward approach maximizes $r(\pi) = \lim_{h \to \infty} \frac{1}{h} \mathbb{E}\left[\sum_{t=1}^h R_t\right]$

  • Differential return: $G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \dots$
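
A sketch of a differential (average-reward) semi-gradient TD(0)-style update that maintains a running estimate $\bar{R}$ of $r(\pi)$; both step sizes are illustrative.

```python
import numpy as np

def differential_td0_update(w, r_bar, x_t, r, x_tp1, alpha=0.1, beta=0.01):
    """Differential TD(0): rewards are measured relative to the average-reward
    estimate r_bar instead of being discounted; r_bar has its own step size."""
    delta = r - r_bar + np.dot(w, x_tp1) - np.dot(w, x_t)   # differential TD error
    r_bar = r_bar + beta * delta                            # update average-reward estimate
    w = w + alpha * delta * x_t                             # semi-gradient weight update
    return w, r_bar
```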

Policy gradient

Learning parameterized policies

  • Rather than learning action values, we can learn the policy $\pi(a \mid s, \theta)$ directly

    • This must be a valid probability distribution

    • One approach is the softmax over action preferences, $\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_{b} e^{h(s,b,\theta)}}$, which converts the preferences $h(s,a,\theta)$ into probabilities (see the sketch after this list)

  • Advantages

    • Direct action selection, no action search needed

    • Works well with continuous actions

    • Natural handling of randomized policies

  • Disadvantages

    • Can get stuck in local optima

    • Needs more training samples

    • Policy structure choice is critical

    • Hard to control exploration
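
A sketch of the softmax parameterization with linear action preferences $h(s,a,\theta) = \theta_a^\top x(s)$; the action count, features, and sampling code are illustrative.

```python
import numpy as np

def softmax_policy(theta, x):
    """pi(a|s, theta) via softmax over linear preferences h(s, a, theta) = theta[a] . x(s)."""
    h = theta @ x                    # one preference per action
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))      # 3 actions, 4 state features (hypothetical sizes)
x = np.array([1.0, 0.0, 0.5, -0.5])  # placeholder state features
probs = softmax_policy(theta, x)     # a valid probability distribution over actions
action = rng.choice(len(probs), p=probs)   # direct action selection, no argmax over q
```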

Policy gradients for continuing tasks

  • Policy-gradient method for average reward

    • $\nabla r(\pi) = \nabla \sum_{s} \mu(s) \sum_{a} \pi(a \mid s,\theta) \sum_{s',r} p(s',r \mid s,a)\, r$

  • $\mu(s)$ depends on $\pi$, but the policy gradient theorem gives us a way to calculate the gradient without differentiating $\mu$

    • $\nabla r(\pi) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s,\theta)\, q_\pi(s,a)$

    • Derived by applying the product rule to $\nabla\big[\pi(a \mid s,\theta)\, q_\pi(s,a)\big]$

Actor-critic for continuing tasks

  • We can update $\theta$ using stochastic gradient ascent

    • $\theta_{t+1} = \theta_t + \alpha \frac{\nabla \pi(A_t \mid S_t,\theta_t)}{\pi(A_t \mid S_t,\theta_t)}\, q_\pi(S_t,A_t)$

    • Equivalently, since $\nabla \ln \pi = \nabla \pi / \pi$: $\theta_{t+1} = \theta_t + \alpha \nabla \ln \pi(A_t \mid S_t,\theta_t)\, q_\pi(S_t,A_t)$

  • Actor-critic policy gradient update rule

    • $\theta_{t+1} = \theta_t + \alpha \nabla \ln \pi(A_t \mid S_t,\theta_t)\,\big[R_{t+1} - \bar{R} + \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big]$
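
A sketch of one actor-critic step in the average-reward setting, combining a linear critic, a linear-softmax actor, and the differential TD error from the update rule above; all names and step sizes are illustrative.

```python
import numpy as np

def softmax_probs(theta, x):
    e = np.exp(theta @ x - (theta @ x).max())
    return e / e.sum()

def grad_ln_pi(theta, x, a):
    """For a linear-softmax actor, row b of grad ln pi(a|s) is (1[a=b] - pi(b|s)) * x."""
    probs = softmax_probs(theta, x)
    g = -np.outer(probs, x)
    g[a] += x
    return g

def actor_critic_step(theta, w, r_bar, x_t, a_t, r, x_tp1,
                      alpha_theta=0.01, alpha_w=0.1, alpha_rbar=0.01):
    """One continuing-task actor-critic update driven by the differential TD error."""
    delta = r - r_bar + np.dot(w, x_tp1) - np.dot(w, x_t)   # differential TD error
    r_bar += alpha_rbar * delta                             # average-reward estimate
    w = w + alpha_w * delta * x_t                           # critic: semi-gradient TD(0)
    theta = theta + alpha_theta * delta * grad_ln_pi(theta, x_t, a_t)  # actor
    return theta, w, r_bar
```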
