Prediction and control with function approximation
On-policy prediction with approximation
Estimating value functions with supervised learning
In linear value feature approximation, the state value is approximated as a linear combination of state features
v̂(s, w) ≐ ∑_i w_i x_i(s) = ⟨w, x(s)⟩
The tabular approach is a special case where x(s) is a one-hot indicator vector for state s
We want to be able to generalize from similar states, but still discriminate between states where relevant
Tabular methods have perfect discrimination but no generalization
Aggregating all states would result in no discrimination
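A minimal sketch of the linear form above, with the tabular case recovered by one-hot indicator features (all names and the 5-state example are illustrative, not from the source):

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate: v̂(s, w) = ⟨w, x(s)⟩."""
    return np.dot(w, x)

# Tabular special case: one-hot indicator features over a small state set.
N_STATES = 5

def one_hot(s, n=N_STATES):
    x = np.zeros(n)
    x[s] = 1.0
    return x

w = np.zeros(N_STATES)
print(v_hat(one_hot(2), w))   # with indicator features, v̂(s, w) is just w[s]
```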
The objective for on-policy prediction
Mean squared value error
∑_s μ(s) [v_π(s) − v̂(s, w)]²
μ(s) represents the importance of a given state, which will generally be the proportion of time spent there
We can optimize the objective function by combining stochastic gradient descent with Monte Carlo return estimates
w_{t+1} ≐ w_t + α [G_t − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
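A sketch of this gradient Monte Carlo update for one finished episode, assuming linear features so that ∇v̂(S_t, w) = x(S_t); the `(x_t, G_t)` episode format is an assumption for illustration:

```python
import numpy as np

def gradient_mc_update(w, episode, alpha=0.1):
    """One pass of gradient Monte Carlo prediction over a finished episode.

    episode: list of (x_t, G_t) pairs, where x_t = x(S_t) is the feature
    vector and G_t is the already-computed Monte Carlo return.
    With linear v̂, the gradient ∇v̂(S_t, w) is simply x_t.
    """
    for x_t, G_t in episode:
        v = np.dot(w, x_t)                 # v̂(S_t, w)
        w = w + alpha * (G_t - v) * x_t    # move toward the sampled return
    return w
```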
State aggregation partitions the states into sets whose values are assumed to be equal
You're basically approximating the table by a smaller table
Like tabular methods, this is a special case of linear value feature approximation
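As a sketch, state aggregation is just a feature map that emits a one-hot vector over groups; the grouping below (1000 states into 10 groups) is an illustrative choice:

```python
import numpy as np

def aggregate_features(s, n_states=1000, n_groups=10):
    """Map state index s to a one-hot vector over its group.

    All states in the same group share one weight, so their estimated
    values are identical: an approximate, smaller 'table'.
    """
    x = np.zeros(n_groups)
    x[s * n_groups // n_states] = 1.0
    return x
```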
The objective for TD
In semi-gradient TD(0), we use the TD(0) estimate for the return in place of Gt
w_{t+1} ≐ w_t + α [R_{t+1} + γ v̂(S_{t+1}, w_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
This is a semi-gradient method as it treats the return estimate as independent of w
This estimate is biased, but in practice it typically converges faster than the Monte Carlo approach
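A sketch of the semi-gradient TD(0) update for a single transition, again assuming linear features; note that only v̂(S_t, w) is differentiated, not the bootstrapped target:

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, reward, x_next, done,
                             alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) update for one transition S_t -> S_{t+1}.

    x_t, x_next: feature vectors x(S_t), x(S_{t+1}).
    The target R + γ v̂(S_{t+1}, w) is treated as a constant with respect
    to w, so the gradient term is just ∇v̂(S_t, w) = x_t.
    """
    v = np.dot(w, x_t)
    v_next = 0.0 if done else np.dot(w, x_next)
    td_error = reward + gamma * v_next - v
    return w + alpha * td_error * x_t
```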
Linear TD
With linear approximation, this will converge to a point where the expected update is zero (the TD fixed point)
E[Δw_TD] = α(b − A w_TD) = 0, so w_TD = A⁻¹b, where A ≐ E[x_t (x_t − γ x_{t+1})ᵀ] and b ≐ E[R_{t+1} x_t]
The fixed point is not the minimum of the value error, but its error is within a factor of 1/(1−γ) of the minimum, so it is close when γ is small
The situation with non-linear function approximation can be messier (e.g., multiple fixed points)
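For the linear case, the fixed point can be estimated directly from samples by solving the linear system above (an LSTD-style sketch; the `transitions` format and the ridge term are assumptions):

```python
import numpy as np

def td_fixed_point(transitions, d, gamma=0.99, eps=1e-6):
    """Estimate w_TD = A^{-1} b from sampled transitions.

    transitions: iterable of (x_t, r, x_next) with feature vectors of size d.
    A ≈ E[x_t (x_t - γ x_{t+1})^T],  b ≈ E[R_{t+1} x_t].
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    n = 0
    for x_t, r, x_next in transitions:
        A += np.outer(x_t, x_t - gamma * x_next)
        b += r * x_t
        n += 1
    A, b = A / n, b / n
    # A small ridge term keeps the sample estimate of A invertible.
    return np.linalg.solve(A + eps * np.eye(d), b)
```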
Constructing features for prediction
Feature construction for linear methods
Coarse coding is a generalization of state aggregation where the receptive fields (regions) can overlap, allowing better discrimination
Tile coding is a type of coarse coding based on overlapping tilings
The tiles are usually square (uniform grids), in which case exactly one tile per tiling is active for any state and the active features are cheap to compute
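A sketch of tile coding for a 2-D state, with uniformly offset grid tilings (the offset scheme and sizes are illustrative assumptions):

```python
import numpy as np

def tile_indices(x, y, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Return the index of the active tile in each tiling for a 2-D state.

    Each tiling is a uniform grid over [low, high)^2, offset by a fraction
    of a tile width; exactly one tile per tiling is active, so the binary
    feature vector has n_tilings ones regardless of the total tile count.
    """
    width = (high - low) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings
        ix = int((x - low + offset) / width) % tiles_per_dim
        iy = int((y - low + offset) / width) % tiles_per_dim
        # Flatten (tiling, ix, iy) into a single feature index.
        active.append(t * tiles_per_dim**2 + iy * tiles_per_dim + ix)
    return active
```

With binary features, v̂(s, w) is simply the sum of the weights at the active indices.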
Control with approximation
SARSA
w ← w + α [R + γ q̂(S′, A′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
Expected SARSA
w ← w + α [R + γ ∑_{a′} π(a′∣S′) q̂(S′, a′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
Q-learning
w ← w + α [R + γ max_{a′} q̂(S′, a′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
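A sketch of one episode of semi-gradient SARSA with a linear q̂ and ε-greedy action selection; the Gym-style environment interface and the state-action feature map are assumptions:

```python
import numpy as np

def epsilon_greedy(w, features, state, n_actions, eps=0.1):
    """ε-greedy action selection; features(s, a) is an assumed feature map."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    q = [np.dot(w, features(state, a)) for a in range(n_actions)]
    return int(np.argmax(q))

def semi_gradient_sarsa_episode(env, w, features, n_actions,
                                alpha=0.1, gamma=0.99, eps=0.1):
    """One episode of semi-gradient SARSA with linear q̂ (Gym-style env assumed)."""
    state, _ = env.reset()
    action = epsilon_greedy(w, features, state, n_actions, eps)
    while True:
        next_state, reward, terminated, truncated, _ = env.step(action)
        x = features(state, action)
        if terminated:
            # No bootstrapping from a terminal state.
            w = w + alpha * (reward - np.dot(w, x)) * x
            break
        next_action = epsilon_greedy(w, features, next_state, n_actions, eps)
        target = reward + gamma * np.dot(w, features(next_state, next_action))
        w = w + alpha * (target - np.dot(w, x)) * x   # ∇q̂ = x(s, a) for linear q̂
        if truncated:
            break
        state, action = next_state, next_action
    return w
```

Replacing the bootstrapped term with the expectation under π gives Expected SARSA, and with max_{a′} q̂ gives Q-learning.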
Exploration under function approximation
Optimistic initial values
Generalization between states means that we may not explore all of them
It is not clear how to set optimistic initial values with non-linear function approximators
Epsilon-greedy
Continues to explore, but less systematically than optimistic initial values
Average reward
The average reward approach maximizes r(π) ≐ lim_{h→∞} (1/h) ∑_{t=1}^{h} E[R_t]
Differential return: G_t ≐ (R_{t+1} − r(π)) + (R_{t+2} − r(π)) + (R_{t+3} − r(π)) + …
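A sketch of the corresponding differential (average-reward) TD(0) update, assuming a linear v̂ and illustrative step sizes; R̄ is a learned estimate of r(π):

```python
import numpy as np

def differential_td0_update(w, r_bar, x_t, reward, x_next,
                            alpha=0.1, beta=0.01):
    """Differential semi-gradient TD(0) for one transition (average-reward setting).

    r_bar is the running estimate of the average reward r(π); the TD error
    uses (R - R̄) in place of the discounted target.
    """
    delta = reward - r_bar + np.dot(w, x_next) - np.dot(w, x_t)
    r_bar = r_bar + beta * delta          # update the average-reward estimate
    w = w + alpha * delta * x_t           # semi-gradient step, ∇v̂ = x_t
    return w, r_bar
```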
Policy gradient
Learning parameterized policies
Rather than learning action values, we can learn the policy π(a∣s,θ) directly
This must be a valid probability distribution
One approach is to use the softmax function π(a∣s,θ) = e^{h(s,a,θ)} / ∑_{b∈A} e^{h(s,b,θ)} to convert action preferences h(s,a,θ) into probabilities
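A sketch of a softmax policy over linear action preferences h(s, a, θ) = ⟨θ, x(s, a)⟩; the feature map and the linear form of h are assumptions:

```python
import numpy as np

def softmax_policy(theta, features, state, n_actions):
    """Action probabilities from linear preferences h(s, a, θ) = ⟨θ, x(s, a)⟩.

    features(s, a) is an assumed state-action feature map; subtracting the
    max preference keeps the exponentials numerically stable.
    """
    h = np.array([np.dot(theta, features(state, a)) for a in range(n_actions)])
    e = np.exp(h - h.max())
    return e / e.sum()

# Sampling an action a ~ π(·|s, θ):
# probs = softmax_policy(theta, features, state, n_actions)
# action = np.random.choice(n_actions, p=probs)
```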
Advantages
Direct action selection, no action search needed
Works well with continuous actions
Natural handling of randomized policies
Disadvantages
Can get stuck in local optima
Often needs more training samples than value-based methods
Policy structure choice is critical
Hard to control exploration
Policy gradients for continuing tasks
Policy-gradient method for average reward
∇r(π) = ∇ ∑_s μ(s) ∑_a π(a∣s,θ) ∑_{s′,r} p(s′,r∣s,a) r
μ(s) depends on π, but the policy gradient theorem gives us a way to calculate the gradient without having to differentiate μ
∇r(π) = ∑_s μ(s) ∑_a ∇π(a∣s,θ) q_π(s,a)
Derived from the product rule
Actor-critic for continuing tasks
We can update θ using stochastic gradient ascent
θ_{t+1} = θ_t + α (∇π(A_t∣S_t, θ_t) / π(A_t∣S_t, θ_t)) q_π(S_t, A_t)
θ_{t+1} = θ_t + α ∇ln π(A_t∣S_t, θ_t) q_π(S_t, A_t)
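The two forms are equal by the log-derivative identity; dividing by π(A_t∣S_t, θ) corrects for how often each action is sampled under π, so the sampled update matches the policy gradient theorem in expectation:

```latex
\nabla \ln \pi(A_t \mid S_t, \theta)
  = \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)},
\qquad
\mathbb{E}_{A_t \sim \pi(\cdot \mid S_t, \theta)}
  \!\left[ \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}
  \, q_\pi(S_t, A_t) \right]
  = \sum_a \nabla \pi(a \mid S_t, \theta) \, q_\pi(S_t, a)
```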
Actor-critic policy gradient update rule
θ_{t+1} = θ_t + α ∇ln π(A_t∣S_t, θ_t) [R_{t+1} − R̄ + v̂(S_{t+1}, w) − v̂(S_t, w)]
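A sketch of one step of this actor-critic update in the continuing setting, with a linear critic and an externally supplied ∇ln π; all names and step sizes are illustrative:

```python
import numpy as np

def actor_critic_step(theta, w, r_bar, x_t, x_next, reward,
                      grad_log_pi, alpha_theta=0.01, alpha_w=0.1, beta=0.01):
    """One actor-critic update in the continuing (average-reward) setting.

    x_t, x_next: state feature vectors for a linear critic v̂(s, w) = ⟨w, x(s)⟩.
    grad_log_pi: ∇ ln π(A_t | S_t, θ), supplied by the policy parameterization.
    """
    # Differential TD error: R - R̄ + v̂(S') - v̂(S)
    delta = reward - r_bar + np.dot(w, x_next) - np.dot(w, x_t)
    r_bar = r_bar + beta * delta                          # average-reward estimate
    w = w + alpha_w * delta * x_t                         # critic: semi-gradient TD
    theta = theta + alpha_theta * delta * grad_log_pi     # actor: policy gradient
    return theta, w, r_bar
```

For the softmax policy with linear preferences above, ∇ln π(a∣s, θ) = x(s, a) − ∑_b π(b∣s, θ) x(s, b).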