Prediction and control with function approximation
On-policy prediction with approximation
Estimating value functions with supervised learning
In linear value function approximation, the state value is approximated as a linear combination of state features: v̂(s, w) = wᵀ x(s)
The tabular approach is a special case where each feature is an indicator function for a single state, so x(s) is a one-hot vector
We want to be able to generalize from similar states, but still discriminate between states where relevant
Tabular methods have perfect discrimination but no generalization
Aggregating all states would result in no discrimination
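A minimal sketch of the linear form above, assuming a small discrete state space (the names n_states, x_tabular, and v_hat are illustrative, not from the source); with one-hot features the linear approximator reduces to a table lookup, giving perfect discrimination but no generalization:

```python
import numpy as np

n_states = 5
n_features = n_states            # one feature per state in the tabular case

def x_tabular(s):
    """One-hot features: perfect discrimination, no generalization."""
    features = np.zeros(n_features)
    features[s] = 1.0
    return features

w = np.zeros(n_features)

def v_hat(s):
    # v_hat(s, w) = w . x(s); with one-hot features this is a table lookup
    return w @ x_tabular(s)
```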
The objective for on-policy prediction
Mean squared value error: VE(w) = Σ_s μ(s) [v_π(s) − v̂(s, w)]²
μ(s) represents the importance of a given state, which will generally be the proportion of time spent there under the policy
We can optimize the objective function by combining stochastic gradient descent with Monte Carlo return estimates
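A sketch of the resulting gradient Monte Carlo update under these assumptions; the feature map x and the (state, reward) episode format are illustrative choices, not prescribed by the source:

```python
import numpy as np

def gradient_mc_update(w, episode, x, alpha=0.01, gamma=1.0):
    """One gradient-MC sweep over an episode of (S_t, R_{t+1}) pairs.

    x(s) maps a state to a feature vector.
    Update: w <- w + alpha * [G_t - w.x(S_t)] * x(S_t)
    """
    G = 0.0
    # Work backwards so the return G_t can be accumulated incrementally.
    for s, r in reversed(episode):
        G = r + gamma * G
        w += alpha * (G - w @ x(s)) * x(s)
    return w
```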
State aggregation partitions the states into sets whose values are assumed to be equal
You're basically approximating the table by a smaller table
Like tabular methods, this is a special case of linear value feature approximation
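A sketch of state aggregation as a linear approximator, assuming 100 states grouped into 10 aggregates (the names x_group and v_hat are illustrative); the 10-entry weight vector plays the role of the smaller table:

```python
import numpy as np

n_states, n_groups = 100, 10     # e.g. 100 states aggregated into 10 groups

def x_group(s):
    """Group-indicator features: every state in a group shares one weight."""
    features = np.zeros(n_groups)
    features[s * n_groups // n_states] = 1.0
    return features

w = np.zeros(n_groups)
v_hat = lambda s: w @ x_group(s)   # the 10-entry w is the "smaller table"
```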
The objective for TD
In semi-gradient TD(0), we use the TD(0) target R_{t+1} + γ v̂(S_{t+1}, w) for the return in place of the Monte Carlo return G_t
This is a semi-gradient method as it treats the return estimate as independent of the weights w when taking the gradient
This estimate is biased, but it converges faster than the Monte Carlo approach
Linear TD
With linear approximation, this will converge to a point where the expected update is zero (the TD fixed point)
This will be close to the true minimum of the value error when the discount factor γ is small, since VE(w_TD) ≤ (1 / (1 − γ)) min_w VE(w)
The situation with non-linear function approximation can be messier (e.g., multiple fixed points)
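A sketch of one semi-gradient TD(0) step with linear features, using the same illustrative conventions as above (caller-supplied feature map x); note the target is treated as a constant when the gradient is taken:

```python
def semi_gradient_td0_update(w, s, r, s_next, done, x, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step with linear features x(s).

    The target r + gamma * w.x(s') is treated as a constant (no gradient
    flows through it), which is what makes the method "semi"-gradient.
    """
    v_next = 0.0 if done else w @ x(s_next)
    td_error = r + gamma * v_next - w @ x(s)
    w += alpha * td_error * x(s)
    return w
```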
Constructing features for prediction
Feature construction for linear methods
Coarse coding is a generalization of state aggregation where the receptive fields (regions) can overlap, allowing better discrimination
Tile coding is a type of coarse coding based on overlapping tilings
The tiles are usually square, in which case this approach is computationally efficient
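A rough sketch of tile coding for a one-dimensional state, assuming uniformly offset tilings and illustrative parameter names (n_tilings, tiles_per_dim); real implementations often add hashing, but the idea of overlapping receptive fields is the same:

```python
import numpy as np

def tile_indices(state, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Indices of the active tile in each tiling for a 1-D state.

    Each tiling is offset by a fraction of a tile width, so the overlapping
    receptive fields discriminate better than a single coarse grid.
    """
    scaled = (state - low) / (high - low) * tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = t / n_tilings                        # simple uniform offsets
        idx = int(np.clip(np.floor(scaled + offset), 0, tiles_per_dim))
        active.append(t * (tiles_per_dim + 1) + idx)  # flat index per tiling
    return active

# The approximate value is the sum of the weights of the active tiles:
# v_hat(s) = sum(w[i] for i in tile_indices(s))
```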
Control with approximation
SARSA
Expected SARSA
Q-learning
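A sketch of the semi-gradient SARSA update with a linear action-value function, using an assumed state-action feature map x_sa; the comment notes how Expected SARSA and Q-learning differ only in the target:

```python
def semi_gradient_sarsa_update(w, x_sa, s, a, r, s_next, a_next, done,
                               alpha=0.1, gamma=0.99):
    """One semi-gradient SARSA step with linear q_hat(s, a, w) = w.x_sa(s, a).

    Swapping the target gives the other control methods:
      Expected SARSA: r + gamma * sum_a' pi(a'|s') * q_hat(s', a')
      Q-learning:     r + gamma * max_a' q_hat(s', a')
    """
    q_next = 0.0 if done else w @ x_sa(s_next, a_next)
    td_error = r + gamma * q_next - w @ x_sa(s, a)
    w += alpha * td_error * x_sa(s, a)
    return w
```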
Exploration under function approximation
Optimistic initial values
Because of generalization between states, optimism can be washed out by updates to similar states, so we may not explore all of them
It is not clear how to set optimistic initial values for non-linear function approximators
Epsilon-greedy
Continues to explore, but less systematically than optimistic initial values
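A minimal epsilon-greedy selection over approximate action values, reusing the assumed x_sa feature map from the control sketch above:

```python
import numpy as np

def epsilon_greedy(w, x_sa, s, actions, epsilon=0.1, rng=np.random.default_rng()):
    """Epsilon-greedy action selection over approximate action values."""
    if rng.random() < epsilon:
        return rng.choice(actions)               # explore: random action
    q = [w @ x_sa(s, a) for a in actions]        # exploit: greedy w.r.t. q_hat
    return actions[int(np.argmax(q))]
```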
Average reward
The average reward approach maximizes the average reward rate r(π) = lim_{h→∞} (1/h) Σ_{t=1}^{h} E[R_t | A_{0:t−1} ~ π]
Differential return: rewards are measured relative to the average, G_t = (R_{t+1} − r(π)) + (R_{t+2} − r(π)) + (R_{t+3} − r(π)) + ⋯
Policy gradient
Learning parameterized policies
Rather than learning action values, we can learn the policy directly
This must be a valid probability distribution
One approach is to use the softmax function to convert action preferences into probabilities
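A sketch of a softmax policy over linear action preferences h(s, a, θ) = θᵀ x_sa(s, a), with x_sa again an assumed feature map:

```python
import numpy as np

def softmax_policy(theta, x_sa, s, actions):
    """Action probabilities from linear action preferences h(s, a) = theta.x_sa(s, a)."""
    h = np.array([theta @ x_sa(s, a) for a in actions])
    h -= h.max()                       # subtract max for numerical stability
    p = np.exp(h)
    return p / p.sum()                 # a valid probability distribution
```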
Advantages
Direct action selection, no action search needed
Works well with continuous actions
Natural handling of randomized policies
Disadvantages
Can get stuck in local optima
Needs more training samples
Policy structure choice is critical
Hard to control exploration
Policy gradients for continuing tasks
Policy-gradient method for average reward
r(π) depends on the policy parameters θ, but the policy gradient theorem gives us a way to calculate the gradient ∇_θ r(π)
Derived from the product rule
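For reference, the average-reward form of the policy gradient theorem written out (standard statement; the notation is assumed to match the notes above):

```latex
% Policy gradient theorem (average-reward form): the gradient of r(\pi)
% does not require differentiating the state distribution \mu(s).
\nabla_{\boldsymbol{\theta}}\, r(\pi)
  = \sum_{s} \mu(s) \sum_{a}
      \nabla_{\boldsymbol{\theta}}\, \pi(a \mid s, \boldsymbol{\theta})\, q_\pi(s, a)
  = \mathbb{E}_{\pi}\!\left[ q_\pi(S, A)\,
      \nabla_{\boldsymbol{\theta}} \ln \pi(A \mid S, \boldsymbol{\theta}) \right]
```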
Actor-critic for continuing tasks
We can update the policy parameters θ using stochastic gradient ascent on r(π)
Actor-critic policy gradient update rule: θ ← θ + α δ ∇_θ ln π(A | S, θ), where δ is the differential TD error
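A sketch of one actor-critic step in the continuing (average-reward) setting, assuming a critic feature map x and a caller-supplied grad_log_pi for ∇_θ ln π; the step sizes are illustrative:

```python
import numpy as np

def actor_critic_step(theta, w, r_bar, s, a, r, s_next,
                      x, grad_log_pi,
                      alpha_theta=0.01, alpha_w=0.05, alpha_rbar=0.01):
    """One actor-critic update for the continuing (average-reward) setting.

    x(s): critic features;  grad_log_pi(s, a, theta): gradient of ln pi(a|s,theta).
    """
    # Differential TD error: rewards are measured relative to the average reward.
    delta = r - r_bar + w @ x(s_next) - w @ x(s)
    r_bar += alpha_rbar * delta                              # average-reward estimate
    w += alpha_w * delta * x(s)                              # critic (semi-gradient TD)
    theta += alpha_theta * delta * grad_log_pi(s, a, theta)  # actor (gradient ascent)
    return theta, w, r_bar
```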