Prediction and control with function approximation

On-policy prediction with approximation

Estimating value functions with supervised learning

  • In linear value-function approximation, the state value is approximated as a linear combination of state features (see the sketch after this list)

    • $\hat{v}(s, w) \doteq \sum_i w_i x_i(s) = \langle w, x(s) \rangle$

    • The tabular approach is a special case where $x(s)$ is an indicator (one-hot) vector for the state

  • We want to be able to generalize from similar states, but still discriminate between states where relevant

    • Tabular methods have perfect discrimination but no generalization

    • Aggregating all states would result in no discrimination
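
A minimal sketch of the linear form above in NumPy; the state count, feature vectors, and weights are illustrative placeholders, not from the source.

```python
import numpy as np

def linear_value(w, x):
    """Linear value estimate: v_hat(s, w) = <w, x(s)>."""
    return np.dot(w, x)

# Tabular case as a special case: x(s) is a one-hot indicator of the state.
n_states = 5                       # hypothetical number of states
w = np.zeros(n_states)             # one weight per state = one table entry
x_s2 = np.eye(n_states)[2]         # indicator feature vector for state 2
print(linear_value(w, x_s2))       # value of state 2 (0.0 before any learning)
```

Note that the gradient of this $\hat{v}$ with respect to $w$ is just $x(s)$, which is what the update rules below rely on.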

The objective for on-policy prediction

  • Mean squared value error

    • $\sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s, w)]^2$

    • μ(s)\mu(s) represents the importance of a given state, which will generally be the proportion of time spent there

  • We can optimize the objective function by combining stochastic gradient descent with Monte Carlo return estimates

    • $w_{t+1} \doteq w_t + \alpha[G_t - \hat{v}(S_t, w_t)]\nabla\hat{v}(S_t, w_t)$

  • State aggregation partitions the states into sets whose values are assumed to be equal

    • You're basically approximating the table by a smaller table

    • Like tabular methods, state aggregation is a special case of linear value-function approximation (a gradient Monte Carlo sketch using it follows below)
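
A minimal sketch of the gradient Monte Carlo update combined with state aggregation, assuming hypothetical (state, return) samples; the grouping, step size, and data are illustrative.

```python
import numpy as np

def aggregate_features(state, group_of, n_groups):
    """State aggregation: a one-hot feature for the group containing `state`."""
    x = np.zeros(n_groups)
    x[group_of[state]] = 1.0
    return x

def gradient_mc_update(w, x, G, alpha=0.1):
    """w <- w + alpha * [G - v_hat(s, w)] * grad v_hat(s, w); the gradient is x for linear v_hat."""
    return w + alpha * (G - np.dot(w, x)) * x

# Hypothetical setup: 10 states aggregated into 2 groups of 5.
group_of = {s: s // 5 for s in range(10)}
w = np.zeros(2)
samples = [(0, 1.0), (7, -1.0), (3, 0.5)]   # (state, Monte Carlo return) pairs
for s, G in samples:
    w = gradient_mc_update(w, aggregate_features(s, group_of, 2), G)
print(w)                                     # one learned value per group
```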

The objective for TD

  • In semi-gradient TD(0), we use the TD(0) target in place of the Monte Carlo return $G_t$

  • $w_{t+1} \doteq w_t + \alpha[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)]\nabla\hat{v}(S_t, w_t)$

  • This is a semi-gradient method because it treats the bootstrap target as independent of $w$

  • The resulting estimate is biased, but it typically converges faster than the Monte Carlo approach (a code sketch of this update follows below)
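
A sketch of the semi-gradient TD(0) update for a linear $\hat{v}$; the environment interaction is left abstract, and the names and step sizes are illustrative.

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, r, x_tp1, alpha=0.1, gamma=0.99, terminal=False):
    """Semi-gradient TD(0) for linear v_hat: the bootstrap target
    R + gamma * v_hat(S', w) is treated as a constant, so the
    gradient term is just x_t."""
    v_t = np.dot(w, x_t)
    v_tp1 = 0.0 if terminal else np.dot(w, x_tp1)
    td_error = r + gamma * v_tp1 - v_t
    return w + alpha * td_error * x_t
```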

Linear TD

  • With linear approximation, this will converge to a point where the expected update is zero (the TD fixed point)

  • $\mathbb{E}[\Delta w_{TD}] = \alpha(b - A w_{TD}) = 0$, where $A \doteq \mathbb{E}\big[x_t (x_t - \gamma x_{t+1})^\top\big]$ and $b \doteq \mathbb{E}[R_{t+1} x_t]$, so $w_{TD} = A^{-1} b$

  • This fixed point will be close to the minimum-error solution when $\gamma$ is small (a sketch of estimating it from samples follows below)

  • The situation with non-linear function approximation can be messier (e.g., multiple fixed points)
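
One way to make the fixed point concrete is to estimate $A$ and $b$ from sampled transitions and solve for $w_{TD}$ directly (a least-squares-TD-style sketch; the random placeholder data and ridge term are assumptions for illustration).

```python
import numpy as np

def td_fixed_point(transitions, gamma=0.99, ridge=1e-6):
    """Estimate A = E[x (x - gamma x')^T] and b = E[R x] from
    (x, r, x_next) samples, then solve A w = b for w_TD."""
    d = transitions[0][0].shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    A /= len(transitions)
    b /= len(transitions)
    return np.linalg.solve(A + ridge * np.eye(d), b)  # small ridge term for stability

# Placeholder data just to show the call shape (not a real MDP).
rng = np.random.default_rng(0)
transitions = [(rng.normal(size=3), rng.normal(), rng.normal(size=3)) for _ in range(100)]
print(td_fixed_point(transitions))
```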

Constructing features for prediction

Feature construction for linear methods

  • Coarse coding is a generalization of state aggregation where the receptive fields (regions) can overlap, allowing better discrimination

  • Tile coding is a type of coarse coding based on overlapping tilings

  • The tiles are usually square (hyper-rectangular in general), which makes the active features cheap to compute: exactly one binary feature per tiling is active (see the sketch below)
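
A simplified 1-D tile-coding sketch with uniformly offset tilings; real implementations typically hash tile indices, but this version only illustrates the overlapping receptive fields and the one-active-tile-per-tiling property.

```python
import numpy as np

def tile_code(value, low, high, n_tilings=4, tiles_per_tiling=8):
    """1-D tile coding: each tiling is offset by a fraction of the tile
    width, and exactly one tile per tiling is active, giving a sparse
    binary feature vector of length n_tilings * tiles_per_tiling."""
    tile_width = (high - low) / tiles_per_tiling
    features = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings           # shift each tiling slightly
        idx = int((value - low + offset) / tile_width)
        idx = min(max(idx, 0), tiles_per_tiling - 1)  # clamp at the boundaries
        features[t * tiles_per_tiling + idx] = 1.0
    return features

print(tile_code(0.3, low=0.0, high=1.0))   # 4 active features out of 32
```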

Control with approximation

  • SARSA

    • $w \leftarrow w + \alpha[R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$

  • Expected SARSA

    • $w \leftarrow w + \alpha[R + \gamma \sum_{a'} \pi(a' \mid S')\hat{q}(S', a', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$

  • Q-learning

    • $w \leftarrow w + \alpha[R + \gamma \max_{a'} \hat{q}(S', a', w) - \hat{q}(S, A, w)]\nabla\hat{q}(S, A, w)$
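
A sketch of the three semi-gradient control updates above for a linear $\hat{q}$ with one weight vector per action; the features, next-action policy probabilities, and step sizes are placeholders, and terminal-state handling is omitted.

```python
import numpy as np

def q_hat(w, x, a):
    """Linear action-value estimate with one weight row per action."""
    return np.dot(w[a], x)

def control_update(w, x, a, r, x_next, a_next=None, pi_next=None,
                   alpha=0.1, gamma=0.99, method="sarsa"):
    """Semi-gradient control update; `method` selects the bootstrap target."""
    if method == "sarsa":                # uses the sampled next action A'
        bootstrap = q_hat(w, x_next, a_next)
    elif method == "expected_sarsa":     # expectation over pi(. | S')
        bootstrap = sum(p * q_hat(w, x_next, b) for b, p in enumerate(pi_next))
    else:                                # q-learning: max over next actions
        bootstrap = max(q_hat(w, x_next, b) for b in range(w.shape[0]))
    target = r + gamma * bootstrap
    w = w.copy()
    w[a] += alpha * (target - q_hat(w, x, a)) * x   # gradient of linear q_hat is x (row a only)
    return w
```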

Exploration under function approximation

  • Optimistic initial values

    • Generalization between states means that we may not explore all of them

    • It is not clear how to set optimistic initial values for non-linear function approximators

  • Epsilon-greedy

    • Continues to explore, but less systematically than optimistic initial values

Average reward

  • The average reward approach maximizes $r(\pi) = \lim_{h \to \infty} \frac{1}{h} \mathbb{E}\left[\sum_{t=1}^h R_t\right]$

  • Differential return: $G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \dots$
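
A sketch of a differential (average-reward) semi-gradient TD(0)-style update that maintains a running estimate $\bar{R}$ of $r(\pi)$; both step sizes are illustrative.

```python
import numpy as np

def differential_td0_update(w, r_bar, x_t, r, x_tp1, alpha=0.1, beta=0.01):
    """Differential TD(0): rewards are measured relative to the average-reward
    estimate r_bar instead of being discounted; r_bar has its own step size."""
    delta = r - r_bar + np.dot(w, x_tp1) - np.dot(w, x_t)   # differential TD error
    r_bar = r_bar + beta * delta                            # update average-reward estimate
    w = w + alpha * delta * x_t                             # semi-gradient weight update
    return w, r_bar
```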

Policy gradient

Learning parameterized policies

  • Rather than learning action values, we can learn the policy $\pi(a \mid s, \theta)$ directly

    • This must be a valid probability distribution

    • One approach is the softmax over action preferences, $\pi(a \mid s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_{b} e^{h(s,b,\theta)}}$, which converts the preferences $h(s,a,\theta)$ into probabilities (see the sketch after this list)

  • Advantages

    • Direct action selection, no action search needed

    • Works well with continuous actions

    • Natural handling of randomized policies

  • Disadvantages

    • Can get stuck in local optima

    • Needs more training samples

    • Policy structure choice is critical

    • Hard to control exploration
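
A sketch of the softmax parameterization with linear action preferences $h(s,a,\theta) = \theta_a^\top x(s)$; the action count, features, and sampling code are illustrative.

```python
import numpy as np

def softmax_policy(theta, x):
    """pi(a|s, theta) via softmax over linear preferences h(s, a, theta) = theta[a] . x(s)."""
    h = theta @ x                    # one preference per action
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))      # 3 actions, 4 state features (hypothetical sizes)
x = np.array([1.0, 0.0, 0.5, -0.5])  # placeholder state features
probs = softmax_policy(theta, x)     # a valid probability distribution over actions
action = rng.choice(len(probs), p=probs)   # direct action selection, no argmax over q
```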

Policy gradients for continuing tasks

  • Policy-gradient method for average reward

    • $\nabla r(\pi) = \nabla \sum_{s} \mu(s) \sum_{a} \pi(a \mid s,\theta) \sum_{s',r} p(s',r \mid s,a)\, r$

  • $\mu(s)$ depends on $\pi$, but the policy gradient theorem gives us a way to calculate the gradient without differentiating $\mu$

    • $\nabla r(\pi) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s,\theta)\, q_\pi(s,a)$

    • Derived by applying the product rule to $\nabla\big[\pi(a \mid s,\theta)\, q_\pi(s,a)\big]$

Actor-critic for continuing tasks

  • We can update $\theta$ using stochastic gradient ascent

    • $\theta_{t+1} = \theta_t + \alpha \frac{\nabla \pi(A_t \mid S_t,\theta_t)}{\pi(A_t \mid S_t,\theta_t)}\, q_\pi(S_t,A_t)$

    • Equivalently, since $\nabla \ln \pi = \nabla \pi / \pi$: $\theta_{t+1} = \theta_t + \alpha \nabla \ln \pi(A_t \mid S_t,\theta_t)\, q_\pi(S_t,A_t)$

  • Actor-critic policy gradient update rule

    • $\theta_{t+1} = \theta_t + \alpha \nabla \ln \pi(A_t \mid S_t,\theta_t)\,\big[R_{t+1} - \bar{R} + \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big]$
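
A sketch of one actor-critic step in the average-reward setting, combining a linear critic, a linear-softmax actor, and the differential TD error from the update rule above; all names and step sizes are illustrative.

```python
import numpy as np

def softmax_probs(theta, x):
    e = np.exp(theta @ x - (theta @ x).max())
    return e / e.sum()

def grad_ln_pi(theta, x, a):
    """For a linear-softmax actor, row b of grad ln pi(a|s) is (1[a=b] - pi(b|s)) * x."""
    probs = softmax_probs(theta, x)
    g = -np.outer(probs, x)
    g[a] += x
    return g

def actor_critic_step(theta, w, r_bar, x_t, a_t, r, x_tp1,
                      alpha_theta=0.01, alpha_w=0.1, alpha_rbar=0.01):
    """One continuing-task actor-critic update driven by the differential TD error."""
    delta = r - r_bar + np.dot(w, x_tp1) - np.dot(w, x_t)   # differential TD error
    r_bar += alpha_rbar * delta                             # average-reward estimate
    w = w + alpha_w * delta * x_t                           # critic: semi-gradient TD(0)
    theta = theta + alpha_theta * delta * grad_ln_pi(theta, x_t, a_t)  # actor
    return theta, w, r_bar
```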
