Fundamentals of reinforcement learning
An introduction to sequential decision making
The k-armed bandit problem
A k-armed bandit is a sequential decision-making problem where you repeatedly choose between k different options (arms) with unknown reward distributions
The value of an action is defined to be the expected reward
Our aim is to choose the action that maximizes this
What to learn? Estimating action values
The sample-average method estimates the action value as the sample mean
The greedy action is the one with the highest estimated value
The sample mean can be updated incrementally, so we don't need to store all the observations: $Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
This can be generalized to $Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$
$\alpha$ is known as the step size
For the sample-average method, $\alpha = \frac{1}{n}$
A constant step size gives exponential smoothing (an exponential, recency-weighted average of past rewards)
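A minimal sketch of the incremental update (the class name and values are illustrative, not from the notes): both the sample-average and constant-step-size rules have the form $Q \leftarrow Q + \alpha(R - Q)$.

```python
class ActionValueEstimate:
    """Incremental estimate of one action's value (illustrative helper)."""

    def __init__(self, initial_value=0.0, step_size=None):
        # step_size=None -> sample-average (alpha = 1/n); otherwise a constant alpha
        self.q = initial_value
        self.n = 0
        self.step_size = step_size

    def update(self, reward):
        self.n += 1
        alpha = 1.0 / self.n if self.step_size is None else self.step_size
        self.q += alpha * (reward - self.q)  # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
        return self.q


sample_avg = ActionValueEstimate()              # tracks the sample mean exactly
smoothed = ActionValueEstimate(step_size=0.1)   # exponential recency weighting
for r in [1.0, 0.0, 2.0]:
    sample_avg.update(r)
    smoothed.update(r)
```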
Exploration vs. exploitation trade-off
Epsilon-greedy strategy
Explore (pick a random action) with probability $\varepsilon$ and exploit (pick the greedy action) with probability $1 - \varepsilon$; see the action-selection sketch below
Optimistic initial values
Setting the initial estimates $Q_1(a)$ really high for all actions results in exploration at the beginning, transitioning to exploitation later on
Not well suited for non-stationary problems
May not know what would count as optimistic until we've seen actual rewards
Upper-confidence bound
Choose the action whose confidence interval has the greatest upper bound: $A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$, where $N_t(a)$ is the number of times action $a$ has been taken
This encourages exploring actions that have been tried less often and so may still turn out to have high value
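A minimal sketch of the two action-selection rules above, assuming the estimates `q_values` and visit counts `counts` are maintained elsewhere (e.g. with the incremental update sketched earlier); the exploration constant `c` is a free parameter.

```python
import math
import random


def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the greedy action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb(q_values, counts, t, c=2.0):
    # Pick the action with the largest upper confidence bound
    # Q_t(a) + c * sqrt(ln t / N_t(a)); untried actions are selected first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q_values)),
               key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]))
```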
A contextual bandit is an extension of the k-armed bandit where the agent receives additional contextual information (features/state) before making each decision, and the rewards depend on both the chosen action and the context
Markov decision processes
Introduction to Markov decision processes
Markov decision processes introduce the idea of a state, with the next state and reward depending stochastically on the current state and action (but not on earlier states; the process has the Markov property)
This is still a sequential process with discrete time steps
Goal of reinforcement learning
The return is defined here as the sum of the rewards received after time $t$: $G_t = R_{t+1} + R_{t+2} + \dots + R_T$; the goal is to maximize its expected value
This assumes that the task is episodic (it reaches some terminal state at a final time step $T$); the sum may not converge otherwise
Reward hypothesis: "All goals can be described by the maximization of expected cumulative rewards"
Goals as rewards
Goal-reward representation: 1 if goal met, 0 otherwise
Action-penalty representation: -1 for each action that doesn't achieve the goal
Provides an incentive to reach the goal more quickly, but runs into issues if it's possible to get stuck and never achieve the goal
It may be difficult for the agent to learn without intermediate rewards (if the search space is too large then it may never meet the goal without some guidance along the way)
Reward sources
Programming
Coding
Human in the loop
Humans may assign rewards in a non-stationary way, which can cause issues
Examples
Mimic reward
Inverse reinforcement learning: Reward function is inferred from actions (so we're going from behaviour to rewards rather than from rewards to behaviour)
Derived indirectly through optimization
Evolutionary optimization
Meta reinforcement learning (learning algorithm is learned by training across multiple tasks)
Continuing tasks
Continuing tasks have no terminal states and cannot be broken up into independent episodes
Adding a discount factor $\gamma \in [0, 1)$ to the sum in the return calculation ensures that the sum still converges (for bounded rewards): $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$
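A minimal sketch of computing a discounted return from a recorded reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    # Work backwards so each step applies G = r + gamma * G.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# e.g. rewards of 1 for three steps with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], 0.9))
```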
Value functions and Bellman equations
Policies and value functions
Deterministic policy: $\pi(s) = a$, a mapping from each state to the action taken there
Stochastic policy: $\pi(a \mid s)$, the probability of taking action $a$ in state $s$
State-value function: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
That is, $v_\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$
Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
Bellman equations
State-value Bellman equation: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
Action-value Bellman equation: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$
Backup (or update) operations transfer value information back to a state (or state-action pair) from its successor states (or state-action pairs)
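For a fixed policy in a small finite MDP, the state-value Bellman equation is a linear system, so $v_\pi$ can be found directly; a minimal sketch with a hypothetical two-state example (the transition matrix and rewards are made up for illustration):

```python
import numpy as np

gamma = 0.9
# P_pi[s, s'] = probability of moving to s' under the policy; r_pi[s] = expected reward from s.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)
```

Solving the system exactly only works for small state spaces; the dynamic programming methods below avoid the direct solve by iterating the same backup.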
Optimality
Optimal policy
Dominates all other policies: $v_{\pi_*}(s) \geq v_\pi(s)$ for every state $s$ and every policy $\pi$
May not be unique
Dynamic programming
Policy evaluation (prediction)
Policy evaluation: Determining a value function for a specific policy
Control: Finding a policy that maximizes the return
Planning method: An approach that uses a model of the environment to improve a policy or value function without direct interaction with the environment
Dynamic programming algorithms use the Bellman equations to define iterative algorithms for policy evaluation and control
To do this, we need to know the dynamics function $p(s', r \mid s, a)$
Iterative policy evaluation
Update the estimate of the value function for each state (one sweep): $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$
Repeat until convergence
There is a single-array alternative where each update uses the latest values rather than updating them all at the end of the sweep
Still guaranteed to converge, and may do so faster
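A minimal sketch of the single-array (in-place) version, assuming the model is stored as `dynamics[s][a] = [(prob, next_state, reward), ...]` and the policy as `policy[s][a] = pi(a|s)`; these data structures are assumptions, not something fixed by the notes:

```python
def policy_evaluation(dynamics, policy, gamma, theta=1e-8):
    """Iteratively evaluate a fixed policy; stops when the largest update is below theta."""
    num_states = len(dynamics)
    v = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):  # one sweep over the state space
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for a, outcomes in dynamics[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v             # in-place: later states in the sweep see the new value
        if delta < theta:            # repeat sweeps until (approximate) convergence
            return v
```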
Policy iteration (control)
Policy improvement theorem: If we take a policy and modify it to be greedy with respect to its value function, the new policy will be at least as good as the original one (and strictly better if there's any room for improvement)
Policy iteration alternates between evaluation and improvement in the obvious way
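A minimal sketch of that loop, reusing the `policy_evaluation` sketch and assumed model format above; tie-breaking and the stopping test are simplified for illustration:

```python
def greedy_policy(dynamics, v, gamma):
    # Make the policy greedy with respect to v via a one-step lookahead.
    policy = []
    for s in range(len(dynamics)):
        best_a = max(dynamics[s],
                     key=lambda a: sum(p * (r + gamma * v[s2])
                                       for p, s2, r in dynamics[s][a]))
        policy.append({a: (1.0 if a == best_a else 0.0) for a in dynamics[s]})
    return policy


def policy_iteration(dynamics, gamma):
    # Start from a uniform random policy, then alternate evaluation and improvement.
    policy = [{a: 1.0 / len(actions) for a in actions} for actions in dynamics]
    while True:
        v = policy_evaluation(dynamics, policy, gamma)   # evaluation
        new_policy = greedy_policy(dynamics, v, gamma)   # improvement (greedy w.r.t. v)
        if new_policy == policy:                         # stable policy => done
            return policy, v
        policy = new_policy
```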
Generalized policy iteration
You don't necessarily need to run policy evaluation to convergence before doing policy improvement
Similarly, you don't necessarily need to perform a full policy improvement step before performing policy evaluation
Value iteration applies a single sweep of evaluation between improvement steps, which is equivalent to sweeping with the Bellman optimality update (see the sketch below)
This sweeps the entire state space on each iteration (synchronous dynamic programming)
May be impractical if the state space is large
Asynchronous dynamic programming updates the estimates for a subset of states at each evaluation
Potential to be selective (e.g., update states whose neighbours have had particularly large updates)
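A minimal sketch of value iteration under the same assumed model format, replacing full evaluation with a single sweep of the Bellman optimality backup:

```python
def value_iteration(dynamics, gamma, theta=1e-8):
    v = [0.0] * len(dynamics)
    while True:
        delta = 0.0
        for s in range(len(dynamics)):   # one sweep, updated in place
            new_v = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in dynamics[s].values()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v  # a policy that is greedy with respect to v is then (near-)optimal
```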
Using the value estimates of successor states to improve the current value estimate is known as bootstrapping
Monte Carlo methods are less efficient as the value function is updated independently for each state, though they do have some advantages such as:
No need for model knowledge
No bootstrap error propagation
Brute force search of deterministic policies is also possible, but generally impractical
There are $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies to evaluate, compared to a polynomial amount of work (in the number of states and actions) per iteration for dynamic programming