Fundamentals of reinforcement learning
An introduction to sequential decision making
The k-armed bandit problem
A k-armed bandit is a sequential decision-making problem where you repeatedly choose between k different options (arms) with unknown reward distributions
The value of an action is defined to be the expected reward
Our aim is to choose the action that maximizes this
What to learn? Estimating action values
The sample-average method estimates the action value as the sample mean
The greedy action is the one with the highest estimated value
The sample mean can be updated incrementally, so we don't need to store all the observations: $Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
This can be generalized to $Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$
$\alpha$ is known as the step size
For the sample-average method, $\alpha = \frac{1}{n}$
A constant step size gives exponential smoothing (an exponential, recency-weighted average of past rewards)
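A minimal sketch of the incremental update (the class name and values are illustrative, not from the notes): both the sample-average and constant-step-size rules have the form $Q \leftarrow Q + \alpha(R - Q)$.

```python
class ActionValueEstimate:
    """Incremental estimate of one action's value (illustrative helper)."""

    def __init__(self, initial_value=0.0, step_size=None):
        # step_size=None -> sample-average (alpha = 1/n); otherwise a constant alpha
        self.q = initial_value
        self.n = 0
        self.step_size = step_size

    def update(self, reward):
        self.n += 1
        alpha = 1.0 / self.n if self.step_size is None else self.step_size
        self.q += alpha * (reward - self.q)  # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
        return self.q


sample_avg = ActionValueEstimate()              # tracks the sample mean exactly
smoothed = ActionValueEstimate(step_size=0.1)   # exponential recency weighting
for r in [1.0, 0.0, 2.0]:
    sample_avg.update(r)
    smoothed.update(r)
```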
Exploration vs. exploitation trade-off
Epsilon-greedy strategy
Explore (pick a random action) with probability $\varepsilon$ and exploit (pick the greedy action) with probability $1 - \varepsilon$; see the action-selection sketch below
Optimistic initial values
Setting the initial estimates $Q_1(a)$ really high for all actions results in exploration at the beginning, transitioning to exploitation later on
Not well suited for non-stationary problems
May not know what would count as optimistic until we've seen actual rewards
Upper-confidence bound
Choose the action whose confidence interval has the greatest upper bound: $A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$, where $N_t(a)$ is the number of times action $a$ has been taken
This encourages exploring actions that have been tried less often and so may still turn out to have high value
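A minimal sketch of the two action-selection rules above, assuming the estimates `q_values` and visit counts `counts` are maintained elsewhere (e.g. with the incremental update sketched earlier); the exploration constant `c` is a free parameter.

```python
import math
import random


def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the greedy action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb(q_values, counts, t, c=2.0):
    # Pick the action with the largest upper confidence bound
    # Q_t(a) + c * sqrt(ln t / N_t(a)); untried actions are selected first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q_values)),
               key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]))
```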
A contextual bandit is an extension of the k-armed bandit where the agent receives additional contextual information (features/state) before making each decision, and the rewards depend on both the chosen action and the context
Markov decision processes
Introduction to Markov decision processes
Markov decision processes introduce the idea of a state, with the next state and reward depending stochastically on the current state and action (but not on earlier states; the process has the Markov property)
This is still a sequential process with discrete time steps
Goal of reinforcement learning
The return is defined here as the sum of the rewards received after time $t$: $G_t = R_{t+1} + R_{t+2} + \dots + R_T$; the goal is to maximize its expected value
This assumes that the task is episodic (it reaches some terminal state at a final time step $T$); the sum may not converge otherwise
Reward hypothesis: "All goals can be described by the maximization of expected cumulative rewards"
Goals as rewards
Goal-reward representation: 1 if goal met, 0 otherwise
Action-penalty representation: -1 for each action that doesn't achieve the goal
Provides an incentive to reach the goal more quickly, but runs into issues if it's possible to get stuck and never achieve the goal
It may be difficult for the agent to learn without intermediate rewards (if the search space is too large then it may never meet the goal without some guidance along the way)
Reward sources
Programming
Coding
Human in the loop
Humans may assign rewards in a non-stationary way, which can cause issues
Examples
Mimic reward
Inverse reinforcement learning: Reward function is inferred from actions (so we're going from behaviour to rewards rather than from rewards to behaviour)
Derived indirectly through optimization
Evolutionary optimization
Meta reinforcement learning (learning algorithm is learned by training across multiple tasks)
Continuing tasks
Continuing tasks have no terminal states and cannot be broken up into independent episodes
Adding a discount factor $\gamma \in [0, 1)$ to the sum in the return calculation ensures that the sum still converges (for bounded rewards): $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$
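A minimal sketch of computing a discounted return from a recorded reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    # Work backwards so each step applies G = r + gamma * G.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# e.g. rewards of 1 for three steps with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], 0.9))
```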
Value functions and Bellman equations
Policies and value functions
Deterministic policy: $\pi(s) = a$, a mapping from each state to the action taken there
Stochastic policy: $\pi(a \mid s)$, the probability of taking action $a$ in state $s$
State-value function: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
That is, $v_\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$
Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
Bellman equations
State-value Bellman equation: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
Action-value Bellman equation: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$
Backup (or update) operations transfer value information back to a state (or state-action pair) from its successor states (or state-action pairs)
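For a fixed policy in a small finite MDP, the state-value Bellman equation is a linear system, so $v_\pi$ can be found directly; a minimal sketch with a hypothetical two-state example (the transition matrix and rewards are made up for illustration):

```python
import numpy as np

gamma = 0.9
# P_pi[s, s'] = probability of moving to s' under the policy; r_pi[s] = expected reward from s.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)
```

Solving the system exactly only works for small state spaces; the dynamic programming methods below avoid the direct solve by iterating the same backup.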
Optimality
Optimal policy
Dominates all other policies: $v_{\pi_*}(s) \geq v_\pi(s)$ for every state $s$ and every policy $\pi$
May not be unique
Dynamic programming
Policy evaluation (prediction)
Policy evaluation: Determining a value function for a specific policy
Control: Finding a policy that maximizes the return
Planning method: An approach that uses a model of the environment to improve a policy or value function without direct interaction with the environment
Dynamic programming algorithms use the Bellman equations to define iterative algorithms for policy evaluation and control
To do this, we need to know the dynamics function $p(s', r \mid s, a)$
Iterative policy evaluation
Update the estimate of the value function for each state (one sweep): $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$
Repeat until convergence
There is a single-array alternative where each update uses the latest values rather than updating them all at the end of the sweep
Still guaranteed to converge, and may do so faster
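A minimal sketch of the single-array (in-place) version, assuming the model is stored as `dynamics[s][a] = [(prob, next_state, reward), ...]` and the policy as `policy[s][a] = pi(a|s)`; these data structures are assumptions, not something fixed by the notes:

```python
def policy_evaluation(dynamics, policy, gamma, theta=1e-8):
    """Iteratively evaluate a fixed policy; stops when the largest update is below theta."""
    num_states = len(dynamics)
    v = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):  # one sweep over the state space
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for a, outcomes in dynamics[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v             # in-place: later states in the sweep see the new value
        if delta < theta:            # repeat sweeps until (approximate) convergence
            return v
```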
Policy iteration (control)
Policy improvement theorem: If we take a policy and modify it to be greedy with respect to its value function, the new policy will be at least as good as the original one (and strictly better if there's any room for improvement)
Policy iteration alternates between evaluation and improvement in the obvious way
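A minimal sketch of that loop, reusing the `policy_evaluation` sketch and assumed model format above; tie-breaking and the stopping test are simplified for illustration:

```python
def greedy_policy(dynamics, v, gamma):
    # Make the policy greedy with respect to v via a one-step lookahead.
    policy = []
    for s in range(len(dynamics)):
        best_a = max(dynamics[s],
                     key=lambda a: sum(p * (r + gamma * v[s2])
                                       for p, s2, r in dynamics[s][a]))
        policy.append({a: (1.0 if a == best_a else 0.0) for a in dynamics[s]})
    return policy


def policy_iteration(dynamics, gamma):
    # Start from a uniform random policy, then alternate evaluation and improvement.
    policy = [{a: 1.0 / len(actions) for a in actions} for actions in dynamics]
    while True:
        v = policy_evaluation(dynamics, policy, gamma)   # evaluation
        new_policy = greedy_policy(dynamics, v, gamma)   # improvement (greedy w.r.t. v)
        if new_policy == policy:                         # stable policy => done
            return policy, v
        policy = new_policy
```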
Generalized policy iteration
You don't necessarily need to run policy evaluation to convergence before doing policy improvement
Similarly, you don't necessarily need to perform a full policy improvement step before performing policy evaluation
Value iteration applies a single sweep of evaluation between improvement steps, which is equivalent to sweeping with the Bellman optimality update (see the sketch below)
This sweeps the entire state space on each iteration (synchronous dynamic programming)
May be impractical if the state space is large
Asynchronous dynamic programming updates the estimates for a subset of states at each evaluation
Potential to be selective (e.g., update states whose neighbours have had particularly large updates)
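A minimal sketch of value iteration under the same assumed model format, replacing full evaluation with a single sweep of the Bellman optimality backup:

```python
def value_iteration(dynamics, gamma, theta=1e-8):
    v = [0.0] * len(dynamics)
    while True:
        delta = 0.0
        for s in range(len(dynamics)):   # one sweep, updated in place
            new_v = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in dynamics[s].values()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v  # a policy that is greedy with respect to v is then (near-)optimal
```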
Using the value estimates of successor states to improve the current value estimate is known as bootstrapping
Monte Carlo methods are less efficient as the value function is updated independently for each state, though they do have some advantages such as:
No need for model knowledge
No bootstrap error propagation
Brute force search of deterministic policies is also possible, but generally impractical
There are $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies to evaluate, compared to a polynomial amount of work (in the number of states and actions) per iteration for dynamic programming