Fundamentals of reinforcement learning

An introduction to sequential decision making

The k-armed bandit problem

  • A k-armed bandit is a sequential decision-making problem where you repeatedly choose between k different options (arms) with unknown reward distributions

  • The value of an action is defined to be the expected reward

  • Our aim is to choose the action that maximizes this
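
In symbols (standard bandit notation, following Sutton and Barto rather than anything explicit in these notes): $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$, and the aim is to select the action with the largest $q_*(a)$.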

What to learn? Estimating action values

  • The sample-average method estimates the action value as the sample mean

  • The greedy action is the one with the highest estimated value

  • The sample mean can be updated incrementally, so we don't need to store all the observations

  • This can be generalized to $Q_{n+1} = Q_n + \alpha_n (R_n - Q_n) = \alpha_n R_n + (1 - \alpha_n) Q_n$

    • $\alpha_n$ is known as the step size

    • For the sample-average method, $\alpha_n = 1/n$

    • A constant step size gives exponential smoothing (an exponential recency-weighted average), which is better suited to non-stationary problems
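
A minimal sketch of the incremental update in Python; the reward sequence and the constant step size of 0.1 are illustrative assumptions, not values from the course:

```python
def update_estimate(q, reward, alpha):
    """Incremental update: Q_{n+1} = Q_n + alpha_n * (R_n - Q_n)."""
    return q + alpha * (reward - q)

rewards = [1.0, 0.0, 2.0]  # stand-in reward sequence for one action

# Sample-average method: alpha_n = 1/n recovers the sample mean (here 1.0)
q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = update_estimate(q_avg, r, alpha=1.0 / n)

# Constant step size: an exponential recency-weighted average
q_const = 0.0
for r in rewards:
    q_const = update_estimate(q_const, r, alpha=0.1)
```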

Exploration vs. exploitation trade-off

  • Epsilon-greedy strategy

    • Explore with probability $\varepsilon$ and exploit with probability $1 - \varepsilon$

  • Optimistic initial values

    • Setting the initial estimates $Q_0$ optimistically high for all actions results in exploration at the beginning, transitioning to exploitation later on

    • Not well suited for non-stationary problems

    • May not know what would count as optimistic until we've seen actual rewards

  • Upper-confidence bound

    • Choose the action whose confidence interval has the greatest upper bound

    • $A_t \doteq \argmax_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$

    • This encourages exploring actions that have been tried relatively rarely and so could plausibly have high value (see the sketch after this list)

  • A contextual bandit is an extension of the k-armed bandit where the agent receives additional contextual information (features/state) before making each decision, and the rewards depend on both the chosen action and the context
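
A sketch of the two action-selection rules above (epsilon-greedy and UCB) on a simulated bandit; the number of arms, $\varepsilon$, $c$, and the Gaussian reward model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10                  # number of arms (assumed)
Q = np.zeros(k)         # estimated action values
N = np.zeros(k)         # number of times each arm has been pulled

def epsilon_greedy(Q, epsilon=0.1):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def ucb(Q, N, t, c=2.0):
    """Take the action with the largest upper confidence bound."""
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(N, 1e-12))
    bonus[N == 0] = np.inf          # try every arm at least once
    return int(np.argmax(Q + bonus))

for t in range(1000):
    a = ucb(Q, N, t)                        # or: a = epsilon_greedy(Q)
    reward = rng.normal(loc=0.1 * a)        # stand-in reward distribution (assumed)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]          # sample-average incremental update
```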

Markov decision processes

Introduction to Markov decision processes

  • Markov decision processes introduce the idea of a state, with the new state a non-deterministic function of the action and the current state (but not earlier states; the process has the Markov property)

  • This is still a sequential process with discrete time steps
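
In standard MDP notation (not spelled out in the notes above), these dynamics are captured by a single function, $p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$; this is the $p$ that appears in the Bellman equations and dynamic programming methods below.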

Goal of reinforcement learning

  • The return $G_t$ is defined here as the sum of the rewards received after time $t$; the goal is to maximize its expected value

  • This assumes that the task is episodic (it eventually reaches some terminal state); the sum may not converge otherwise

  • Reward hypothesis: "All goals can be described by the maximization of expected cumulative rewards"

  • Goals as rewards

    • Goal-reward representation: 1 if goal met, 0 otherwise

    • Action-penalty representation: -1 for each action that doesn't achieve the goal

      • Provides an incentive to reach the goal more quickly, but runs into issues if it's possible to get stuck and never achieve the goal

    • It may be difficult for the agent to learn without intermediate rewards (if the search space is too large then it may never meet the goal without some guidance along the way)

  • Reward sources

    • Programming

      • Coding

      • Human in the loop

        • Humans may assign rewards in a non-stationary way, which can cause issues

    • Examples

      • Mimic reward

      • Inverse reinforcement learning: Reward function is inferred from actions (so we're going from behaviour to rewards rather than from rewards to behaviour)

    • Derived indirectly through optimization

      • Evolutionary optimization

      • Meta reinforcement learning (learning algorithm is learned by training across multiple tasks)

Continuing tasks

  • Continuing tasks have no terminal states and cannot be broken up into independent episodes

  • Adding a discount factor $\gamma$ (with $0 \le \gamma < 1$) to the sum in the return calculation ensures that the sum still converges, provided the rewards are bounded
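
Concretely, in the notation of the Bellman equations below: $G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$, and if every reward satisfies $|R_t| \le R_{\max}$ then $|G_t| \le R_{\max} \sum_{k=0}^{\infty} \gamma^k = \frac{R_{\max}}{1 - \gamma}$, so the discounted sum is finite.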

Value functions and Bellman equations

Policies and value functions

  • Deterministic policy: $\pi(s) = a$

  • Stochastic policy: $\pi(a \mid s)$

  • State-value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$

    • That is, $v_\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$

  • Action-value function: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

Bellman equations

  • State-value Bellman equation:

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_r \sum_{s'} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
  • Action-value Bellman equation:

$q_\pi(s, a) = \sum_r \sum_{s'} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a')\right]$
  • Backup (or update) operations transfer value information back to a state (or state-action pair) from its successor states (or state-action pairs)
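
For a fixed policy the state-value Bellman equation is linear in $v_\pi$, so for a small MDP with known dynamics it can be solved exactly. A sketch under assumed array shapes (`P[s, a, s']` for transition probabilities and `R[s, a]` for expected immediate rewards, a simplification of the full $p(s', r \mid s, a)$):

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma=0.9):
    """Solve v = r_pi + gamma * P_pi @ v for a fixed stochastic policy pi.

    P:  (S, A, S) transition probabilities, P[s, a, s'] = p(s' | s, a)
    R:  (S, A) expected immediate rewards
    pi: (S, A) policy probabilities, pi[s, a] = pi(a | s)
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```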

Optimality

  • Optimal policy $\pi_*$

    • Dominates all other policies

    • May not be unique

  • $v_*(s) \doteq \max_\pi v_\pi(s)$

  • $q_*(s, a) \doteq \max_\pi q_\pi(s, a)$

  • $v_*(s) = \max_a \sum_{s'} \sum_r p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$

  • $q_*(s, a) = \sum_{s'} \sum_r p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$

  • $\pi_*(s) = \argmax_a \sum_{s'} \sum_r p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$

Dynamic programming

Policy evaluation (prediction)

  • Policy evaluation: Determining a value function for a specific policy

  • Control: Finding a policy that maximizes the return

  • Planning method: An approach that uses a model of the environment to improve a policy or value function without direct interaction with the environment

  • Dynamic programming algorithms use the Bellman equations to define iterative algorithms for policy evaluation and control

    • To do this, we need to know the dynamics function $p$

  • Iterative policy evaluation

    • Update the estimate of the value function for each state (one sweep)

    • Repeat until convergence

    • There is a single-array alternative where each update uses the latest values rather than updating them all at the end of the sweep

      • Still guaranteed to converge, and may do so faster
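
A sketch of the single-array (in-place) version, under the same assumed `P[s, a, s']` / `R[s, a]` representation as the sketch above:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Sweep the states repeatedly, stopping once the largest update in a
    sweep falls below the threshold theta."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):
            v_old = V[s]
            # Bellman expectation update, using the latest values already in V
            V[s] = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(A))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```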

Policy iteration (control)

  • Policy improvement theorem: If we take a policy and modify it to be greedy with respect to its value function, the new policy will be at least as good as the original one (and strictly better if there's any room for improvement)

  • Policy iteration alternates between policy evaluation and greedy policy improvement, stopping once the policy no longer changes (see the sketch below)
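
A sketch of one way to implement this, using exact evaluation for the evaluation step and a greedy improvement step (same assumed `P[s, a, s']` / `R[s, a]` representation as above):

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Make the policy greedy with respect to the current value function."""
    q = R + gamma * np.einsum("sat,t->sa", P, V)         # q(s, a) under V
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), q.argmax(axis=1)] = 1.0    # deterministic greedy policy
    return pi

def policy_iteration(P, R, gamma=0.9):
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)            # start from the uniform random policy
    while True:
        # Evaluation: solve the linear Bellman system exactly for v_pi
        P_pi = np.einsum("sa,sat->st", pi, P)
        r_pi = np.einsum("sa,sa->s", pi, R)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Improvement: act greedily with respect to v_pi
        new_pi = greedy_policy(P, R, V, gamma)
        if np.array_equal(new_pi, pi):       # policy stable, so it is optimal
            return new_pi, V
        pi = new_pi
```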

Generalized policy iteration

  • You don't necessarily need to run policy evaluation to convergence before doing policy improvement

  • Similarly, you don't necessarily need to perform a full policy improvement step before performing policy evaluation

  • Value iteration applies a single sweep of evaluation between improvement steps, which amounts to using the Bellman optimality equation directly as an update rule (see the sketch after this list)

  • This sweeps the entire state space on each iteration (synchronous dynamic programming)

    • May be impractical if the state space is large

  • Asynchronous dynamic programming updates the estimates for a subset of states at each evaluation

    • Potential to be selective (e.g., update states whose neighbours have had particularly large updates)

  • Using the value estimates of successor states to improve the current value estimate is known as bootstrapping

  • Monte Carlo methods are less efficient as the value function is updated independently for each state, though they do have some advantages such as:

    • No need for model knowledge

    • No bootstrap error propagation

  • Brute force search of deterministic policies is also possible, but generally impractical

    • $O(|A|^{|S|})$ compared to $O(|S|^2 |A|)$ per iteration for dynamic programming
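
A sketch of synchronous value iteration under the same assumed `P[s, a, s']` / `R[s, a]` representation; each sweep applies the Bellman optimality equation directly as an update:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """v(s) <- max_a [ R(s, a) + gamma * sum_s' p(s' | s, a) v(s') ]."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        q = R + gamma * np.einsum("sat,t->sa", P, V)   # q(s, a) under the current V
        V_new = q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = q.argmax(axis=1)    # extract a greedy deterministic policy
    return policy, V_new
```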
