
Understanding machine learning: from theory to algorithms (in progress)

Part 1: Foundations

Chapter 2: A gentle start

Empirical risk is the average loss (error) measured on a set of observed data points. Empirical risk minimization (unsurprisingly) finds a hypothesis that minimizes empirical risk.

As there are infinitely many such hypotheses, and unrestricted minimization tends to lead to overfitting, we instead apply empirical risk minimization over a restricted search space, known as the hypothesis class, where the choice of restrictions reflects our inductive bias about the underlying problem structure:

$\text{ERM}_\mathcal{H}(S) \in \argmin_{h \in \mathcal{H}} L_S(h)$
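A minimal Python sketch of this rule over a small finite hypothesis class (the threshold class, the grid of integer thresholds, and the sample are my own toy choices, not from the book):

```python
# Toy ERM over a finite hypothesis class.
# Each hypothesis maps a real-valued input to a 0/1 label.

def empirical_risk(h, sample):
    """Average 0-1 loss of hypothesis h on the sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypothesis_class, sample):
    """Return a hypothesis minimizing empirical risk on the sample."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, sample))

# Hypothesis class: threshold functions at integer points 0..10.
H = [lambda x, t=t: int(x >= t) for t in range(11)]

# Sample labelled by the true threshold t* = 4 (realizable case).
S = [(x, int(x >= 4)) for x in [1.0, 2.5, 3.9, 4.1, 6.0, 8.2]]

h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0 — realizable, so ERM fits the sample perfectly
```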

Union bound: $\mathcal{D}(A\cup B) \leq \mathcal{D}(A)+\mathcal{D}(B)$ (obvious result, but I've not heard that name before)

Chapter 3: A formal learning model

The realizable assumption in Probably Approximately Correct learning means that the target function belongs to the hypothesis class (or at least that there is a hypothesis in the class that matches it with probability 1; they could disagree on a set of measure 0).

A hypothesis class $\mathcal{H}$ is Probably Approximately Correct learnable if there exist a function $m_\mathcal{H} \colon (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\varepsilon,\delta \in (0,1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every labelling function $f \colon \mathcal{X} \to \{0,1\}$, if the realizable assumption holds with respect to $\mathcal{H},\mathcal{D},f$, then when running the learning algorithm on $m \geq m_\mathcal{H}(\varepsilon,\delta)$ i.i.d. examples generated by $\mathcal{D}$ and labelled by $f$, the algorithm returns a hypothesis $h$ such that, with probability at least $1-\delta$ (over the choice of the examples), $L_{(\mathcal{D},f)}(h)\leq \varepsilon$.

Paraphrasing Anchorman: at least $1-\delta$ of the time, it works with error at most $\varepsilon$, every time.

The minimal such $m_\mathcal{H}$ determines the sample complexity of learning $\mathcal{H}$.

Every finite hypothesis class is PAC learnable, as are some infinite ones. The spoiler for Chapter 6: The Vapnik-Chervonenkis dimension is that PAC learnability comes down to whether the Vapnik-Chervonenkis dimension is finite.
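For finite classes under the realizable assumption, the proof via the union bound gives the sample-complexity bound $m_\mathcal{H}(\varepsilon,\delta) \leq \lceil \log(|\mathcal{H}|/\delta)/\varepsilon \rceil$. A minimal calculator for this bound (the function name and the example numbers are my own):

```python
import math

def sample_complexity_finite_realizable(h_size, eps, delta):
    """Upper bound on the samples needed to PAC-learn a finite
    hypothesis class of size h_size under the realizable assumption:
    m >= ln(|H| / delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

# E.g. |H| = 1000, eps = 0.1, delta = 0.05:
print(sample_complexity_finite_realizable(1000, 0.1, 0.05))  # 100
```

Note the logarithmic dependence on $|\mathcal{H}|$: doubling the class size barely moves the bound.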

The realizable assumption rarely holds in practice. Relaxing it gives agnostic PAC learnability, where the error bound becomes $L_\mathcal{D}(h) \leq \min_{h' \in \mathcal{H}} L_\mathcal{D}(h') + \varepsilon$.

All of the examples in the book so far have assumed that we are trying to predict a binary label, but there's no reason that it can't be expanded to other problems by changing the loss function.

The risk function gives the expected loss for a given hypothesis.

Improper learning relaxes the requirement that the learning algorithm output a hypothesis from the original hypothesis class. The algorithm can output a hypothesis from a larger class as long as it achieves the desired error bounds.

Chapter 4: Learning via uniform convergence

A sample is $\varepsilon$-representative if the empirical loss on that sample is within $\varepsilon$ of the expected loss on the distribution for all hypotheses in the hypothesis class.

Applying empirical risk minimization to an $\varepsilon$-representative sample will result in a hypothesis with low loss on the true distribution. As a result, if a sample of a given size is $\varepsilon$-representative with probability at least $1-\delta$, then the ERM rule is an agnostic PAC learner. This is known as the uniform convergence property. (The "uniform" is because a fixed sample size works for all hypotheses in the class and all possible probability distributions over the domain.)

Finite hypothesis classes can be shown to have the uniform convergence property using Hoeffding's inequality.
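Hoeffding's inequality bounds the probability that an empirical average deviates from its mean by more than $\varepsilon$ by $2e^{-2m\varepsilon^2}$. A quick Monte Carlo sanity check for a single fixed hypothesis (my own simulation; the true risk $p$, sample size, and trial count are arbitrary choices):

```python
import math
import random

# Monte Carlo check of Hoeffding's inequality for one fixed hypothesis.
# The hypothesis has true risk p; we draw samples of size m and count
# how often the empirical risk deviates from p by more than eps.

random.seed(0)
p, m, eps, trials = 0.3, 200, 0.1, 2000

bad = 0
for _ in range(trials):
    empirical = sum(random.random() < p for _ in range(m)) / m
    if abs(empirical - p) > eps:
        bad += 1

freq = bad / trials
hoeffding_bound = 2 * math.exp(-2 * m * eps ** 2)  # ~= 0.037

print(freq <= hoeffding_bound)  # the observed frequency respects the bound
```

Extending this to a whole finite class is just the union bound over its hypotheses, which is where the $\log|\mathcal{H}|$ term in the sample complexity comes from.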

Chapter 5: The bias-complexity tradeoff

The No Free Lunch Theorem states that when averaged over all possible data distributions, every learning algorithm performs exactly the same; there is no universal learner. As a result, overly broad hypothesis classes (such as the class of all functions over an infinite domain) are not PAC learnable.

We can decompose the error into the approximation error (the minimum loss achievable by any hypothesis in the class) and the estimation error (the additional error resulting from the vagaries of the particular sample). A richer hypothesis class will result in lower approximation error but higher estimation error. This is the bias-variance tradeoff (or the bias-complexity tradeoff, as it is termed in this book).
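In symbols (writing $h_S$ for the hypothesis ERM returns on sample $S$):

```latex
L_\mathcal{D}(h_S) = \varepsilon_{\text{app}} + \varepsilon_{\text{est}},
\qquad
\varepsilon_{\text{app}} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h),
\qquad
\varepsilon_{\text{est}} = L_\mathcal{D}(h_S) - \varepsilon_{\text{app}}
```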

Chapter 6: The Vapnik-Chervonenkis dimension

Infinite classes can be PAC learnable. The example given is threshold functions on the real line. For any $\varepsilon$, we can define an interval about the true threshold that contains $\varepsilon$ probability mass on each side (or is unbounded on that side and has probability mass less than $\varepsilon$). With a sufficiently large sample size, the algorithm is likely to produce a threshold inside this interval.
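A small simulation of this (my own; the uniform distribution on $[0,1]$ and true threshold $0.5$ are assumptions, and returning the smallest positive example is one valid ERM choice):

```python
import random

# ERM for threshold functions on [0, 1].
# True labelling: f(x) = 1 iff x >= 0.5. Returning the smallest
# positively-labelled example achieves zero empirical risk here.

random.seed(1)

def erm_threshold(sample):
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

m = 1000
S = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(m))]
t_hat = erm_threshold(S)

# Under the uniform distribution, the true error is the probability
# mass between the estimated threshold and the true one.
true_error = abs(t_hat - 0.5)
print(round(true_error, 4))
```

With 1000 samples the estimate lands very close to the true threshold with high probability, matching the interval argument above.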

The restriction of $\mathcal{H}$ to a finite set $C = \{c_1, \ldots, c_m\}$ is the set of all possible labellings that $\mathcal{H}$ can generate on $C$. That is, $\mathcal{H}_C = \{(h(c_1),\ldots,h(c_m)) : h \in \mathcal{H}\}$.

$\mathcal{H}$ shatters $C$ if its restriction to $C$ is the set of all possible functions from $C$ to $\{0,1\}$. That is, $|\mathcal{H}_C|=2^{|C|}$.

Returning to the previous example, the hypothesis class of threshold functions on the real line cannot shatter a set with two or more elements: no threshold can label a smaller point 1 and a larger point 0.

The Vapnik-Chervonenkis dimension of a hypothesis class is the size of the largest set that can be shattered by the hypothesis class.
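These definitions can be checked by brute force for the threshold class (my own illustration; a fine grid of thresholds is enough to realize every achievable labelling on small point sets):

```python
# Brute-force shattering check for threshold functions h_t(x) = 1 iff x >= t.

thresholds = [i / 100 for i in range(-100, 201)]  # grid over [-1, 2]

def labellings(points):
    """All distinct labellings the threshold class induces on points."""
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def shatters(points):
    return len(labellings(points)) == 2 ** len(points)

print(shatters([0.5]))       # True: both labels are achievable on one point
print(shatters([0.3, 0.7]))  # False: the labelling (1, 0) is impossible
```

Since some one-point set is shattered but no two-point set is, the VC dimension of the threshold class is 1.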

If $\mathcal{H}$ shatters some set of size $2m$, then we cannot learn $\mathcal{H}$ from $m$ examples, as there is too much flexibility. As a result, hypothesis classes with infinite VC dimension are not PAC learnable.
