# Understanding machine learning: from theory to algorithms (in progress)

## Part 1: Foundations

### Chapter 2: A gentle start

**Empirical risk** is the average loss (error) measured on a set of observed data points. **Empirical risk minimization** (unsurprisingly) finds a hypothesis that minimizes empirical risk.

As there may be infinitely many such hypotheses and the process tends to lead to overfitting, we can apply empirical risk minimization over a restricted search space, known as the **hypothesis class**, where the choice of restrictions reflects our **inductive bias** about the underlying problem structure.

$$
\text{ERM}\_{\mathcal{H}}(S) \in \operatorname{argmin}\_{h \in \mathcal{H}} L\_S(h)
$$
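As a concrete sketch, ERM over a finite hypothesis class is just a minimization of average loss. The helper names below (`empirical_risk`, `erm`) and the toy threshold class are my own illustration, not the book's notation:

```python
def empirical_risk(h, sample):
    """L_S(h): average 0-1 loss of hypothesis h on a labelled sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypothesis_class, sample):
    """ERM_H(S): return a hypothesis in the class minimizing empirical risk."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, sample))

# Toy finite class: integer thresholds h_t(x) = 1[x >= t] on [0, 10].
# The labels below are consistent with the threshold t = 5.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(11)]
sample = [(x + 0.5, int(x + 0.5 >= 5)) for x in range(10)]
best = erm(hypotheses, sample)
```

With a realizable sample like this one, `best` achieves zero empirical risk.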

**Union bound:** $$\mathcal{D}(A\cup B) \leq \mathcal{D}(A)+\mathcal{D}(B)$$ (obvious result, but I've not heard that name before)

### Chapter 3: A formal learning model

The **realizable assumption** in **Probably Approximately Correct** learning means that the target function belongs to the hypothesis class (or at least that there is a hypothesis in the class that matches it with probability 1; they could disagree on a set of measure 0).

A hypothesis class $$\mathcal{H}$$ is **Probably Approximately Correct learnable** if there exists a function $$m\_\mathcal{H} \colon (0,1)^2 \to \mathbb{N}$$ and a learning algorithm with the property: For every $$\varepsilon,\delta \in (0,1)$$, for every distribution $$\mathcal{D}$$ over $$\mathcal{X}$$, and for every labelling function $$f \colon \mathcal{X} \to \{0,1\}$$, if the realizable assumption holds with respect to $$\mathcal{H},\mathcal{D},f$$, then when running the learning algorithm on $$m \geq m\_\mathcal{H}(\varepsilon,\delta)$$ i.i.d. examples generated by $$\mathcal{D}$$ and labelled by $$f$$, the algorithm returns a hypothesis $$h$$ such that, with probability of at least $$1-\delta$$ (over the choice of the examples), $$L\_{(\mathcal{D},f)}(h)\leq \varepsilon$$.

Paraphrasing Anchorman: At least $$1-\delta$$ of the time, it works with error at most $$\varepsilon$$, every time.

The minimal such $$m\_\mathcal{H}$$ determines the **sample complexity** of learning $$\mathcal{H}$$.

Every finite hypothesis class is PAC learnable, as are some infinite ones. The spoiler for [#chapter-6-the-vapnik-chervonenkis-dimension](#chapter-6-the-vapnik-chervonenkis-dimension "mention") is that it's based on whether the **Vapnik-Chervonenkis dimension** is finite.
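For a finite class under the realizable assumption, the book's bound is $$m\_\mathcal{H}(\varepsilon,\delta) \leq \lceil \log(|\mathcal{H}|/\delta)/\varepsilon \rceil$$. A quick calculator for it (the function name is mine):

```python
import math

def sample_complexity_finite_realizable(h_size, epsilon, delta):
    """Upper bound on m_H(epsilon, delta) for a finite hypothesis class
    under the realizable assumption: ceil(log(|H|/delta) / epsilon)."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# e.g. |H| = 1000, epsilon = 0.1, delta = 0.05
m = sample_complexity_finite_realizable(1000, 0.1, 0.05)
```

Note the logarithmic dependence on $$|\mathcal{H}|$$ and $$1/\delta$$ but linear dependence on $$1/\varepsilon$$.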

The realizable assumption rarely holds in practice. Relaxing it gives **agnostic** PAC learnability, where the error bound is $$L\_\mathcal{D}(h) \leq \min\_{h' \in \mathcal{H}} L\_\mathcal{D}(h') + \varepsilon$$.

All of the examples in the book so far have assumed that we are trying to predict a binary label, but there's no reason that it can't be expanded to other problems by changing the loss function.

The **risk function** gives the expected loss for a given hypothesis.

**Improper learning** relaxes the requirement that the learning algorithm output a hypothesis from the original hypothesis class. The algorithm can output a hypothesis from a larger class as long as it achieves the desired error bounds.

### Chapter 4: Learning via uniform convergence

A sample is **ε-representative** if, for every hypothesis in the hypothesis class, the empirical loss on that sample is within epsilon of the true loss on the distribution.

Applying empirical risk minimization to an ε-representative sample will result in a hypothesis with low loss on the true distribution. As a result, if the sample is ε-representative with probability at least $$1-\delta$$ then the ERM rule is an agnostic PAC learner. This is known as the **uniform convergence** property. (The "uniform" is because there is a fixed sample size that works for all hypotheses in the class and all possible probability distributions over the domain.)

Finite hypothesis classes can be shown to have the uniform convergence property using **Hoeffding's inequality**.
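Concretely, Hoeffding's inequality plus a union bound over the (finite) class gives the book's uniform convergence sample size $$m \geq \lceil \log(2|\mathcal{H}|/\delta)/(2\varepsilon^2) \rceil$$. A small sketch (function name is my own):

```python
import math

def uniform_convergence_sample_size(h_size, epsilon, delta):
    """Sample size making a sample epsilon-representative with
    probability >= 1 - delta, for a finite class of size h_size:
    ceil(log(2|H|/delta) / (2 * epsilon^2)), via Hoeffding + union bound."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * epsilon ** 2))

# Same parameters as before: |H| = 1000, epsilon = 0.1, delta = 0.05
m = uniform_convergence_sample_size(1000, 0.1, 0.05)
```

Compared with the realizable bound, the agnostic setting pays a $$1/\varepsilon^2$$ rather than $$1/\varepsilon$$ price.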

### Chapter 5: The bias-complexity tradeoff

The **No Free Lunch Theorem** states that when averaged over all possible data distributions, every learning algorithm performs exactly the same; there is no universal learner. As a result, overly broad hypothesis classes are not PAC learnable.

We can decompose the error into the **approximation error** ([the minimum risk achievable by a predictor in the hypothesis class](#user-content-fn-1)[^1]) and the **estimation error** (the additional error resulting from the vagaries of the particular sample). A richer hypothesis class will result in lower approximation error but higher estimation error. This is the **bias-variance tradeoff** (or the **bias-complexity tradeoff** as it is termed in this book).

### Chapter 6: The Vapnik-Chervonenkis dimension

Infinite classes can be PAC learnable. The example given is threshold functions on the real line. For any epsilon, we can define an interval about the true value that contains epsilon probability mass on each side (or is unbounded on that side and has probability less than epsilon). With a sufficiently large sample size, the algorithm is likely to produce a threshold inside this interval.
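An ERM rule for this class is easy to write down: any threshold between the largest point labelled 0 and the smallest point labelled 1 has zero empirical risk. A sketch, assuming a realizable sample (the function name and tie-breaking choice are mine):

```python
def erm_threshold(sample):
    """ERM for threshold hypotheses h_t(x) = 1 if x >= t else 0,
    assuming the sample is consistent with some threshold."""
    lo = max((x for x, y in sample if y == 0), default=None)
    hi = min((x for x, y in sample if y == 1), default=None)
    if lo is None:        # everything labelled 1
        return hi
    if hi is None:        # everything labelled 0: any t above lo works
        return lo + 1.0
    assert lo < hi, "sample not realizable by a threshold"
    return (lo + hi) / 2  # strictly between lo and hi: zero empirical risk

t = erm_threshold([(1.0, 0), (2.0, 0), (4.0, 1), (6.0, 1)])
```

As the sample grows, `lo` and `hi` squeeze the interval around the true threshold, which is the intuition behind the learnability argument above.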

The **restriction** of $$\mathcal{H}$$ to a finite set $$C$$ is the set of all possible labellings that $$\mathcal{H}$$ can generate on $$C$$. That is $$\mathcal{H}\_C = \{(h(c\_1),\ldots,h(c\_m)) : h \in \mathcal{H}\}$$.

$$\mathcal{H}$$ **shatters** $$C$$ if its restriction to $$C$$ is the set of all possible functions from $$C$$ to $$\{0,1\}$$. That is $$|\mathcal{H}\_C|=2^{|C|}$$.

Returning to the previous example, the hypothesis class of threshold functions on the real line cannot shatter a set with two or more elements.

The **Vapnik-Chervonenkis dimension** of a hypothesis class is the size of the largest set that can be shattered by the hypothesis class.
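For small finite classes, restriction and shattering can be checked by brute force. A sketch using the threshold class from earlier (helper names are my own):

```python
def restriction(hypothesis_class, points):
    """H_C: the set of labellings the class induces on the finite set C."""
    return {tuple(h(x) for x in points) for h in hypothesis_class}

def shatters(hypothesis_class, points):
    """H shatters C iff it realizes all 2^|C| labellings of C."""
    return len(restriction(hypothesis_class, points)) == 2 ** len(points)

# Integer thresholds h_t(x) = 1[x >= t] for t in 0..10.
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]
labellings = restriction(thresholds, [3.5, 7.5])
```

On the two-point set $$\{3.5, 7.5\}$$ the only achievable labellings are $$(1,1)$$, $$(0,1)$$, and $$(0,0)$$; the labelling $$(1,0)$$ is impossible for a threshold, which is why the VC dimension of this class is 1.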

If $$\mathcal{H}$$ shatters some set of size $$2m$$ then we cannot learn $$\mathcal{H}$$ from $$m$$ examples as there is too much flexibility. As a result, hypothesis classes with infinite VC dimension are not PAC learnable.

[^1]: Either in absolute terms or relative to the Bayes optimal predictor.

