# Understanding machine learning: from theory to algorithms (in progress)

## Part 1: Foundations

### Chapter 2: A gentle start

**Empirical risk** is the average loss (error) measured on a set of observed data points. **Empirical risk minimization** (unsurprisingly) finds a hypothesis that minimizes empirical risk.

As there may be infinitely many such hypotheses and the process tends to lead to overfitting, we can apply empirical risk minimization over a restricted search space, known as the **hypothesis class**, where the choice of restrictions reflects our **inductive bias** about the underlying problem structure.

$$
\text{ERM}\_{\mathcal{H}}(S) \in \operatorname{argmin}\_{h \in \mathcal{H}} L\_S(h)
$$
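As a concrete sketch, ERM over a finite hypothesis class is just a minimization of average loss. The helper names below (`empirical_risk`, `erm`) and the toy threshold class are my own illustration, not the book's notation:

```python
def empirical_risk(h, sample):
    """L_S(h): average 0-1 loss of hypothesis h on a labelled sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypothesis_class, sample):
    """ERM_H(S): return a hypothesis in the class minimizing empirical risk."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, sample))

# Toy finite class: integer thresholds h_t(x) = 1[x >= t] on [0, 10].
# The labels below are consistent with the threshold t = 5.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(11)]
sample = [(x + 0.5, int(x + 0.5 >= 5)) for x in range(10)]
best = erm(hypotheses, sample)
```

With a realizable sample like this one, `best` achieves zero empirical risk.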

**Union bound:** $$\mathcal{D}(A\cup B) \leq \mathcal{D}(A)+\mathcal{D}(B)$$ (obvious result, but I've not heard that name before)

### Chapter 3: A formal learning model

The **realizable assumption** in **Probably Approximately Correct** learning means that the target function belongs to the hypothesis class (or at least that there is a hypothesis in the class that matches it with probability 1; they could disagree on a set of measure 0).

A hypothesis class $$\mathcal{H}$$ is **Probably Approximately Correct learnable** if there exists a function $$m\_\mathcal{H} \colon (0,1)^2 \to \mathbb{N}$$ and a learning algorithm with the property: For every $$\varepsilon,\delta \in (0,1)$$, for every distribution $$\mathcal{D}$$ over $$\mathcal{X}$$, and for every labelling function $$f \colon \mathcal{X} \to \{0,1\}$$, if the realizable assumption holds with respect to $$\mathcal{H},\mathcal{D},f$$, then when running the learning algorithm on $$m \geq m\_\mathcal{H}(\varepsilon,\delta)$$ i.i.d. examples generated by $$\mathcal{D}$$ and labelled by $$f$$, the algorithm returns a hypothesis $$h$$ such that, with probability of at least $$1-\delta$$ (over the choice of the examples), $$L\_{(\mathcal{D},f)}(h)\leq \varepsilon$$.

Paraphrasing Anchorman: At least $$1-\delta$$ of the time, it works with error at most $$\varepsilon$$, every time.

The minimal such $$m\_\mathcal{H}$$ determines the **sample complexity** of learning $$\mathcal{H}$$.

Every finite hypothesis class is PAC learnable, as are some infinite ones. The spoiler for [#chapter-6-the-vapnik-chervonenkis-dimension](#chapter-6-the-vapnik-chervonenkis-dimension "mention") is that it's based on whether the **Vapnik-Chervonenkis dimension** is finite.
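For a finite class under the realizable assumption, the book's bound is $$m\_\mathcal{H}(\varepsilon,\delta) \leq \lceil \log(|\mathcal{H}|/\delta)/\varepsilon \rceil$$. A quick calculator for it (the function name is mine):

```python
import math

def sample_complexity_finite_realizable(h_size, epsilon, delta):
    """Upper bound on m_H(epsilon, delta) for a finite hypothesis class
    under the realizable assumption: ceil(log(|H|/delta) / epsilon)."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# e.g. |H| = 1000, epsilon = 0.1, delta = 0.05
m = sample_complexity_finite_realizable(1000, 0.1, 0.05)
```

Note the logarithmic dependence on $$|\mathcal{H}|$$ and $$1/\delta$$ but linear dependence on $$1/\varepsilon$$.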

The realizable assumption rarely holds in practice. Relaxing it gives **agnostic** PAC learnability, where the error bound is $$L\_\mathcal{D}(h) \leq \min\_{h' \in \mathcal{H}} L\_\mathcal{D}(h') + \varepsilon$$.

All of the examples in the book so far have assumed that we are trying to predict a binary label, but there's no reason that it can't be expanded to other problems by changing the loss function.

The **risk function** gives the expected loss for a given hypothesis.

**Improper learning** relaxes the requirement that the learning algorithm output a hypothesis from the original hypothesis class. The algorithm can output a hypothesis from a larger class as long as it achieves the desired error bounds.

### Chapter 4: Learning via uniform convergence

A sample is **ε-representative** if, for every hypothesis in the hypothesis class, the empirical loss on that sample is within epsilon of the true loss on the distribution.

Applying empirical risk minimization to an ε-representative sample will result in a hypothesis with low loss on the true distribution. As a result, if the sample is ε-representative with probability at least $$1-\delta$$ then the ERM rule is an agnostic PAC learner. This is known as the **uniform convergence** property. (The "uniform" is because there is a fixed sample size that works for all hypotheses in the class and all possible probability distributions over the domain.)

Finite hypothesis classes can be shown to have the uniform convergence property using **Hoeffding's inequality**.
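Concretely, Hoeffding's inequality plus a union bound over the (finite) class gives the book's uniform convergence sample size $$m \geq \lceil \log(2|\mathcal{H}|/\delta)/(2\varepsilon^2) \rceil$$. A small sketch (function name is my own):

```python
import math

def uniform_convergence_sample_size(h_size, epsilon, delta):
    """Sample size making a sample epsilon-representative with
    probability >= 1 - delta, for a finite class of size h_size:
    ceil(log(2|H|/delta) / (2 * epsilon^2)), via Hoeffding + union bound."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * epsilon ** 2))

# Same parameters as before: |H| = 1000, epsilon = 0.1, delta = 0.05
m = uniform_convergence_sample_size(1000, 0.1, 0.05)
```

Compared with the realizable bound, the agnostic setting pays a $$1/\varepsilon^2$$ rather than $$1/\varepsilon$$ price.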

### Chapter 5: The bias-complexity tradeoff

The **No Free Lunch Theorem** states that when averaged over all possible data distributions, every learning algorithm performs exactly the same; there is no universal learner. As a result, overly broad hypothesis classes are not PAC learnable.

We can decompose the error into the **approximation error** ([the minimum risk achievable by a predictor in the hypothesis class](#user-content-fn-1)[^1]) and the **estimation error** (the additional error resulting from the vagaries of the particular sample). A richer hypothesis class will result in lower approximation error but higher estimation error. This is the **bias-variance tradeoff** (or the **bias-complexity tradeoff** as it is termed in this book).

### Chapter 6: The Vapnik-Chervonenkis dimension

Infinite classes can be PAC learnable. The example given is threshold functions on the real line. For any epsilon, we can define an interval about the true value that contains epsilon probability mass on each side (or is unbounded on that side and has probability less than epsilon). With a sufficiently large sample size, the algorithm is likely to produce a threshold inside this interval.
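An ERM rule for this class is easy to write down: any threshold between the largest point labelled 0 and the smallest point labelled 1 has zero empirical risk. A sketch, assuming a realizable sample (the function name and tie-breaking choice are mine):

```python
def erm_threshold(sample):
    """ERM for threshold hypotheses h_t(x) = 1 if x >= t else 0,
    assuming the sample is consistent with some threshold."""
    lo = max((x for x, y in sample if y == 0), default=None)
    hi = min((x for x, y in sample if y == 1), default=None)
    if lo is None:        # everything labelled 1
        return hi
    if hi is None:        # everything labelled 0: any t above lo works
        return lo + 1.0
    assert lo < hi, "sample not realizable by a threshold"
    return (lo + hi) / 2  # strictly between lo and hi: zero empirical risk

t = erm_threshold([(1.0, 0), (2.0, 0), (4.0, 1), (6.0, 1)])
```

As the sample grows, `lo` and `hi` squeeze the interval around the true threshold, which is the intuition behind the learnability argument above.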

The **restriction** of $$\mathcal{H}$$ to a finite set $$C$$ is the set of all possible labellings that $$\mathcal{H}$$ can generate on $$C$$. That is $$\mathcal{H}\_C = \{(h(c\_1),\ldots,h(c\_m)) : h \in \mathcal{H}\}$$.

$$\mathcal{H}$$ **shatters** $$C$$ if its restriction to $$C$$ is the set of all possible functions from $$C$$ to $$\{0,1\}$$. That is $$|\mathcal{H}\_C|=2^{|C|}$$.

Returning to the previous example, the hypothesis class of threshold functions on the real line cannot shatter a set with two or more elements.

The **Vapnik-Chervonenkis dimension** of a hypothesis class is the size of the largest set that can be shattered by the hypothesis class.
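For small finite classes, restriction and shattering can be checked by brute force. A sketch using the threshold class from earlier (helper names are my own):

```python
def restriction(hypothesis_class, points):
    """H_C: the set of labellings the class induces on the finite set C."""
    return {tuple(h(x) for x in points) for h in hypothesis_class}

def shatters(hypothesis_class, points):
    """H shatters C iff it realizes all 2^|C| labellings of C."""
    return len(restriction(hypothesis_class, points)) == 2 ** len(points)

# Integer thresholds h_t(x) = 1[x >= t] for t in 0..10.
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]
labellings = restriction(thresholds, [3.5, 7.5])
```

On the two-point set $$\{3.5, 7.5\}$$ the only achievable labellings are $$(1,1)$$, $$(0,1)$$, and $$(0,0)$$; the labelling $$(1,0)$$ is impossible for a threshold, which is why the VC dimension of this class is 1.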

If $$\mathcal{H}$$ shatters some set of size $$2m$$ then we cannot learn $$\mathcal{H}$$ from $$m$$ examples as there is too much flexibility. As a result, hypothesis classes with infinite VC dimension are not PAC learnable.

[^1]: Either in absolute terms or relative to the Bayes optimal predictor.

