Bootstrap ensembles repeatedly resample the training set, train a separate model on each resample, and average their predictions. This averaging can reduce variance.
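The resample-train-average loop can be sketched as follows. This is a minimal illustration, not the text's implementation; the toy base learner `fit_mean` (which just predicts the mean target of its training sample) is an assumption introduced here for demonstration.

```python
import random
import statistics

def fit_mean(sample):
    """Toy base learner: predicts the mean target of its training sample everywhere."""
    m = statistics.mean(y for _, y in sample)
    return lambda x: m

def bagged_predict(train, x, fit, B=25, seed=0):
    """Train one model per bootstrap resample of `train` and average the B predictions."""
    rng = random.Random(seed)
    n = len(train)
    preds = []
    for _ in range(B):
        # Resample n points with replacement from the training set
        boot = [train[rng.randrange(n)] for _ in range(n)]
        preds.append(fit(boot)(x))
    # Averaging over the ensemble reduces the variance of the prediction
    return sum(preds) / B

train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
pred = bagged_predict(train, x=1.0, fit=fit_mean)
```

Any base learner with the same `fit(sample) -> predictor` shape could be substituted for `fit_mean`.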
Bootstrap for Estimating Epistemic Uncertainty
Predictive uncertainty can be written as a sum of aleatoric and epistemic uncertainty:

$$\mathrm{Predictive}(x) = \mathrm{Aleatoric}(x) + \mathrm{Epistemic}(x),$$

where $f^*$ is the true function we try to learn. It is hard to disentangle these two uncertainties of different nature:

- We directly observe the predictive uncertainty.
- But $f^*$ is unknown, so the split into the two terms is not observable.
Epistemic Uncertainty
- The epistemic uncertainty is:

$$\mathrm{Epistemic}(x) = \big(\hat f_S(x) - f^*(x)\big)^2,$$

where $\hat f_S$ is the model trained on the sample $S$.
- Note: $\mathrm{Epistemic}(x)$ is a function of the random sample $S$; thus $\mathrm{Epistemic}(x)$ is itself a random variable.
- Goal: Estimate the distribution of $\mathrm{Epistemic}(x)$.
- Assumption: Our model is unbiased, i.e. $\mathbb{E}_S\big[\hat f_S(x)\big] = f^*(x)$ (we need to assume its bias is zero).
- But $S$ is drawn from $P$, and $P$ is an unknown distribution.
What if we knew $P$?
- We could take a very large number $B$ of samples $S_1, \dots, S_B$, each of size $n$ and drawn i.i.d. from $P$, and train the model on each.
- Then $\hat f_{S_1}, \dots, \hat f_{S_B}$ are i.i.d. samples from the distribution of $\hat f_S$.
- By our unbiasedness assumption, $\bar f(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f_{S_b}(x) \approx f^*(x)$, so
- $\big(\hat f_{S_b}(x) - \bar f(x)\big)^2$, for $b = 1, \dots, B$, are approximately i.i.d. samples from the distribution of $\mathrm{Epistemic}(x)$.
- Problem: We don't have access to an unlimited number of samples from $P$; we only have a single sample, $S$.
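The known-$P$ thought experiment above can be simulated. In the sketch below, $P$ is chosen as a toy distribution (inputs uniform on $[0,1]$, targets $y = 2x$ plus Gaussian noise) and the model is a least-squares line through the origin; all of these choices are illustrative assumptions, not part of the text. Because $P$ and $f^*$ are known here, we can draw many datasets and observe samples of $\mathrm{Epistemic}(x)$ directly.

```python
import random
import statistics

rng = random.Random(0)

def true_f(x):
    # f*: the (here known) true function
    return 2.0 * x

def draw_dataset(n=50):
    # One dataset S of size n drawn i.i.d. from the (here known) distribution P
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    # Least-squares fit of a line through the origin: y ≈ slope * x
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

x0 = 0.8
# Each dataset S_b yields one sample of Epistemic(x0) = (f_S(x0) - f*(x0))^2
epistemic_samples = []
for _ in range(200):
    xs, ys = draw_dataset()
    slope = fit_slope(xs, ys)
    epistemic_samples.append((slope * x0 - true_f(x0)) ** 2)

mean_epistemic = statistics.mean(epistemic_samples)
```

With $P$ known, `epistemic_samples` approximates the whole distribution of $\mathrm{Epistemic}(x_0)$; the bootstrap's job is to recover something like this when only one dataset is available.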
Bootstrap
- Idea: Given samples $z_1, \dots, z_n$ from the unknown distribution $P$, we can approximate $P$ by the empirical distribution (here, $\delta_{z_i}$ denotes a Dirac delta centred on $z_i$):

$$\hat P = \frac{1}{n} \sum_{i=1}^{n} \delta_{z_i}.$$

- This can be made to work for continuous distributions using the probability density function:

$$\hat p(z) = \frac{1}{n} \sum_{i=1}^{n} \delta(z - z_i).$$

- Now use $\hat P$ in place of the unknown $P$.
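Drawing i.i.d. from the empirical distribution $\hat P$ is exactly sampling with replacement from the observed data, which the following minimal sketch makes concrete (the data values are arbitrary placeholders):

```python
import random

def sample_from_empirical(data, rng):
    """One draw from P-hat = (1/n) * sum of Dirac deltas at the observed points:
    each data point has probability 1/n."""
    return data[rng.randrange(len(data))]

rng = random.Random(0)
data = [1.7, 2.4, 3.1, 4.8]

# A bootstrap sample: n i.i.d. draws from P-hat, i.e. sampling with replacement
bootstrap_sample = [sample_from_empirical(data, rng) for _ in range(len(data))]

# Every element of the bootstrap sample is one of the original observations
assert all(z in data for z in bootstrap_sample)
```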
Practical Algorithm for Estimating Epistemic Uncertainty
Algorithm: Bootstrap (input: sample $S = \{z_1, \dots, z_n\}$)

- Define the empirical distribution: $\hat P = \frac{1}{n} \sum_{i=1}^{n} \delta_{z_i}$.
- Generate bootstrap datasets $S_1^*, \dots, S_B^*$, each of size $n$ drawn i.i.d. from $\hat P$ (i.e. sampled with replacement from $S$).
- Train a model on each dataset: $\hat f_{S_b^*}$ for $b = 1, \dots, B$.
- Return the epistemic uncertainty estimate:

$$\widehat{\mathrm{Epistemic}}(x) = \frac{1}{B} \sum_{b=1}^{B} \big(\hat f_{S_b^*}(x) - \bar f(x)\big)^2,$$

where $\bar f(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f_{S_b^*}(x)$.
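The steps above can be sketched end to end. This is a minimal illustration under stated assumptions: the base learner `fit_mean` is a toy stand-in (it predicts the mean target of its training sample), and the dataset is a placeholder.

```python
import random
import statistics

def fit_mean(sample):
    """Toy model: predicts the sample mean of y everywhere (stand-in for a real learner)."""
    m = statistics.mean(y for _, y in sample)
    return lambda x: m

def bootstrap_epistemic(train, x, fit, B=200, seed=0):
    """Bootstrap estimate of Epistemic(x) following the algorithm above."""
    rng = random.Random(seed)
    n = len(train)
    preds = []
    for _ in range(B):
        # Draw S_b* of size n i.i.d. from P-hat: sample with replacement from S
        boot = [train[rng.randrange(n)] for _ in range(n)]
        # Train a model on S_b* and record its prediction at x
        preds.append(fit(boot)(x))
    # f-bar(x): the ensemble mean, standing in for the unknown f*(x)
    f_bar = sum(preds) / B
    # Mean squared deviation of the B predictions around the ensemble mean
    return sum((p - f_bar) ** 2 for p in preds) / B

train = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.5), (3.0, 4.0)]
est = bootstrap_epistemic(train, x=1.5, fit=fit_mean)
```

Swapping `fit_mean` for a real learner (with the same `fit(sample) -> predictor` shape) gives the practical algorithm; the spread of the `preds` list is what the estimate summarizes.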