Suppose we have many training sets, each of size $n$, generated from the same unknown distribution. From each we estimate a separate $\hat{\theta}$.

Bias of an Estimator $\hat{\theta}$ :

  • Definition : Bias of $\hat{\theta}$ is $\operatorname{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$.
  • An estimator is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$ for all $\theta$.

Variance of an Estimator $\hat{\theta}$ :

  • Definition : Variance of $\hat{\theta}$ is $\operatorname{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$ (if $\hat{\theta}$ is a scalar variable) or $\mathbb{E}\big[\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\|^2\big]$ (if $\hat{\theta}$ is a vector variable).
  • Unlike bias, the variance does not directly depend on the true parameter $\theta$.
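The definitions above can be sketched with a small Monte Carlo experiment: draw many training sets, compute one estimate per set, and average. The true $\theta$, sample size, and trial count below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3              # true parameter (assumed for illustration)
n, trials = 50, 100_000  # size of each training set, number of training sets

# Draw many training sets; compute the sample-mean estimator on each.
samples = rng.binomial(1, theta, size=(trials, n))
theta_hat = samples.mean(axis=1)

bias = theta_hat.mean() - theta    # approximates E[theta_hat] - theta
variance = theta_hat.var()         # approximates E[(theta_hat - E[theta_hat])^2]
print(f"bias={bias:.4f}, variance={variance:.5f}")
```

The variance should come out near $\theta(1-\theta)/n = 0.0042$, while the bias is near zero, since the sample mean is unbiased.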

Bias-Variance decomposition of the Mean Squared Error

We look at the expected squared error (expectation over the distribution that generated the training sets) from the true parameter value $\theta$ :

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]$$

Add & subtract $\mathbb{E}[\hat{\theta}]$ to complete the square :

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2$$

(the cross term vanishes because $\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] = 0$). Rearrange to conclude :

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2$$
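The decomposition can be checked numerically. A minimal sketch, using an add-one-smoothed (and therefore deliberately biased) estimator with assumed parameter values, verifying that the empirical MSE equals empirical variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 20, 200_000

# A deliberately biased estimator: add-one smoothing of the sample mean.
samples = rng.binomial(1, theta, size=(trials, n))
theta_hat = (samples.sum(axis=1) + 1) / (n + 2)

mse = np.mean((theta_hat - theta) ** 2)
var = theta_hat.var()
bias_sq = (theta_hat.mean() - theta) ** 2
print(mse, var + bias_sq)  # the two agree: the identity holds exactly for empirical moments
```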

MLE and MAP estimators

The Maximum Likelihood Estimator for $\theta$ is :

$$\hat{\theta}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{n_1}{n}$$

The MLE is a sample mean of Bernoulli trials, where $n_1$ denotes the number of successes. The Maximum a Posteriori (MAP) Estimator for $\theta$ is :

$$\hat{\theta}_{\text{MAP}} = \frac{n_1 + \alpha - 1}{n + \alpha + \beta - 2}$$

The MAP is the maximiser of the posterior distribution of $\theta$ when the prior distribution is $\operatorname{Beta}(\alpha, \beta)$. Note : Bias-variance is a frequentist concept. While MAP (and the Bayesian posterior mean) derive from the Bayesian framework, all estimators can be analysed in the frequentist framework.
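A short sketch computing both estimators on one simulated training set. The true $\theta$, sample size, and Beta hyperparameters below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.7, 10
x = rng.binomial(1, theta, size=n)
n1 = int(x.sum())              # number of successes

alpha, beta = 5.0, 5.0         # Beta(alpha, beta) prior (assumed values)

theta_mle = n1 / n
theta_map = (n1 + alpha - 1) / (n + alpha + beta - 2)
print(theta_mle, theta_map)    # MAP is pulled from the MLE toward 0.5
```

With this symmetric Beta(5, 5) prior, the MAP estimate always lies between the MLE and 0.5, whatever the observed $n_1$.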

Bias of the Estimators :

Bias is defined as :

$$\operatorname{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$

Bias of MLE :

$$\mathbb{E}[\hat{\theta}_{\text{MLE}}] = \frac{\mathbb{E}[n_1]}{n} = \frac{n\theta}{n} = \theta \quad\Rightarrow\quad \operatorname{bias}(\hat{\theta}_{\text{MLE}}) = 0$$

Thus, the MLE is unbiased. Bias of MAP :

$$\operatorname{bias}(\hat{\theta}_{\text{MAP}}) = \frac{n\theta + \alpha - 1}{n + \alpha + \beta - 2} - \theta = \frac{(\alpha - 1) - \theta(\alpha + \beta - 2)}{n + \alpha + \beta - 2}$$

Thus, the MAP estimator is biased towards the prior, especially for small $n$, but becomes unbiased as $n$ grows large.
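The vanishing of the MAP bias as $n$ grows can be seen in simulation. A minimal sketch, with the true $\theta$ and prior hyperparameters assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, alpha, beta = 0.8, 3.0, 3.0   # assumed for illustration
trials = 200_000

for n in (5, 50, 500):
    # n1 ~ Binomial(n, theta): the sufficient statistic of each training set.
    n1 = rng.binomial(n, theta, size=trials)
    bias_mle = (n1 / n).mean() - theta
    bias_map = ((n1 + alpha - 1) / (n + alpha + beta - 2)).mean() - theta
    print(n, round(bias_mle, 4), round(bias_map, 4))
# The MAP bias shrinks in magnitude as n grows; the MLE bias stays near zero.
```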

Variance of the Estimators :

Variance is given by:

$$\operatorname{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$$

Variance of MLE :

$$\operatorname{Var}(\hat{\theta}_{\text{MLE}}) = \frac{\operatorname{Var}(n_1)}{n^2} = \frac{n\theta(1-\theta)}{n^2} = \frac{\theta(1-\theta)}{n}$$

Variance of MAP Estimator :

$$\operatorname{Var}(\hat{\theta}_{\text{MAP}}) = \frac{\operatorname{Var}(n_1)}{(n + \alpha + \beta - 2)^2} = \frac{n\theta(1-\theta)}{(n + \alpha + \beta - 2)^2}$$

This can be made smaller than the MLE's variance by an informative prior ($\alpha, \beta$ away from 1). Notice the trade-off with the bias. For large $n$, this becomes approximately $\theta(1-\theta)/n$, similar to the MLE, but with a denominator that includes the prior information ($\alpha + \beta - 2$).
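The variance reduction from an informative prior shows up directly in simulation. A sketch with assumed values, chosen so the prior is fairly strong relative to $n$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, alpha, beta = 0.5, 10.0, 10.0   # fairly informative prior (assumed values)
n, trials = 20, 200_000

n1 = rng.binomial(n, theta, size=trials)
var_mle = (n1 / n).var()
var_map = ((n1 + alpha - 1) / (n + alpha + beta - 2)).var()
print(var_mle, var_map)  # MAP variance is smaller: same Var(n1), larger denominator
```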

Implications of Bias-Variance Analysis

  • We want estimators that have low bias and low variance, but both are not achievable simultaneously with a finite sample, and there are trade-offs.
  • Bias-variance properties of estimators can guide the choice of estimator to use. The MLE has low bias if $n$ is sufficiently large, but it has high variance.
  • High bias, low variance estimators (e.g., regularization methods - also interpretable as MAP estimators) improve stability and generalization.
  • Bayesian estimators incorporate prior information, introducing bias but reducing variance. The Bayesian posterior mean is often biased but achieves lower MSE than frequentist estimators.
  • The best choice depends on sample size, prior knowledge, and application needs.
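The points above can be seen end to end by comparing MSEs. A sketch (prior and $\theta$ values assumed) in which the prior happens to be centred on the truth, so the MAP's extra bias is free and its lower variance wins at small $n$:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, alpha, beta = 0.5, 5.0, 5.0   # prior centred on the true value (assumed)
trials = 200_000

for n in (5, 500):
    n1 = rng.binomial(n, theta, size=trials)
    mse_mle = np.mean((n1 / n - theta) ** 2)
    mse_map = np.mean(((n1 + alpha - 1) / (n + alpha + beta - 2) - theta) ** 2)
    print(n, round(mse_mle, 5), round(mse_map, 5))
# The MAP's MSE advantage is large at n=5 and nearly gone at n=500.
```

A prior centred far from the true $\theta$ would instead add bias that dominates at small $n$, which is the sample-size and prior-knowledge dependence the last bullet describes.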