Learning Rate Scheduling
Start with a large learning rate and reduce it during training (e.g., every few epochs, or when the validation loss plateaus).
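A minimal sketch of the step-decay variant described above (the drop factor and interval are illustrative choices, not prescribed values):

```python
def step_decay(initial_lr, epoch, drop=0.5, every=10):
    """Reduce the learning rate by `drop` every `every` epochs."""
    return initial_lr * (drop ** (epoch // every))

# Example: starting at 0.1, the rate halves at epochs 10, 20, ...
lr_epoch_0 = step_decay(0.1, epoch=0)    # 0.1
lr_epoch_10 = step_decay(0.1, epoch=10)  # 0.05
lr_epoch_25 = step_decay(0.1, epoch=25)  # 0.025
```

The plateau-based alternative instead watches the validation loss and drops the rate only when the loss stops improving for a set number of epochs.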
Importance of Normalisation
Networks train best when the distributions of their inputs (and intermediate activations) stay consistent.
Input Normalisation Techniques
- Feature Scaling: dividing pixel values by 255 to map them into the range [0, 1].
- Min-Max Norm: scaling data to [0, 1] using its minimum and maximum values: x' = (x − min) / (max − min).
- Z-Score Norm: x' = (x − μ) / σ. This centres the data at mean 0 with a standard deviation of 1.
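The three techniques above can be sketched in a few lines of NumPy (the sample values are illustrative):

```python
import numpy as np

x = np.array([0.0, 51.0, 102.0, 204.0, 255.0])  # example pixel values

# Feature scaling: divide by 255 -> values in [0, 1]
scaled = x / 255.0

# Min-max normalisation: map min -> 0 and max -> 1
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalisation: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()
```

Note that min-max normalisation is sensitive to outliers (a single extreme value compresses everything else), while z-score normalisation is comparatively robust.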
Batch Normalisation (BN):
- Concept: normalising the output of each neuron or filter across the examples in a mini-batch.
- Learnable Parameters: BN introduces γ (scaling) and β (shifting), which the network learns in order to find the optimal distribution.
- Test-Time Difficulty: during testing there is no "batch" to compute statistics over.
- Solution: maintain running estimates of the mean and standard deviation during training, and use those fixed values at test time.
- Efficacy: BN significantly speeds up training and improves validation accuracy.
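A minimal NumPy sketch of BN's forward pass, showing the learnable γ/β parameters and the running statistics that resolve the test-time difficulty (class name and the momentum value are illustrative assumptions):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)   # learnable scale (gamma)
        self.beta = np.zeros(dim)   # learnable shift (beta)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            # Statistics computed over the batch dimension
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Running averages, used later at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # No batch at test time: fall back on the precomputed statistics
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(3)
batch = np.random.randn(32, 3) * 5.0 + 2.0
out = bn.forward(batch, training=True)   # per-feature mean ~0, std ~1
```

With γ = 1 and β = 0 (their initial values), the output is simply the standardised activations; as training updates γ and β, the network can recover any scale and shift it finds useful.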