Training is the iterative process of adjusting parameters (weights and biases) to minimise a loss function. The choice of loss function depends on whether you are performing regression or classification.
Regression Loss Functions
Used when the target variable is a real-valued number.
Mean Squared Error (MSE)
- Formula: $E = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
- Purpose: When several samples give conflicting targets for the same input, minimising MSE drives the prediction towards their mean.
- Notes: It heavily penalises large errors because the difference is squared.
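As a minimal sketch of MSE in NumPy (function and variable names are illustrative, not from a specific library):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(mse(y_true, y_pred))  # (0.25 + 0.0 + 1.0) / 3
```

Note how the single error of 1.0 contributes far more than the error of 0.5, because each difference is squared before averaging.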
Mean Absolute Error (MAE)
- Formula: $E = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$
- Purpose: When several samples give conflicting targets for the same input, minimising MAE drives the prediction towards their median.
- Notes: It is more robust to outliers than MSE.
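The median property and the robustness to outliers can be seen in a small NumPy sketch (names are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute differences between targets and predictions
    return np.mean(np.abs(y_true - y_pred))

targets = np.array([1.0, 2.0, 100.0])  # 100.0 acts as an outlier
# A constant prediction at the median gives a lower MAE than one at the mean,
# because the outlier pulls the mean far away but barely shifts the median.
at_median = mae(targets, np.full(3, np.median(targets)))
at_mean = mae(targets, np.full(3, np.mean(targets)))
print(at_median, at_mean)
```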
Classification Loss Functions
Used when the target variable is a discrete label or category.
Binary Cross Entropy (BCE)
Used for Binary Classification (two classes, e.g., $y = 0$ or $y = 1$).
- Formula: $E = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\right]$
- Interpretation: The network output $\hat{y}_i$ is interpreted as the probability $P(y_i = 1)$.
- Logic:
- If the true label $y_i = 1$, only the first term remains.
- If the true label $y_i = 0$, only the second term remains.
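A minimal BCE sketch in NumPy (the clipping constant is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def bce(y_true, p):
    eps = 1e-12                      # clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# For y = 1 only the first term contributes: the loss is -ln(p)
print(bce(np.array([1.0]), np.array([0.9])))
# For y = 0 only the second term contributes: the loss is -ln(1 - p)
print(bce(np.array([0.0]), np.array([0.9])))
```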
Softmax Output Layer
Used in Multiclass Classification to turn network outputs (logits) into a probability distribution.
- Formula: $\hat{y}_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$
- Purpose: Ensure that all outputs are positive and that they sum to $1$.
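A small softmax sketch in NumPy (subtracting the maximum logit is a standard stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability; output unchanged
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # every entry positive, entries sum to 1
```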
Multiclass Cross Entropy (CE)
Used when there are multiple categories (e.g., the 10 digits of MNIST).
- Formula: $E = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{ik} \ln \hat{y}_{ik}$
- Context: Requires one-hot encoding for the targets ($y_{ik}$), meaning only one class index is non-zero (equal to 1) while the others are 0.
- Note: Because of one-hot encoding, only one term in the inner summation will be non-zero for each sample.
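The one-hot behaviour can be sketched for a single sample in NumPy (names are illustrative):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    eps = 1e-12  # guard against log(0)
    return -np.sum(y_onehot * np.log(p + eps))

y = np.array([0.0, 0.0, 1.0])   # one-hot target: true class is index 2
p = np.array([0.1, 0.2, 0.7])   # predicted probability distribution
# Only the term for the true class survives: the loss is -ln(0.7)
print(cross_entropy(y, p))
```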
Regularisation
Used to prevent overfitting by adding a penalty for complexity.
Weight Decay
- Formula: $E = E_0 + \lambda \sum_{w} w^2$
- Variables:
- $E_0$: The original loss (e.g., MSE or Cross Entropy)
- $\lambda$: A hyperparameter ($\lambda \geq 0$) that controls the strength of the penalty
- $\sum_{w} w^2$: The sum of the squares of all weights and biases in the network
- Goal: Keeps weights small to prevent the model from learning the training data "by heart"
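A toy sketch of adding the penalty to an existing loss (the function name and parameter arrays are made up for illustration):

```python
import numpy as np

def loss_with_weight_decay(base_loss, params, lam):
    # E = E0 + lambda * (sum of squared parameters)
    penalty = sum(np.sum(p ** 2) for p in params)
    return base_loss + lam * penalty

params = [np.array([1.0, 2.0]), np.array([3.0])]  # toy parameter arrays
print(loss_with_weight_decay(0.5, params, 0.1))   # 0.5 + 0.1 * (1 + 4 + 9)
```

Larger weights inflate the penalty quadratically, so the optimiser is pushed towards smaller parameter values.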
Parameter Calculation Formula
To determine the total parameters in a single layer:
- Formula: $\text{params} = (n_{\text{units}} \times n_{\text{inputs}}) + n_{\text{units}}$
- Logic: $n_{\text{units}} \times n_{\text{inputs}}$ represents the weights connecting each unit to the units in the previous layer; $n_{\text{units}}$ represents the biases, one for each unit in the current layer.
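The count can be computed directly (the 784-to-128 layer below is just an example, using the 28×28 MNIST input size mentioned above):

```python
def layer_params(n_inputs, n_units):
    # n_units * n_inputs weights plus one bias per unit
    return n_units * n_inputs + n_units

# A dense layer from 784 inputs (28x28 MNIST pixels) to 128 units:
print(layer_params(784, 128))  # 784 * 128 + 128 = 100480
```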