Training is the iterative process of adjusting parameters (weights and biases) to minimise a loss function. The choice of loss function depends on whether you are performing regression or classification.
Regression Loss Functions
Used when the target variable is a real-valued number.
Mean Squared Error (MSE)
- Formula: $E = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
- Purpose: When several samples give conflicting targets for the same input, minimising MSE drives the prediction towards their mean.
- Notes: It heavily penalises large errors because the difference is squared.
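As a minimal sketch of MSE in NumPy (function and variable names are illustrative, not from a specific library):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(mse(y_true, y_pred))  # (0.25 + 0.0 + 1.0) / 3
```

Note how the single error of 1.0 contributes far more than the error of 0.5, because each difference is squared before averaging.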
Mean Absolute Error (MAE)
- Formula: $E = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$
- Purpose: When several samples give conflicting targets for the same input, minimising MAE drives the prediction towards their median.
- Notes: It is more robust to outliers than MSE.
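The median property and the robustness to outliers can be seen in a small NumPy sketch (names are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute differences between targets and predictions
    return np.mean(np.abs(y_true - y_pred))

targets = np.array([1.0, 2.0, 100.0])  # 100.0 acts as an outlier
# A constant prediction at the median gives a lower MAE than one at the mean,
# because the outlier pulls the mean far away but barely shifts the median.
at_median = mae(targets, np.full(3, np.median(targets)))
at_mean = mae(targets, np.full(3, np.mean(targets)))
print(at_median, at_mean)
```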
Classification Loss Functions
Used when the target variable is a discrete label or category.
Binary Cross Entropy (BCE)
Used for Binary Classification (two classes, e.g., $y = 0$ or $y = 1$).
- Formula: $E = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\right]$
- Interpretation: The network output $\hat{y}_i$ is interpreted as the probability $P(y_i = 1)$.
- Logic:
- If the true label $y_i = 1$, only the first term remains.
- If the true label $y_i = 0$, only the second term remains.
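A minimal BCE sketch in NumPy (the clipping constant is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def bce(y_true, p):
    eps = 1e-12                      # clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# For y = 1 only the first term contributes: the loss is -ln(p)
print(bce(np.array([1.0]), np.array([0.9])))
# For y = 0 only the second term contributes: the loss is -ln(1 - p)
print(bce(np.array([0.0]), np.array([0.9])))
```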
Softmax Output Layer
Used in Multiclass Classification to turn network outputs (logits) into a probability distribution.
- Formula: $\hat{y}_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$
- Purpose: Ensure that all outputs are positive and that they sum to $1$.
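A small softmax sketch in NumPy (subtracting the maximum logit is a standard stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability; output unchanged
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # every entry positive, entries sum to 1
```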
Multiclass Cross Entropy (CE)
Used when there are multiple categories (e.g., the 10 digits of MNIST).
- Formula: $E = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{ik} \ln \hat{y}_{ik}$
- Context: Requires one-hot encoding for the targets ($y_{ik}$), meaning only one class index is non-zero (equal to 1) while the others are 0.
- Note: Because of one-hot encoding, only one term in the inner summation will be non-zero for each sample.
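The one-hot behaviour can be sketched for a single sample in NumPy (names are illustrative):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    eps = 1e-12  # guard against log(0)
    return -np.sum(y_onehot * np.log(p + eps))

y = np.array([0.0, 0.0, 1.0])   # one-hot target: true class is index 2
p = np.array([0.1, 0.2, 0.7])   # predicted probability distribution
# Only the term for the true class survives: the loss is -ln(0.7)
print(cross_entropy(y, p))
```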
Regularisation
Used to prevent overfitting by adding a penalty for complexity.
Weight Decay
- Formula: $E = E_0 + \lambda \sum_{w} w^2$
- Variables:
- $E_0$: The original loss (e.g., MSE or Cross Entropy)
- $\lambda$: A hyperparameter ($\lambda \geq 0$) that controls the strength of the penalty
- $\sum_{w} w^2$: The sum of the squares of all weights and biases in the network
- Goal: Keeps weights small to prevent the model from learning the training data "by heart"
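A toy sketch of adding the penalty to an existing loss (the function name and parameter arrays are made up for illustration):

```python
import numpy as np

def loss_with_weight_decay(base_loss, params, lam):
    # E = E0 + lambda * (sum of squared parameters)
    penalty = sum(np.sum(p ** 2) for p in params)
    return base_loss + lam * penalty

params = [np.array([1.0, 2.0]), np.array([3.0])]  # toy parameter arrays
print(loss_with_weight_decay(0.5, params, 0.1))   # 0.5 + 0.1 * (1 + 4 + 9)
```

Larger weights inflate the penalty quadratically, so the optimiser is pushed towards smaller parameter values.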
Parameter Calculation Formula
To determine the total parameters in a single layer:
- Formula: $\text{params} = (n_{\text{units}} \times n_{\text{inputs}}) + n_{\text{units}}$
- Logic: $n_{\text{units}} \times n_{\text{inputs}}$ represents the weights connecting each unit to the units in the previous layer; $n_{\text{units}}$ represents the biases, one for each unit in the current layer.
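The count can be computed directly (the 784-to-128 layer below is just an example, using the 28×28 MNIST input size mentioned above):

```python
def layer_params(n_inputs, n_units):
    # n_units * n_inputs weights plus one bias per unit
    return n_units * n_inputs + n_units

# A dense layer from 784 inputs (28x28 MNIST pixels) to 128 units:
print(layer_params(784, 128))  # 784 * 128 + 128 = 100480
```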