Training is the iterative process of adjusting parameters (weights and biases) to minimise a Loss Score. The choice of loss function depends on whether you are performing regression or classification.

Regression Loss Functions

Used when the target variable is a real-valued number.

Mean Squared Error (MSE)

  • Formula: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
  • Purpose: Finds a compromise between conflicting targets: the constant prediction that minimises MSE is their mean.
  • Notes: It heavily penalises large errors because the difference is squared.
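A minimal NumPy sketch (not from the notes) illustrating the squaring effect: one error of 3 costs more than three errors of 1.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# A single error of 3 contributes 9 to the sum,
# while three errors of 1 contribute only 3 in total.
print(mse([0.0, 0.0], [3.0, 0.0]))        # 4.5
print(mse([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))  # 1.0
```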

Mean Absolute Error (MAE)

  • Formula: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$
  • Purpose: The constant prediction that minimises MAE is the median of conflicting targets.
  • Notes: It is more robust to outliers than MSE.
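A small sketch (grid search, not from the notes) showing why MAE is robust: over the targets [1, 2, 100], the best constant prediction under MSE is the mean (pulled toward the outlier), while under MAE it is the median.

```python
import numpy as np

targets = np.array([1.0, 2.0, 100.0])
candidates = np.linspace(0, 110, 1101)  # step 0.1

# Evaluate each constant prediction c under both losses.
best_mse = candidates[np.argmin([np.mean((targets - c) ** 2) for c in candidates])]
best_mae = candidates[np.argmin([np.mean(np.abs(targets - c)) for c in candidates])]

print(best_mse)  # ≈ 34.3 (the mean, dragged up by the outlier 100)
print(best_mae)  # ≈ 2.0  (the median, unaffected by the outlier)
```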

Classification Loss Functions

Used when the target variable is a discrete label or category.

Binary Cross Entropy (BCE)

Used for Binary Classification (two classes, e.g., $y = 0$ or $y = 1$).

  • Formula: $\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$
  • Interpretation: The output $\hat{y}_i$ is interpreted as the probability $P(y_i = 1 \mid x_i)$.
  • Logic:
    • If the true label $y_i = 1$, only the first term remains.
    • If the true label $y_i = 0$, only the second term remains.
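A single-sample sketch (not from the notes, using the standard BCE formula) showing that a confident correct prediction is cheap while a confident wrong one is heavily punished.

```python
import math

def bce(y_true, p):
    """Binary cross entropy for one sample; p is the predicted P(y=1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Confident and correct -> small loss
print(bce(1, 0.99))  # ~0.01
# Confident and wrong -> large loss
print(bce(1, 0.01))  # ~4.61
```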

Softmax Output Layer

Used in Multiclass Classification to turn network outputs (logits) into a probability distribution.

  • Formula: $\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$
  • Purpose: Ensures that all outputs are positive and that they sum to $1$.
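A minimal sketch (not from the notes) of the softmax formula, using the common max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.asarray(z, float)
    e = np.exp(z - z.max())  # shifting by a constant leaves the result unchanged
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # all entries positive, largest logit gets largest probability
print(probs.sum())  # 1.0
```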

Multiclass Cross Entropy (CE)

Used when there are multiple categories (e.g., the 10 digits of MNIST).

  • Formula: $\mathrm{CE} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$
  • Context: Requires one-hot encoding for the targets ($y_{i,c}$), meaning only one class index is non-zero (equal to 1) while the others are 0.
  • Note: Because of one-hot encoding, only one term in the inner summation is non-zero for each sample.
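A one-sample sketch (not from the notes) showing that with a one-hot target only the true class's probability enters the loss:

```python
import numpy as np

def cross_entropy(y_onehot, probs):
    """Multiclass cross entropy for one sample with a one-hot target."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(y_onehot * np.log(np.asarray(probs, float) + eps))

# True class is index 2 of 4; only that index contributes to the sum.
y = np.array([0, 0, 1, 0])
p = np.array([0.1, 0.2, 0.6, 0.1])
print(cross_entropy(y, p))  # -log(0.6) ≈ 0.511
```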

Regularisation

Used to prevent overfitting by adding a penalty for complexity.

Weight Decay

  • Formula: $L' = L + \lambda \sum_{w} w^2$
  • Variables:
    • $L$: The original loss (e.g., MSE or Cross Entropy)
    • $\lambda$: A hyperparameter ($\lambda \geq 0$) that controls the strength of the penalty
    • $\sum_{w} w^2$: The sum of the squares of all weights and biases in the network
  • Goal: Keeps weights small to prevent the model from learning the training data "by heart"
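A minimal sketch (not from the notes; the layer shapes and $\lambda$ value are illustrative) of adding the weight-decay penalty to a base loss:

```python
import numpy as np

def loss_with_weight_decay(base_loss, params, lam):
    """L' = L + lambda * sum of squared parameters."""
    penalty = sum(np.sum(p ** 2) for p in params)
    return base_loss + lam * penalty

# Hypothetical tiny layer: a 2x2 weight matrix and a bias vector.
W1 = np.array([[0.5, -0.5], [1.0, 0.0]])
b1 = np.array([0.1, -0.1])
# Sum of squares = 1.5 + 0.02 = 1.52, so L' = 0.25 + 0.01 * 1.52
print(loss_with_weight_decay(0.25, [W1, b1], lam=0.01))  # ≈ 0.2652
```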

Parameter Calculation Formula

To determine the total number of parameters in a single layer:

  • Formula: $\text{params} = (n_{\text{prev}} \times n_{\text{curr}}) + n_{\text{curr}}$
  • Logic: $n_{\text{prev}} \times n_{\text{curr}}$ represents the weights connecting units in the current layer to units in the previous layer; $n_{\text{curr}}$ represents the biases, one for each unit in the current layer.
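A one-line sketch of the count (the 784 → 128 layer is an illustrative choice, matching MNIST's 28×28 = 784 inputs):

```python
def layer_params(n_prev, n_curr):
    """Weights (n_prev * n_curr) plus one bias per unit in the current layer."""
    return n_prev * n_curr + n_curr

# A dense layer from 784 inputs to 128 hidden units:
print(layer_params(784, 128))  # 784*128 + 128 = 100480
```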