How do we find the best rules to split the samples?

Gini Index (Gini Impurity)

Used in CART (Classification And Regression Trees)

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the fraction of items labelled with class $i$ in the dataset.
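The Gini impurity can be computed directly from class counts; a minimal sketch in plain Python (function name and toy labels are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum_i p_i^2, where p_i is the fraction
    of samples labelled with class i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node (one class) has impurity 0; a 50/50 split gives 0.5.
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
```

CART evaluates candidate splits by the weighted Gini impurity of the resulting child nodes and picks the split that lowers it the most.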

Information Gain

Measures the reduction in entropy after splitting a dataset based on a feature.

$$IG(Y, X) = H(Y) - H(Y \mid X)$$

where:

  • $Y$: random variable representing the target (labels).
  • $X$: random variable representing a feature of the input sample.
  • $H(Y)$: entropy of $Y$.
  • $H(Y \mid X)$: conditional entropy of $Y$ given $X$.

Interpretation:

  • Information gain quantifies the improvement in classifying labels after using a feature to split the dataset.
  • The feature that maximizes information gain is chosen for the split.

Important: to prevent overfitting, we don’t want too many leaves, even if that means the leaf nodes don’t achieve 0 entropy. Approaches to guard against overfitting:

  • Early stopping: halt tree growth once a criterion is met (e.g. a maximum depth, a minimum number of samples per node, or a minimum gain from the best split).
  • Post-pruning: grow the full tree, then remove branches that do not improve performance on held-out data.
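Early stopping reduces to a check performed before each split; a minimal sketch (the threshold names are illustrative, not tied to any specific library):

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples_split=2, min_gain=1e-3):
    """Return True if the node should become a leaf instead of
    being split further (illustrative early-stopping criteria)."""
    return (depth >= max_depth            # tree is deep enough
            or n_samples < min_samples_split  # too few samples to split
            or best_gain < min_gain)      # best split barely helps

print(should_stop(depth=3, n_samples=10, best_gain=0.2))  # True (depth limit)
print(should_stop(depth=1, n_samples=10, best_gain=0.2))  # False (keep splitting)
```

Post-pruning instead grows the tree fully and then collapses subtrees whose removal does not hurt validation accuracy, which avoids stopping too early on splits that only pay off deeper in the tree.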