How do we find the best rules to split the samples?

Gini Index (Gini Impurity)

Used in CART (Classification And Regression Trees)

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the fraction of items labelled with class $i$ in the dataset.
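The Gini impurity can be computed directly from class counts; a minimal sketch in plain Python (function name and toy labels are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum_i p_i^2, where p_i is the fraction
    of samples labelled with class i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node (one class) has impurity 0; a 50/50 split gives 0.5.
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
```

CART evaluates candidate splits by the weighted Gini impurity of the resulting child nodes and picks the split that lowers it the most.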

Information Gain

Measures the reduction in entropy after splitting a dataset based on a feature.

$$IG(Y, X) = H(Y) - H(Y \mid X)$$

where:

  • $Y$: random variable representing the target (labels).
  • $X$: random variable representing a feature of the input sample.
  • $H(Y)$: entropy of $Y$.
  • $H(Y \mid X)$: conditional entropy of $Y$ given $X$.

Interpretation:

  • Information gain quantifies the improvement in classifying labels after using a feature to split the dataset.
  • The feature that maximizes information gain is chosen for the split.

Important: to prevent overfitting, we don’t want too many leaves, even if that means the leaf nodes don’t achieve 0 entropy. Approaches to guard against overfitting:

  • Early stopping: halt tree growth once a criterion is met (e.g. a maximum depth, a minimum number of samples per node, or a minimum gain from the best split).
  • Post-pruning: grow the full tree, then remove branches that do not improve performance on held-out data.
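Early stopping reduces to a check performed before each split; a minimal sketch (the threshold names are illustrative, not tied to any specific library):

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples_split=2, min_gain=1e-3):
    """Return True if the node should become a leaf instead of
    being split further (illustrative early-stopping criteria)."""
    return (depth >= max_depth            # tree is deep enough
            or n_samples < min_samples_split  # too few samples to split
            or best_gain < min_gain)      # best split barely helps

print(should_stop(depth=3, n_samples=10, best_gain=0.2))  # True (depth limit)
print(should_stop(depth=1, n_samples=10, best_gain=0.2))  # False (keep splitting)
```

Post-pruning instead grows the tree fully and then collapses subtrees whose removal does not hurt validation accuracy, which avoids stopping too early on splits that only pay off deeper in the tree.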