Question 1

Suppose we have a simplified language with:

  • Three vowels
  • Three consonants

Let X denote the vowel and Y the consonant within a syllable, each mapped to a range of three values. Suppose we estimated the joint PMF for vowel and consonant co-occurrences within syllables, as given below.

Calculate the following:

  • Entropies H(X) and H(Y)
  • Joint entropy H(X, Y)
  • Conditional entropies H(X|Y) and H(Y|X)
  • Mutual information I(X; Y)
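Since the joint PMF table is not reproduced above, the sketch below uses a placeholder 3×3 joint distribution; all five quantities then follow from the standard identities H(X|Y) = H(X,Y) − H(Y), H(Y|X) = H(X,Y) − H(X), and I(X;Y) = H(X) + H(Y) − H(X,Y).

```python
import math

# Placeholder 3x3 joint PMF P(X=i, Y=j); the actual table from the
# question is not reproduced here, so these numbers are illustrative only.
p_xy = [[0.10, 0.05, 0.05],
        [0.05, 0.20, 0.05],
        [0.05, 0.10, 0.35]]

def H(probs):
    """Shannon entropy in bits; ignores zero-probability cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]        # marginal of X (row sums)
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y (column sums)

H_x, H_y = H(p_x), H(p_y)
H_xy = H(p for row in p_xy for p in row)
H_x_given_y = H_xy - H_y    # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_y_given_x = H_xy - H_x    # chain rule: H(Y|X) = H(X,Y) - H(X)
I_xy = H_x + H_y - H_xy     # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={H_x:.4f}  H(Y)={H_y:.4f}  H(X,Y)={H_xy:.4f}")
print(f"H(X|Y)={H_x_given_y:.4f}  H(Y|X)={H_y_given_x:.4f}  I(X;Y)={I_xy:.4f}")
```

With the actual table substituted for `p_xy`, the same identities give the answers to all four parts.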

Question 2

As a data scientist in a telecommunications company, your task is to analyse a customer dataset to predict whether a customer will terminate his/her contract. The dataset consists of around 8000 customer records, each consisting of one binary dependent variable, indicating whether the customer terminates the contract or not, and 19 independent variables, which include the customer’s information (e.g., age, subscription plan, extra data plan) and consumer behaviour such as the average number of calls and hours per week. Since your boss needs some actionable insights to retain customers, you decided to use interpretable machine learning methods. Design your interpretable machine learning method by answering the following questions:

You have implemented a feature selection algorithm based on mutual information to select the most informative features from the 19 independent variables. To validate the implementation of your mutual information calculation function, you use a small subset of the data to calculate mutual information manually. You select one independent variable, subscription plan, denoted as S, which takes two values. Please use the following Probability Mass Function table:

Calculate:

  • Entropy H(S)
  • Entropy H(Y)
  • Joint entropy H(S, Y)
  • Conditional entropy H(S|Y)
  • Conditional entropy H(Y|S)
  • Mutual information I(S; Y)
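The PMF table for S and Y is not shown above, so the check below uses a placeholder 2×2 joint distribution. It validates a mutual information implementation the way the question suggests: computing I(S;Y) directly from its definition and confirming it matches the entropy identity I(S;Y) = H(S) + H(Y) − H(S,Y).

```python
import math

# Placeholder 2x2 joint PMF P(S=s, Y=y); the actual table from the
# question is not reproduced here.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_s = {s: sum(v for (si, _), v in p.items() if si == s) for s in (0, 1)}
p_y = {y: sum(v for (_, yi), v in p.items() if yi == y) for y in (0, 1)}

def H(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

# Definition: I(S;Y) = sum_{s,y} p(s,y) * log2( p(s,y) / (p(s) p(y)) )
mi_def = sum(v * math.log2(v / (p_s[s] * p_y[y]))
             for (s, y), v in p.items() if v > 0)

# Identity: I(S;Y) = H(S) + H(Y) - H(S,Y)
mi_id = H(p_s) + H(p_y) - H(p)

print(f"I(S;Y) by definition: {mi_def:.4f}, by identity: {mi_id:.4f}")
```

If the two values agree (and match the manual calculation on the real table), the implementation is consistent.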

After applying your algorithm, you selected two variables:

  • Extra data plan, a binary random variable that indicates whether the customer subscribes to the extra data plan or not.
  • Averaged hours used per week, a continuous random variable.

You then built a logistic regression model to classify customers into low risk or high risk of terminating the contract. The fitted model is:

  1. Given a customer x who has the extra data plan and spent on average 0.5 hours per week, calculate the odds and the probability that the customer will terminate the contract.
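The fitted coefficients are not reproduced in the text, so the sketch below uses made-up parameters (b0, b1, b2 are assumptions, not the question's model); it shows the generic recipe: the linear predictor gives the log-odds, exponentiating gives the odds, and odds/(1+odds) gives the probability.

```python
import math

# Hypothetical fitted logistic regression: log-odds = b0 + b1*extra + b2*hours.
# These coefficient values are placeholders, NOT the model from the question.
b0, b1, b2 = -1.0, 0.8, 0.5

extra_plan, hours = 1, 0.5                     # the customer in the question
log_odds = b0 + b1 * extra_plan + b2 * hours   # linear predictor
odds = math.exp(log_odds)                      # odds of terminating
prob = odds / (1 + odds)                       # = 1 / (1 + e^{-log_odds})

print(f"log-odds={log_odds:.2f}  odds={odds:.4f}  P(terminate)={prob:.4f}")
```

Plugging the actual fitted coefficients into `b0, b1, b2` yields the answer for part 1.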

Question 3

As a machine learning expert for an AI cybersecurity company, your task is to design an automated network intrusion detection system. You have collected a large number of records of network activities. Each record includes log information about the network activity, such as protocol type, duration, and number of failed logins, which are modelled as random variables. Each record also includes a binary random variable Y, called the label, that was labelled by cybersecurity experts as intrusion or normal connection. Answer the following questions about Feature Selection Based on Mutual Information.

  • Explain to your colleague, who knows nothing about information theory, the concept of mutual information. Ans - Mutual information measures the amount of information that one random variable provides about another. It quantifies how much knowing one variable reduces uncertainty about the other. High mutual information indicates strong dependence between variables, while low mutual information suggests independence.
  • Explain the loop in the pseudocode of Table 1.

Ans - The two lines in the loop select features. In each iteration, we find the feature fmax that achieves the maximum mutual information with the label among all the remaining independent variables in the candidate set. However, some features may be highly correlated with each other, so selecting them would increase the number of features without improving the prediction. Therefore, we make sure that there is minimal redundancy between the candidate feature and the set of already selected features; that is exactly what the second term on the RHS achieves in line 5. We then add this feature to the selected set, remove it from the candidate set, and repeat until we have selected the desired number of features.
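The loop described above can be sketched as follows. Since the exact pseudocode of Table 1 is not reproduced here, this uses one common instantiation of the line-5 criterion (the mRMR score: relevance minus average redundancy with already-selected features); the function name and inputs are illustrative.

```python
def mrmr_select(mi_with_label, mi_between, k):
    """Greedy minimum-redundancy maximum-relevance feature selection.

    mi_with_label[i]   : I(X_i; Y), relevance of feature i to the label
    mi_between[i][j]   : I(X_i; X_j), redundancy between features i and j
    k                  : number of features to select
    """
    remaining = set(range(len(mi_with_label)))   # candidate set
    selected = []                                # selected set

    while remaining and len(selected) < k:
        def score(f):
            relevance = mi_with_label[f]
            # Second term of the line-5 criterion: average redundancy
            # between candidate f and the already-selected features.
            redundancy = (sum(mi_between[f][s] for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy

        f_max = max(remaining, key=score)   # feature maximising the criterion
        selected.append(f_max)              # add to selected set
        remaining.remove(f_max)             # remove from candidate set
    return selected

# Toy example: features 0 and 1 are highly redundant (I = 0.85), so after
# picking 0 the algorithm prefers the less-redundant feature 2.
mi_y  = [0.9, 0.8, 0.1]
mi_ff = [[0.0, 0.85, 0.05],
         [0.85, 0.0, 0.02],
         [0.05, 0.02, 0.0]]
print(mrmr_select(mi_y, mi_ff, 2))   # [0, 2]
```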

Question 4

A company wants to classify whether a customer will purchase a product (Y = Yes) or not (Y = No) based on categorical features. You have access to the following dataset:

Age Group   | Income Level | Prior Purchase | Purchase (Y)
------------|--------------|----------------|-------------
Young       | Low          | No             | No
Young       | Low          | Yes            | No
Young       | High         | No             | Yes
Middle-aged | Low          | No             | No
Middle-aged | High         | No             | Yes
Senior      | Low          | No             | No
Senior      | High         | Yes            | Yes
Senior      | High         | No             | Yes
  • Calculate the entropy of the target variable Y.
  • Compute the information gain for splitting on the feature “Prior Purchase.”
    • For Prior Purchase = Yes: [No, Yes]
    • For Prior Purchase = No: [No, Yes, No, Yes, No, Yes]
  • If you were to build a decision tree, which feature would be the best root node? Justify your answer. The best root node should have the highest information gain. “Income Level” splits the data perfectly (all Low rows are No and all High rows are Yes), so it has the maximal IG and should be selected as the root node.
  • Discuss how overfitting can be avoided in decision trees and suggest techniques to improve generalization.
    • Pruning: Remove branches that add little classification value. Use a validation set to decide how much to prune.
    • Limiting depth: Restrict the maximum depth of the tree. Use a validation set to find when to stop growing the tree.
    • Applying ensemble methods: Use bagging or boosting.
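The entropy and information-gain answers above can be checked numerically. The script below encodes the dataset from the table and computes H(Y) and the information gain of each feature, using only the standard library; `rows` and the helper names are just this sketch's choices.

```python
import math
from collections import Counter

# Dataset from the question: (Age Group, Income Level, Prior Purchase, Y)
rows = [
    ("Young", "Low", "No", "No"),
    ("Young", "Low", "Yes", "No"),
    ("Young", "High", "No", "Yes"),
    ("Middle-aged", "Low", "No", "No"),
    ("Middle-aged", "High", "No", "Yes"),
    ("Senior", "Low", "No", "No"),
    ("Senior", "High", "Yes", "Yes"),
    ("Senior", "High", "No", "Yes"),
]

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature_idx):
    """IG = H(Y) minus the weighted entropy of each split subset."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in set(r[feature_idx] for r in rows):
        subset = [r[-1] for r in rows if r[feature_idx] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

print("H(Y) =", entropy([r[-1] for r in rows]))      # 1.0 (4 Yes, 4 No)
for name, idx in [("Age Group", 0), ("Income Level", 1), ("Prior Purchase", 2)]:
    print(f"IG({name}) = {info_gain(rows, idx):.4f}")
```

The output confirms the answers above: IG(Income Level) = 1 bit (a perfect split), IG(Prior Purchase) = 0 (both branches are a 50/50 label mix), and IG(Age Group) is small, so Income Level is the best root node.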